SITUATION
Based on investigation and client feedback from previous scraping runs, some of the datasets being scraped need to be treated/sanitised.
The sanitising methodology may vary with the nature of the dataset, e.g. datasets with 'photo' in the title are to be removed; datasets with 'conference' in the title are to be tagged as 'private' and also set as private, etc.
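The two example rules above could be sketched roughly as follows. This is only an illustration, not the code that was shipped: the field names `title`, `tags` and `private`, the function name `sanitise`, and the case-insensitive substring match are all assumptions.

```python
from typing import Optional


def sanitise(dataset: dict) -> Optional[dict]:
    """Return a cleaned copy of the dataset, or None if it should be removed.

    Assumed (hypothetical) fields: "title" (str), "tags" (list), "private" (bool).
    """
    # Case-insensitive substring match; this matching strategy is an assumption.
    title = (dataset.get("title") or "").lower()

    # Rule 1: datasets with 'photo' in the title are removed.
    if "photo" in title:
        return None

    cleaned = dict(dataset)  # shallow copy so the input is not mutated

    # Rule 2: datasets with 'conference' in the title are tagged and set private.
    if "conference" in title:
        tags = list(cleaned.get("tags") or [])
        if "private" not in tags:
            tags.append("private")
        cleaned["tags"] = tags
        cleaned["private"] = True

    return cleaned
```

Keeping each rule as a small, separate check makes it easier to add further rules as more client feedback comes in.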
TASKS
[x] Based on investigation and client feedback, identify steps that can be used to sanitise the affected datasets
[x] Translate the identified steps into a usable algorithm and code
[x] Integrate the code into the already established Scrapy process with minimal/no alteration to the current process
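One low-impact way to do the integration is a Scrapy item pipeline, since that leaves the existing spiders untouched. The sketch below assumes that approach; the class name `SanitisePipeline`, the item fields and the module path in the settings comment are hypothetical, and the fallback `DropItem` class exists only so the example runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so this sketch runs without Scrapy installed
    class DropItem(Exception):
        pass


class SanitisePipeline:
    """Hypothetical pipeline applying the sanitising rules to each scraped item."""

    def process_item(self, item, spider):
        # Assumed (hypothetical) item fields: "title", "tags", "private".
        title = (item.get("title") or "").lower()

        # Datasets with 'photo' in the title are removed from the output.
        if "photo" in title:
            raise DropItem(f"removed by sanitiser: {item.get('title')!r}")

        # Datasets with 'conference' in the title are tagged and set private.
        if "conference" in title:
            tags = list(item.get("tags") or [])
            if "private" not in tags:
                tags.append("private")
            item["tags"] = tags
            item["private"] = True

        return item


# Enabled in settings.py (module path is an assumption):
# ITEM_PIPELINES = {"myproject.pipelines.SanitisePipeline": 300}
```

With this design the sanitising step can be switched on or off from the settings alone, without changing the spiders.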
ACCEPTANCE CRITERIA
[x] Datasets are successfully sanitised based on the agreed feedback
[x] The sanitising process integration introduces little or no disruption to the established Scrapy process, i.e. integration is smooth/flexible
PROBLEM LIST FOR SANITISING
handled all sanitising tasks listed here AND more here