SITUATION
The output of the scraping process currently includes some resources that satisfy the criteria for resource generation. However, on cursory inspection, these resources are NOT actual dataset resources (e.g. a site map saved as an .xlsx file, or an image file stored in a zip archive). We refer to these as false positives.
TASKS
[x] Identify or create an effective methodology for identifying false positives in the scraping output
[x] Write a script that implements this methodology
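
As a rough illustration of what such a script might check, here is a minimal sketch covering only the two examples named above (a site map saved as a spreadsheet, and a zip archive holding nothing but images). The function names, the `IMAGE_EXTENSIONS` set, and the `SITEMAP_HINTS` tuple are all hypothetical and not taken from the project's code; the real methodology may rely on different signals.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Hypothetical heuristics for illustration; tune against real scraping output.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".svg"}
SITEMAP_HINTS = ("sitemap", "site-map", "site_map")


def zip_contains_only_images(data: bytes) -> bool:
    """Return True if `data` is a zip archive whose files are all images."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # Skip directory entries, which end with a slash.
            names = [n for n in archive.namelist() if not n.endswith("/")]
    except zipfile.BadZipFile:
        return False
    return bool(names) and all(
        PurePosixPath(n).suffix.lower() in IMAGE_EXTENSIONS for n in names
    )


def is_false_positive(url: str, data: bytes = b"") -> bool:
    """Flag a scraped resource that is unlikely to be a real dataset."""
    # Drop any query string, then keep only the file name portion.
    name = PurePosixPath(url.split("?", 1)[0]).name.lower()
    if any(hint in name for hint in SITEMAP_HINTS):
        return True
    if name.endswith(".zip") and data:
        return zip_contains_only_images(data)
    return False
```

One possible way to connect this to Scrapy is to call `is_false_positive` from an item pipeline's `process_item` method and raise `scrapy.exceptions.DropItem` for flagged resources, assuming each item carries the resource URL and, where available, its downloaded bytes.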
ACCEPTANCE CRITERIA
[x] Script must interface with Scrapy
[x] Validate that false positives are removed from the Scrapy results
RELATED TO:
#109, #118. Close this issue when the related issues are closed.