CivicActions / edscrapers

US Department of Education Data Scraping Kit; see https://us-ed-scraping.ckan.io/dataset
GNU Affero General Public License v3.0

Data profiles with only TXT resources should not be harvested #192

Closed by higorspinto 4 years ago

higorspinto commented 4 years ago

Example: https://us-ed-testing.ckan.io/dataset/the-education-innovator-december-7-2010

We need to avoid harvesting data profiles that only have TXT files as resources, as they are very likely to be false positives.

We can deal with this in the sanitize transformer, rather than in the parser / Scrapy pipeline.
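
A minimal sketch of what such a check might look like, not the actual edscrapers transformer code. It assumes each data profile is a dict with a `resources` list, and that each resource carries a `format` and/or `url` key. The function names `resource_format`, `is_txt_only`, and `drop_txt_only` are hypothetical.

```python
import os
from typing import Iterable, List, Mapping
from urllib.parse import urlparse


def resource_format(resource: Mapping) -> str:
    """Return a lowercase format for a resource (assumed dict layout).

    Prefers an explicit 'format' field and falls back to the URL's
    file extension when that field is missing or empty.
    """
    fmt = (resource.get("format") or "").strip().lower()
    if fmt:
        return fmt
    path = urlparse(resource.get("url") or "").path
    return os.path.splitext(path)[1].lstrip(".").lower()


def is_txt_only(dataset: Mapping) -> bool:
    """True when the dataset has resources and every one of them is TXT."""
    resources = dataset.get("resources") or []
    return bool(resources) and all(resource_format(r) == "txt" for r in resources)


def drop_txt_only(datasets: Iterable[Mapping]) -> List[Mapping]:
    """Keep only the datasets that are not made up solely of TXT resources."""
    return [d for d in datasets if not is_txt_only(d)]
```

In this sketch, a profile with no resources at all is not treated as TXT-only. Whether such profiles should also be skipped is a separate decision.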

Jira Card

Acceptance Criteria

higorspinto commented 4 years ago

time spent: 4h