CivicActions / edscrapers

US Department of Education Data Scraping Kit; see https://us-ed-scraping.ckan.io/dataset
GNU Affero General Public License v3.0

Improve Deduplication of Datasets #107

Closed osahon-okungbowa closed 4 years ago

osahon-okungbowa commented 4 years ago

SITUATION

Based on dataset results from previous scraping runs, duplicate datasets are still being scraped and subsequently harvested into CKAN. An improved deduplication process should remove these duplicates.
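One possible approach, as a minimal sketch only: treat two datasets as duplicates when their source URLs match after normalization, and keep the first one seen. The `source_url` key, the `normalize_url` and `deduplicate` helpers, and the example URLs are assumptions made for illustration; they are not taken from the edscrapers codebase, and the real duplicates may need a different key (for example title plus publisher).

```python
from typing import Dict, List
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical form of a URL, used only for comparison."""
    parts = urlsplit(url.strip())
    # Lowercase scheme and host, drop the fragment, trim a trailing slash.
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def deduplicate(datasets: List[Dict]) -> List[Dict]:
    """Keep the first dataset seen for each normalized source URL.

    Assumes every dataset dict has a 'source_url' key (hypothetical name).
    """
    seen = set()
    unique = []
    for dataset in datasets:
        key = normalize_url(dataset["source_url"])
        if key not in seen:
            seen.add(key)
            unique.append(dataset)
    return unique


# Hypothetical sample records, not real scraper output.
sample = [
    {"source_url": "https://example.org/data/report/", "title": "A"},
    {"source_url": "HTTPS://EXAMPLE.ORG/data/report#top", "title": "A copy"},
    {"source_url": "https://example.org/data/other", "title": "B"},
]
result = deduplicate(sample)
```

Keeping the first occurrence preserves scrape order, which makes the result easy to compare against an earlier run. Normalization is deliberately conservative here: it leaves the query string untouched, since different query parameters may point to genuinely different datasets.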

TASKS

ACCEPTANCE CRITERIA

Current sample of dataset duplication from source URL

osahon-okungbowa commented 4 years ago

Estimate: ~6 hrs to come up with a foolproof solution and test it.