SITUATION
Results from previous scraping runs show that duplicate datasets are still being scraped and subsequently harvested into CKAN. These duplicates should be removed by an improved deduplication process.
TASKS
[x] inspect the source_url of sample duplicate datasets to identify why they escape the current deduplication process
[x] based on that inspection of source_url, identify the best method(s) for trapping these duplicates
[x] translate the identified solution into flexible, reusable code that can be integrated into the current deduplication transformer with little change
ACCEPTANCE CRITERIA
[x] code is easily integrated into the current deduplication transformer with little change
[x] more duplicate datasets are caught and removed by the improved deduplication
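A common reason duplicates slip past URL-based deduplication is that the same dataset is reachable through trivially different source_url variants (scheme, host case, trailing slash, query-parameter order, tracking parameters). The sketch below shows one way to canonicalize source_url before comparison; the specific normalization rules and the function names are assumptions for illustration, not the transformer's actual API.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_source_url(url: str) -> str:
    """Canonicalize a source URL so trivially different variants
    map to the same deduplication key (assumed rules, adjust to
    the variants actually observed in the sample duplicates)."""
    parts = urlsplit(url.strip())
    scheme = "https"                       # treat http/https as equivalent
    netloc = parts.netloc.lower()          # hostnames are case-insensitive
    path = parts.path.rstrip("/") or "/"   # ignore trailing slashes
    # drop tracking params (utm_*) and sort the rest for a stable order
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query)
        if not k.lower().startswith("utm_")
    ))
    return urlunsplit((scheme, netloc, path, query, ""))

def dedupe(datasets):
    """Keep the first dataset seen for each canonical source_url."""
    seen = set()
    unique = []
    for ds in datasets:
        key = normalize_source_url(ds["source_url"])
        if key not in seen:
            seen.add(key)
            unique.append(ds)
    return unique
```

Because the logic is isolated in `normalize_source_url`, it can be dropped into the existing transformer wherever the dedup key is computed, without restructuring the pipeline.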
CURRENT SAMPLE OF DATASET DUPLICATION FROM source_url