SITUATION
Results from previous scraping runs show that duplicate datasets are still being scraped and subsequently harvested into CKAN. These duplicates should be removed by an improved deduplication process.
TASKS
[x] inspect the source_url of sample duplicate datasets to identify why they escape the current deduplication process
[x] based on that inspection of source_url, identify the best method(s) for trapping these duplicates
[x] translate the identified solution into flexible, reusable code that can be integrated into the current deduplication transformer with little change
ACCEPTANCE CRITERIA
[x] code is easily integrated into the current deduplication transformer with little change
[x] more duplicate datasets are caught and removed by the improved deduplication
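A common reason duplicates slip past URL-based deduplication is that the same dataset is reachable through trivially different source_url variants (scheme, host case, trailing slash, query-parameter order, tracking parameters). The sketch below shows one way to canonicalize source_url before comparison; the specific normalization rules and the function names are assumptions for illustration, not the transformer's actual API.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_source_url(url: str) -> str:
    """Canonicalize a source URL so trivially different variants
    map to the same deduplication key (assumed rules, adjust to
    the variants actually observed in the sample duplicates)."""
    parts = urlsplit(url.strip())
    scheme = "https"                       # treat http/https as equivalent
    netloc = parts.netloc.lower()          # hostnames are case-insensitive
    path = parts.path.rstrip("/") or "/"   # ignore trailing slashes
    # drop tracking params (utm_*) and sort the rest for a stable order
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query)
        if not k.lower().startswith("utm_")
    ))
    return urlunsplit((scheme, netloc, path, query, ""))

def dedupe(datasets):
    """Keep the first dataset seen for each canonical source_url."""
    seen = set()
    unique = []
    for ds in datasets:
        key = normalize_source_url(ds["source_url"])
        if key not in seen:
            seen.add(key)
            unique.append(ds)
    return unique
```

Because the logic is isolated in `normalize_source_url`, it can be dropped into the existing transformer wherever the dedup key is computed, without restructuring the pipeline.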
CURRENT SAMPLE OF DATASET DUPLICATION FROM source_url