Closed by FuhuXia 1 year ago
Could be resolved by #4007
After cleaning up the bad data in https://github.com/GSA/data.gov/issues/4007, the query above now returns 0 rows.
Ooooo this was a fun one! I remember @FuhuXia and @Jin-Sun-tts pairing heavily on this to manually clean up the DB, and then Jin got proficient enough to do it herself! 🥲
In one of the NCUA harvest jobs, the harvester added all 36 datasets as new instead of updating the existing ones. This resulted in duplicate datasets: of the 60 datasets in total, 36 are newly harvested and 24 are duplicates. This is different from the other data.json duplicate issue #2981 in several ways:
The following SQL script picks up the 24 duplicates from the NCUA org, but it also shows that the issue is widespread across other orgs.
How to reproduce
Cannot replicate.
Sketch
One-time fix: collect all IDs and delete all duplicates via the API. Long-term fix: improve the de-dupe script to handle them.
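A rough sketch of what the one-time fix could look like, not the script that was actually used. It assumes the duplicates have already been exported as records with `id`, `org`, `identifier`, and `metadata_created` fields (hypothetical field names for illustration), that the newest copy in each group is the one to keep, and that deletion goes through the CKAN action API's `package_delete` endpoint. The function names are made up for this example.

```python
import json
import urllib.request
from collections import defaultdict


def find_duplicate_ids(datasets):
    """Return ids of every copy except the newest one per (org, identifier).

    Assumption: keeping the most recently created copy is the right policy.
    """
    groups = defaultdict(list)
    for ds in datasets:
        groups[(ds["org"], ds["identifier"])].append(ds)

    to_delete = []
    for copies in groups.values():
        # ISO 8601 timestamps sort correctly as plain strings.
        copies.sort(key=lambda ds: ds["metadata_created"], reverse=True)
        to_delete.extend(ds["id"] for ds in copies[1:])
    return to_delete


def build_delete_request(base_url, api_key, dataset_id):
    """Build (but do not send) a CKAN package_delete request."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/3/action/package_delete",
        data=json.dumps({"id": dataset_id}).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )
```

Building the requests separately from sending them makes it possible to review the list of ids first; each request can then be sent with `urllib.request.urlopen(req)` once the list looks right.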