GSA / data.gov

Main repository for the data.gov service
https://data.gov

NCUA harvest job created duplicated datasets #3567

Closed FuhuXia closed 1 year ago

FuhuXia commented 2 years ago

In one of the NCUA harvest jobs, the harvester added all 36 datasets as new instead of updating the existing ones, which resulted in duplicate datasets. Of the 60 datasets in total, 36 are newly harvested and 24 are duplicates. This differs from the other data.json duplicate issue, #2981, in several ways:

  1. In the DB, one of the duplicate datasets' package_id has no linked harvest_object.
  2. On the UI, both duplicate datasets show harvest_object info pointing to the same harvest_object_id. A Solr reindex does not help.
  3. The de-dupe script does not work on them.

The following SQL query picks up the 24 duplicates in the NCUA org, but it also shows that the issue is widespread across other orgs.

SELECT "group".name, COUNT(*) FROM package
JOIN "group" ON package.owner_org = "group".id
LEFT JOIN harvest_object ON package.id = harvest_object.package_id
WHERE package.state='active' AND package.type='dataset' AND harvest_object.package_id IS NULL
GROUP BY 1
ORDER BY 2 DESC
;
                      name                      | count
------------------------------------------------+-------
 doc-gov                                        | 23423
 noaa-gov                                       |  9749
 ca-gov                                         |  2868
 usaid-gov                                      |   508
 hhs-gov                                        |   132
 state-of-oklahoma                              |   100
 city-of-baltimore                              |    81
 doe-gov                                        |    64
 usgs-gov                                       |    48
 usda-gov                                       |    45
 city-of-new-york                               |    39
 epa-gov                                        |    33
 federal-laboratory-consortium                  |    29
 national-credit-union-administration           |    24
 vcgi-org                                       |    17
 city-of-sioux-falls                            |    14
 doi-gov                                        |    14
 dot-gov                                        |    13
 city-of-austin                                 |    11
 king-county-washington                         |     5
 ed-gov                                         |     3
 fema-gov                                       |     3
 centers-for-disease-control-and-prevention     |     2
 va-gov                                         |     2
 national-institute-of-standards-and-technology |     2
 state-of-connecticut                           |     2
 state-of-maryland                              |     2
 city-of-baton-rouge                            |     1
 rrb-gov                                        |     1
 census-gov                                     |     1
 city-of-bloomington                            |     1
 fcc-gov                                        |     1
 doj-gov                                        |     1
(33 rows)
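
To make the logic of the query above concrete, here is a minimal sketch that runs the same anti-join (LEFT JOIN plus IS NULL) against an in-memory SQLite database and returns individual package ids instead of per-org counts. The three-table schema is a heavily simplified assumption that keeps only the columns the query touches, and the row values (`demo-org`, `pkg-orphan`, and so on) are made up for illustration; the real tables have many more columns.

```python
import sqlite3


def find_orphan_packages(conn):
    """Return (org_name, package_id) for active datasets with no harvest_object row."""
    sql = """
        SELECT "group".name, package.id
        FROM package
        JOIN "group" ON package.owner_org = "group".id
        LEFT JOIN harvest_object ON package.id = harvest_object.package_id
        WHERE package.state = 'active'
          AND package.type = 'dataset'
          AND harvest_object.package_id IS NULL
        ORDER BY 1, 2
    """
    return conn.execute(sql).fetchall()


def build_demo_db():
    """Build a tiny in-memory database with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE "group" (id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE package (id TEXT PRIMARY KEY, owner_org TEXT, state TEXT, type TEXT);
        CREATE TABLE harvest_object (id TEXT PRIMARY KEY, package_id TEXT);
        INSERT INTO "group" VALUES ('g1', 'demo-org');
        INSERT INTO package VALUES
            ('pkg-linked',  'g1', 'active',  'dataset'),
            ('pkg-orphan',  'g1', 'active',  'dataset'),
            ('pkg-deleted', 'g1', 'deleted', 'dataset');
        INSERT INTO harvest_object VALUES ('ho1', 'pkg-linked');
    """)
    return conn


orphans = find_orphan_packages(build_demo_db())
print(orphans)
```

Only the active dataset without a matching harvest_object row comes back; the linked one and the already-deleted one are filtered out. Returning ids rather than counts gives the list that a cleanup step would need.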

How to reproduce

Cannot replicate.

Sketch

One time fix: collect all ids and delete all duplicates via API. Long term fix: improve de-dupe script to handle them.
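
The one-time fix could look roughly like the sketch below. It assumes the catalog exposes the standard CKAN action API and that `package_delete` is the right action to call; the base URL, the token, and the helper names `build_delete_request` and `delete_packages` are hypothetical, and the actual cleanup for this issue may have been done differently. The function defaults to a dry run so nothing is sent until that is switched off deliberately.

```python
import json
import urllib.request


def build_delete_request(base_url, api_token, package_id):
    """Build (but do not send) a POST to CKAN's package_delete action."""
    body = json.dumps({"id": package_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/3/action/package_delete",
        data=body,
        headers={"Authorization": api_token, "Content-Type": "application/json"},
        method="POST",
    )


def delete_packages(base_url, api_token, package_ids,
                    send=urllib.request.urlopen, dry_run=True):
    """Delete each package id; return a dict mapping id to an outcome string."""
    results = {}
    for package_id in package_ids:
        request = build_delete_request(base_url, api_token, package_id)
        if dry_run:
            # Nothing is sent; this only confirms which ids would be deleted.
            results[package_id] = "skipped (dry run)"
            continue
        with send(request) as response:
            payload = json.load(response)
        results[package_id] = "deleted" if payload.get("success") else "failed"
    return results


# Hypothetical values: replace with the real catalog URL, token, and collected ids.
preview = delete_packages("https://catalog.example.gov", "API-TOKEN", ["pkg-orphan"])
print(preview)
```

As far as I know, `package_delete` only marks the dataset as deleted rather than purging it, which would still remove it from the query above because that query filters on `package.state='active'`.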

hkdctol commented 2 years ago

Could be resolved by #4007

Jin-Sun-tts commented 2 years ago

After cleaning up the bad data in https://github.com/GSA/data.gov/issues/4007, the query above now returns 0 rows.

nickumia-reisys commented 1 year ago

Ooooo this was a fun one! I remember @FuhuXia and @Jin-Sun-tts pairing heavily on this to manually clean up the DB, and then Jin got proficient enough to do it herself! 🥲