GSA / data.gov

Main repository for the data.gov service
https://data.gov

Duplicate datasets on catalog #3919

Closed: jbrown-xentity closed this issue 1 year ago

jbrown-xentity commented 2 years ago

Catalog created a bunch of duplicate datasets via harvest. Needs to be corrected.

How to reproduce

  1. Check for dupes: https://github.com/GSA/datagov-dedupe

Expected behavior

No duplicates

Actual behavior

Lots of duplicates

Sketch

Update code so no more duplicates occur: https://github.com/GSA/ckanext-datajson/pull/120

Now we need to run de-dupe on all organizations.

Tested and confirmed on GSA org.

Can step through other organizations that have seen major changes from here: https://catalog.data.gov/api/action/package_search?q=metadata_modified:[2022-08-04T00:00:00Z+TO+NOW]&sort=metadata_modified%20desc&facet.field=[%22organization%22]
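For reference, a minimal sketch of walking that same query from Python (assuming only the public CKAN API and the `requests` library): it lists which organizations have datasets modified since 2022-08-04, so they can be de-duped one at a time.

```python
# Minimal sketch of the query above: list organizations with recently modified datasets.
import requests

CATALOG = "https://catalog.data.gov/api/action/package_search"
params = {
    "q": "metadata_modified:[2022-08-04T00:00:00Z TO NOW]",
    "sort": "metadata_modified desc",
    "facet.field": '["organization"]',
    "facet.limit": -1,
    "rows": 0,
}
facets = requests.get(CATALOG, params=params).json()["result"]["facets"]["organization"]
for org, count in sorted(facets.items(), key=lambda kv: -kv[1]):
    print(f"{org}: {count} recently modified datasets")
```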

Eventually want to see no duplicates across the platform.

jbrown-xentity commented 2 years ago

Even though we utilized #3918, there are still duplicate datasets from the old dump. Working through those now, starting with DOI... https://gsa-tts.slack.com/archives/C2N85536E/p1660231323964859

jbrown-xentity commented 2 years ago

Kicked this off again yesterday after being out for a week; we are about 35-40% done (but errors occur often).

jbrown-xentity commented 2 years ago

Leaving this here: the script got stuck on 42 datasets that existed in Solr but not in the DB. CKAN can't update datasets that don't exist in the DB, and the CKAN API offers no way to manage them other than finding them via search. They had to be removed via the CKAN CLI, using the search-index clear functionality (which is dangerous: if no dataset is specified, it clears the whole index). Now that these few are gone, the de-dupe process is clearing ~8 datasets per minute, and the DOI de-dupe should finish by the end of today.
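For reference, a rough way to spot these Solr-only ghosts from the outside, assuming nothing beyond the public API: package_search reads from the search index while package_show reads from the database, so a record that only lives in Solr appears in search results but 404s on package_show. (Removing them still requires the CKAN CLI `search-index clear` command mentioned above.)

```python
# Rough sketch: find datasets that are indexed in Solr but missing from the DB.
# The organization filter is just an example; adjust as needed.
import requests

API = "https://catalog.data.gov/api/3/action"

def solr_only_datasets(org, page_size=100, max_rows=2000):
    start = 0
    while start < max_rows:
        res = requests.get(f"{API}/package_search",
                           params={"fq": f"organization:{org}",
                                   "rows": page_size, "start": start}).json()["result"]
        if not res["results"]:
            break
        for pkg in res["results"]:
            # package_show hits the database, so a Solr-only ghost returns 404 here
            if requests.get(f"{API}/package_show", params={"id": pkg["name"]}).status_code == 404:
                yield pkg["name"]
        start += page_size

for name in solr_only_datasets("doi-gov"):
    print(name)
```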

jbrown-xentity commented 2 years ago

The script continued to run last night, and it crashed on a 404 error (similar to the other 42 errors). Before it restarted, it had 1,500 datasets left; after the restart, it had 11K. I confirmed that DOI had a harvest run yesterday afternoon, and duplicates were created. We already have a test case for this, and the tests pass. That means something is odd about these stored datasets; somehow the duplicate issue is recurring. I don't know if it's a special case (related to DOI data), or just that the fetch jobs are stepping on each other and we need to reconsider https://github.com/GSA/ckanext-datajson/pull/94. I think we need to create an investigation ticket to discover how widespread the problem is and design a repro case so we can find where the bug occurs.

jbrown-xentity commented 2 years ago

Through log analysis, I can confirm that the gather process is creating a new harvest object for datasets that already exist on the website. Consider identifier 02b6c78d-945b-4517-b685-9060f0bf0e31:

The two records have the same harvest source information, the same unique identifier, the same title, and the same source hash. Each has a unique name and a unique harvest object.
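For illustration, a small sketch of pulling both records for that identifier and printing the fields compared above; the extras keys used here (source_hash, harvest_object_id) are assumptions about how those values are stored on the catalog records.

```python
# Compare the duplicate pair sharing one data.json identifier.
# The "source_hash" / "harvest_object_id" extras keys are assumptions.
import requests

API = "https://catalog.data.gov/api/3/action/package_search"
identifier = "02b6c78d-945b-4517-b685-9060f0bf0e31"
res = requests.get(API, params={"fq": f'identifier:"{identifier}"', "rows": 10}).json()
for pkg in res["result"]["results"]:
    extras = {e["key"]: e["value"] for e in pkg.get("extras", [])}
    print(pkg["name"], pkg["title"],
          extras.get("source_hash"), extras.get("harvest_object_id"))
```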

The logs show that the existing-dataset logic isn't working as intended:

2022-08-26 08:05:16,108 INFO  [ckanext.datajson.datajson_ckan_28] Check existing dataset: 02b6c78d-945b-4517-b685-9060f0bf0e31
2022-08-26 08:05:16,109 INFO  [ckanext.datajson.datajson_ckan_28] Datajson creates a HO: 02b6c78d-945b-4517-b685-9060f0bf0e31

We are currently harvesting DOI on dev and locally to see if the problem is reproducible. We do attempt to test the re-harvest logic, both at the ckanext-datajson extension level and at the catalog.data.gov level, and all tests currently pass with no duplication.

jbrown-xentity commented 2 years ago

Locally, I was able to harvest DOI (I think if I tried a bigger source locally, my machine would choke). Somehow, on the first harvest, it duplicated all the datasets (minus a few). However, the harvest source only reports harvesting a single set of records: 28,667, because that's how many harvest objects were created by the gather process to be harvested. I ran this using multiple catalog-fetch commands. I did see the same error message pop up a few times; I'm still not sure whether that's meaningful.

I'm going to clear the harvest and restart with just one job, and see if the duplicates persist. If not, I'm going to recommend that we move to a single harvest source for now.

When I reharvest, no new duplicates are created. Just the normal additions, edits, etc. 🤷

jbrown-xentity commented 2 years ago

Was able to repurpose much of the datagov-dedupe code to add functionality that reports duplicates across organizations, and to run a GitHub Action to check this on demand. See https://github.com/GSA/datagov-dedupe/actions/runs/2952078797
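The linked action does the real reporting; for illustration, a small sketch of the same idea using only the public API: facet package_search on identifier with facet.mincount=2 for each organization and flag any organization that still has duplicates.

```python
# Sketch of a cross-organization duplicate report: any identifier facet bucket
# that survives facet.mincount=2 is one duplicated dataset within that organization.
import requests

API = "https://catalog.data.gov/api/3/action"

for org in requests.get(f"{API}/organization_list").json()["result"]:
    params = {"fq": f"organization:{org} AND type:dataset",
              "facet.field": '["identifier"]',
              "facet.limit": -1, "facet.mincount": 2, "rows": 0}
    dupes = requests.get(f"{API}/package_search",
                         params=params).json()["result"]["facets"]["identifier"]
    if dupes:
        print(f"{org}: {len(dupes)} duplicated identifiers")
```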

Current summary:

jbrown-xentity commented 2 years ago

Development harvested cleanly, and re-harvested fine without duplicates. Investigating DOI specifically in the logs, I found that the gather process ran twice for the same job, once at 2022-08-25T19:58:00 and again at 2022-08-26T07:57:03. The logs also show catalog-gather restarting often, as much as every half hour, which is suspicious since we are not regularly restarting it. I see exit statuses of 0, 1, 143, and 137, but there may be others. Seven exits/crashes occur between those two gather log statements, most of which record multiple exit codes at nearly the same time (the first example has 143, then 0, then 137, all within one second of each other). Code 137 is SIGKILL, typically from the out-of-memory killer, so it's possible we need to bump memory from 3G to 4G to make it more stable.
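For reference, exit codes above 128 are 128 plus a signal number, which is how 143 and 137 decode to SIGTERM and SIGKILL (the latter is what the out-of-memory killer sends):

```python
# Decode the container exit codes seen in the logs: codes above 128 are 128 + signal number.
import signal

for code in (143, 137):
    print(code, "=", signal.Signals(code - 128).name)  # 143 -> SIGTERM, 137 -> SIGKILL
```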

I've also discovered that the harvest-run job (which checks whether a harvest job is complete) actually re-queues items that hard-failed and are in an incomplete state, retrying each up to 5 times. This is why DOI takes so long to complete: there are generally 20+ hard failures, and if each takes 1-20 minutes to recover and restart, and each is retried 5 times, the tail end of the job drags on for hours (see the rough arithmetic below).
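Rough worst-case arithmetic behind that, using the numbers from this comment:

```python
# Back-of-the-envelope worst case using the numbers above (all approximate).
hard_failures = 20        # "generally around 20+ hard failures"
retries = 5               # harvest-run re-requests each failure 5 times
minutes_per_attempt = 20  # "1-20 minutes to recover and restart"
print(hard_failures * retries * minutes_per_attempt / 60, "hours")  # ~33 hours
```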

FuhuXia commented 2 years ago

@jbrown-xentity we are restarting gather every 30 mins. https://github.com/GSA/catalog.data.gov/blob/main/.github/workflows/restart.yml#L76

jbrown-xentity commented 2 years ago

After clearing and reharvesting, about 200 datasets were still duplicated, so this is not at the level of complete duplication at this point. The next re-harvest should occur on Wednesday, and we'll know more after that. Link to check the number of duplicates: https://catalog.data.gov/api/3/action/package_search?fq=organization:doi-gov%20AND%20type:dataset&facet.field=[%22identifier%22]&facet.limit=-1&facet.mincount=2&rows=0

jbrown-xentity commented 2 years ago

An initial analysis of the duplicate list for DOI is interesting; we have already identified multiple duplicate types.

Please note that there may be other types of duplicates, but finding others may require a more detailed analysis.

jbrown-xentity commented 2 years ago

We will re-evaluate the duplicate count and items after a re-harvest, to determine whether these situations are reproducible and/or whether different duplicate types arise once the initial data is already there.

jbrown-xentity commented 2 years ago

The duplicate count is unchanged at 163. Attached is the list of duplicate IDs for DOI.

We will consider this research complete. I will create 2 tickets, 1 for each duplicate type we have found, to investigate further how it may be occurring.

doi-duplicates.txt