
CKAN duplicate dataset (different harvest object) #3968

Closed: jbrown-xentity closed this issue 1 year ago

jbrown-xentity commented 2 years ago

Dataset duplicates, created around the same time with different harvest object IDs:

These even have the same harvest object ID, but were harvested almost 24 hours apart (on the same job; at the time of this analysis, only 1 harvest job had been run). There is no reason for this to occur; the code should be preventing it. The only reason I can imagine is that the harvest fails to mark this object as "complete" but runs through everything up to that point (including actually creating the dataset), and it somehow re-enters the queue.

How to reproduce

  1. See links above, OR
  2. Go to a recent run of the datagov-dedupe "check orgs for duplicates" action
  3. Download the output file and pick an organization to evaluate (we picked fcc-gov)
  4. Run https://catalog.data.gov/api/action/package_search?fq=organization:fcc-gov&facet.field=[%22identifier%22]&facet.limit=-1&facet.mincount=2
  5. Check individual identifier using https://catalog.data.gov/api/action/package_search?q=identifier:%22https://opendata.fcc.gov/api/views/wxit-5xtm%22
  6. Note the different harvest object IDs, names, metadata created & modified dates, etc. (see the sketch below)
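
A minimal sketch that automates steps 4 and 5 for any organization, assuming the same package_search facet shape as the queries above (ORG is a placeholder; fcc-gov is the example org from step 3):

$ ORG=fcc-gov
$ curl -s "https://catalog.data.gov/api/action/package_search?fq=organization:${ORG}&facet.field=[%22identifier%22]&facet.limit=-1&facet.mincount=2&rows=0" \
    | jq -r '.result.facets.identifier | to_entries[] | "\(.value)x \(.key)"'

Each output line is a count followed by an identifier that appears on two or more datasets; any non-empty output means that organization has duplicates.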

Expected behavior

No duplicate datasets are harvested from a DCAT source (by design)

Actual behavior

Duplicates are created

Sketch

The goal of this ticket is to investigate and propose a resolution to this problem. This is a different outcome (the datasets are created in CKAN), but it may be related to https://github.com/GSA/data.gov/issues/3943 and #3944.

Once the solution is in place, follow-up work will need to be done (or a ticket created) to

  1. Discover how many items this affects
  2. Clean up any affected records

This may require creating custom processes/scripts in https://github.com/GSA/datagov-dedupe

btylerburton commented 2 years ago

The script included in the above PR can be run periodically to check and compare output.

We can discuss as a team how we can improve the script and monitoring.

nickumia-reisys commented 2 years ago

This is a first pass of the results, excluding the identifiers that produced errors in the script: output.txt

@btylerburton @FuhuXia @Jin-Sun-tts @jbrown-xentity @robert-bryson

FuhuXia commented 2 years ago

Great. We will do another output run one week from today, then compare the outputs and evaluate whether duplicates with different harvest objects are still occurring in the current setup.
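
A sketch of that comparison, assuming each run's duplicate identifiers are snapshotted to a dated file (file names and dates are hypothetical):

$ curl -s "https://catalog.data.gov/api/action/package_search?facet.field=[%22identifier%22]&facet.limit=-1&facet.mincount=2&rows=0" \
    | jq -r '.result.facets.identifier | keys[]' > "duplicates-$(date +%F).txt"
$ diff duplicates-2022-09-01.txt duplicates-2022-09-08.txt  # hypothetical snapshot dates

Lines prefixed with > in the diff output are identifiers that became duplicated since the previous snapshot; an empty diff means the duplicate set is stable.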

jbrown-xentity commented 1 year ago

We still have outstanding items in this category. See https://catalog.data.gov/api/action/package_search?fq=guid:%22B01_doppler_rt%22. I believe many of these items (https://catalog.data.gov/api/action/package_search?fq=type:dataset%20AND%20organization:noaa-gov&facet.field=[%22guid%22]&facet.limit=-1&facet.mincount=2&rows=0) are of the same class. We need to figure out whether these are growing; if not, we can create a follow-up ticket to remove the current duplicates. If they are growing, we need to figure out why.

FuhuXia commented 1 year ago

Datajson type of dataset:

After running the dedupe script for a few organizations, the datajson type of duplicates has stabilized and been reduced to a manageable count.

$ curl -s "https://catalog.data.gov/api/action/package_search?facet.field=[%22identifier%22]&facet.limit=-1&facet.mincount=2&fq=((*:*%20AND%20-organization:centers-for-disease-control-and-prevention)%20AND%20(collection_package_id:*%20OR%20*:*))" | jq '.result.facets.identifier | length'

78

Considering there are some false positives from common identifiers, the real number is in the single digits.

XML type of dataset:

$ curl -s "https://catalog.data.gov/api/action/package_search?facet.field=[%22guid%22]&facet.limit=-1&facet.mincount=2" | jq '.result.facets.guid | length'
23870

64748 - 23870 = 40878

(64748 is the total number of packages sharing those guids, i.e. the sum of all the facet counts; the same calculation for 17897 is shown below.) According to this calculation, we have 40878 duplicates. If we exclude the organization noaa-gov, there are only 3 duplicate datasets. We can improve the process defined in #4007 to delete those datasets that have no current harvest_object; that will eliminate 90% of the duplicates and make further troubleshooting easier.

FuhuXia commented 1 year ago

All packages that have harvest objects but no current harvest object are cleared.

SELECT "group".name, COUNT(*)
FROM package
JOIN "group" ON package.owner_org = "group".id 
LEFT JOIN harvest_object ON package.id = harvest_object.package_id AND harvest_object.current
WHERE package.state='active'
AND package.type='dataset'
AND harvest_object.package_id IS NULL
GROUP BY 1 ORDER BY 2 DESC;

 name | count
------+-------
(0 rows)

Datajson still has 77 packages with the same identifier:

$ curl -s "https://catalog.data.gov/api/action/package_search?facet.field=[%22identifier%22]&facet.limit=-1&facet.mincount=2&fq=((*:*%20AND%20-organization:centers-for-disease-control-and-prevention)%20AND%20(collection_package_id:*%20OR%20*:*))" | jq '.result.facets.identifier | length'

77

XML type has 9339 packages remaining to be examined.

$ curl -s "https://catalog.data.gov/api/action/package_search?facet.field=[%22guid%22]&facet.limit=-1&facet.mincount=2" | jq '.result.facets.guid | length'
8558

17897 - 8558 = 9339

jbrown-xentity commented 1 year ago

To be clear, about the ISO packages that need to be examined: some of the duplicates have more than one extra copy (from a high-level examination, at least 1 in 20 have 2 duplicates)... @FuhuXia I'm not totally sure I'm following the math above; I'm not sure where the 17897 comes from.

FuhuXia commented 1 year ago

8558 is the number of guids that have a minimum of 2 packages:

$ curl -s "https://catalog.data.gov/api/action/package_search?facet.field=[%22guid%22]&facet.limit=-1&facet.mincount=2" | jq '.result.facets.guid | length'
8558

17897 is the total count of those packages, obtained by adding all the facet counts together:

$ curl -sL 'https://catalog.data.gov/api/action/package_search?facet.field=[%22guid%22]&facet.limit=-1&facet.mincount=2' | jq -r '.result.facets.guid' | jq -r '[.[]]|add'
17897

Out of the 17897, we should keep 8558 and eliminate 17897 - 8558 = 9339.

This 9339 is calculated purely from API results. The DB might have a different count, but it should be close.
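
The whole calculation can also be done in a single call; a sketch under the same API assumptions, where jq's add sums the per-guid counts and length counts the distinct guids:

$ curl -sL 'https://catalog.data.gov/api/action/package_search?facet.field=[%22guid%22]&facet.limit=-1&facet.mincount=2' \
    | jq '.result.facets.guid | add - length'  # should print 17897 - 8558 = 9339 given the counts above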

FuhuXia commented 1 year ago

Did a dry-run dedupe on org noaa-gov; it can eliminate all 9339 duplicates:

$ pipenv run python duplicates-identifier-api.py noaa-gov --api-key ### --newest --geospatial
...
Deduplicating guid=ww3_hawaii progress=(8558, 8558) organization=noaa-gov
...
Summary duplicate_count=9339 organization=noaa-gov

If we want to run the dedupe by harvest source, here are the stats of the 8558 packages by harvest source:

NCDDC OER: 1
NOAA/NESDIS/ncei/accessions: 3487
ioos: 5070

jbrown-xentity commented 1 year ago

I think running on the whole org is fine; it will probably take a while, but not more than a day. I'm most concerned with the --newest flag: do we know if we want to keep the newest or the oldest? Were the created duplicates a mistake, with updates having occurred on the original (so we should keep the original)? Or were the created duplicates actually meant to be replacements whose originals didn't get deleted (so we should keep the newest)? If we can at least find some examples of items that have been updated since being created, maybe we can hypothesize which one to keep...
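
One way to gather that evidence, sketched with the B01_doppler_rt guid from earlier in the thread (metadata_created and metadata_modified are standard CKAN package fields):

$ curl -s 'https://catalog.data.gov/api/action/package_search?fq=guid:%22B01_doppler_rt%22' \
    | jq -r '.result.results[] | [.name, .metadata_created, .metadata_modified] | @tsv'

If one copy shows a metadata_modified well after its metadata_created while the other copy does not, that hints the updated copy is the one worth keeping.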

FuhuXia commented 1 year ago

Running the dedupe script to remove 9339 duplicate datasets from noaa-gov, keeping the last-modified one. Estimated 24 hours to finish.

FuhuXia commented 1 year ago

All datajson and XML type duplicates are cleared. Proof.

All datasets without a harvest object, or without any current harvest object, are cleared; this was verified by running the query defined in https://github.com/GSA/data.gov/issues/4007#issuecomment-1317366807.

chris-macdermaid commented 1 year ago

Great job!!!

I still see some duplicate CDC records that result from the same record, with the same identifier, being harvested from both a CDC source and an HHS source. For example: https://catalog.data.gov/dataset?q=identifier:"https://data.cdc.gov/api/views/xdg2-nh8n"

This query shows the duplicate CDC and HHS records: https://catalog.data.gov/api/3/action/package_search?fq=(organization:hhs-gov OR organization:centers-for-disease-control-and-prevention) AND type:dataset&facet.field=["identifier"]&facet.limit=-1&facet.mincount=2&rows=0

FuhuXia commented 1 year ago

Thanks @chris-macdermaid. Identifiers are not globally unique, so it is normal to see some duplicate counts from the straightforward API call. I have a gist script to filter out duplicate packages that come from different harvest sources. Using this script, we have 0 real datajson-type duplicates.
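
The filtering idea, sketched (this is not the actual gist; it assumes harvested packages expose a harvest_source_title extra): an identifier only counts as a real duplicate when two of its packages came from the same harvest source.

$ ID='https://data.cdc.gov/api/views/xdg2-nh8n'  # example identifier from the previous comment
$ curl -sG 'https://catalog.data.gov/api/action/package_search' \
    --data-urlencode "fq=identifier:\"${ID}\"" \
    | jq -r '.result.results[].extras[]? | select(.key == "harvest_source_title") | .value' \
    | sort | uniq -d  # any output means one source harvested the identifier twice

An empty result across all identifiers means the remaining duplicate counts are cross-source false positives.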

jbrown-xentity commented 1 year ago

We have more of these cropping up. See https://catalog.data.gov/api/3/action/package_search?fq=type:dataset&facet.field=[%22identifier%22]&facet.limit=-1&facet.mincount=2&rows=0 for a more complete list; below are 2 DOI use cases:

Reopening this ticket for triage.

FuhuXia commented 1 year ago

I could not find duplicates for the two DOI use cases provided; it seems CKAN has fixed them on its own. We can close this if no more cases can be found.

jbrown-xentity commented 1 year ago

Here is one: https://catalog.data.gov/api/action/package_search?q=identifier:%22126cce50-ea04-482c-837a-0a04b4f4d05a%22&facet.field=[%22harvest_object_id%22,%20%22harvest_source_title%22] Same harvest source, same identifier, but multiple harvest_object_ids.
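
A quick way to inspect such a case, as a sketch using the same query piped through jq:

$ curl -s 'https://catalog.data.gov/api/action/package_search?q=identifier:%22126cce50-ea04-482c-837a-0a04b4f4d05a%22&facet.field=[%22harvest_object_id%22,%20%22harvest_source_title%22]' \
    | jq '.result.facets'

One harvest_source_title key alongside multiple harvest_object_id keys confirms a single source produced multiple harvest objects for the same identifier.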

jbrown-xentity commented 1 year ago

That item is resolved. Duplicate numbers are low enough to mark this as closed for now...