NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

De-duplicate / combine records between NCBI GEO + OmicsDI #43

Closed flaneuse closed 4 months ago

flaneuse commented 2 years ago

OmicsDI harvests all the records from GEO, but with less metadata. Combine relevant metadata and de-duplicate the records. See also #40

Related WBS task

https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/14

gtsueng commented 10 months ago

OMICS-DI-ingested GEO dataset records will have identifiers that follow this format:

Merging will be challenging as there is metadata unique to each.

For example, https://data.niaid.nih.gov/resources?id=OMICSDI_PRJNA775608 and https://data.niaid.nih.gov/resources?id=GEO_GSE186705 appear to be duplicate entries. However (in this example):

A sampling/survey should be performed to determine the extent and consistency of these differences. The description for GEO datasets coming from OMICS-DI appears to simply mirror the 'name' field; hence, 'description' will likely always be better from GEO.
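A minimal sketch of such a survey, assuming the records are available as Python dicts with 'name' and 'description' keys. The helper name and the sample records are hypothetical and are not taken from the NDE codebase:

```python
def mirrored_description_rate(records):
    """Return the fraction of records whose description merely repeats the name.

    Assumes each record is a dict with optional 'name' and 'description' keys.
    Records without a description are left out of the denominator.
    """
    def norm(value):
        # Compare case-insensitively and collapse runs of whitespace.
        return " ".join((value or "").split()).lower()

    checked = [r for r in records if norm(r.get("description"))]
    if not checked:
        return 0.0
    mirrored = sum(
        1 for r in checked if norm(r["description"]) == norm(r.get("name"))
    )
    return mirrored / len(checked)


# Hypothetical sample, for illustration only.
sample = [
    {"name": "Study A", "description": "Study A"},
    {"name": "Study B", "description": "A longer abstract describing study B."},
]
rate = mirrored_description_rate(sample)
```

Running this over a random sample from each source would give a rough estimate of how often the OMICS-DI 'description' adds nothing beyond 'name'.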

Note that after the first pass, we can check for additional duplicates by matching metadata from the following fields:

gtsueng commented 8 months ago

Initially, it appeared that GEO has better values for:

In the following example of duplicates, NDE's record of the dataset as ingested from OMICS-DI has a longer description than NDE's record of the dataset as ingested from GEO. Further investigation reveals that the OMICS-DI record for this dataset does not match NDE's record of the dataset. Furthermore, NDE's record as ingested from OMICS-DI actually aggregates additional information that is available in GEO's record of the dataset but not in NDE's ingestion of the record from GEO.

NDE's ingestion of the record from OMICS-DI: [image]

OMICS-DI's version of the record: [image]

NDE's ingestion of the record from GEO: [image]

GEO's version of the record: [image]

Resolving the differences between the handling of the 'description' field should enable us to merge the duplicates without worrying about which repository should override for conflicting 'descriptions'. Differences in the parsing may affect other metadata fields; however, it may not be as necessary to resolve those as they are expected to be unique between the repositories and should theoretically 'merge' without overwriting.
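One way to check which fields would merge cleanly and which would collide is to compare a duplicate pair field by field. The following is only a sketch: the function name, the notion of an "empty" value, and the example records are assumptions made for illustration.

```python
def compare_records(a, b, ignore=("_id",)):
    """Split the fields of two duplicate records into three groups.

    Returns (only_in_a, only_in_b, conflicts), each a sorted list of field names.
    A field counts as present only if its value is not None, '', [] or {}.
    """
    def present(rec, key):
        return rec.get(key) not in (None, "", [], {})

    keys = (set(a) | set(b)) - set(ignore)
    only_a = sorted(k for k in keys if present(a, k) and not present(b, k))
    only_b = sorted(k for k in keys if present(b, k) and not present(a, k))
    conflicts = sorted(
        k for k in keys if present(a, k) and present(b, k) and a[k] != b[k]
    )
    return only_a, only_b, conflicts


# Hypothetical pair; the field values are invented for illustration.
geo = {"_id": "GEO_EXAMPLE", "name": "X", "description": "short", "author": ["A"]}
omics = {
    "_id": "OMICSDI_EXAMPLE",
    "name": "X",
    "description": "a longer text",
    "variableMeasured": ["RNA"],
}
only_geo, only_omics, conflicts = compare_records(geo, omics)
```

Fields reported in the first two groups can be merged without overwriting anything; only the fields in `conflicts` need a rule or a parser fix.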

gtsueng commented 8 months ago

For the merge between OMICS-DI and GEO, the following fields may have conflicting data that should be resolved by fixes to the OMICS-DI and GEO parsers themselves:

The following fields may have data unique to ingestion from one repository vs the other. This can potentially be resolved by addressing the parsers; however, the fastest way to resolve this is to simply merge as the merger should keep the field that has a value.

The following fields should be handled by appending the values:

gtsueng commented 8 months ago

Per discussions the week of 2023.10.16, the GEO parser will be improved to have better descriptions. Both OMICS-DI and GEO will have the source names removed from the _id prefix to enable merging of GEO datasets with OMICS-DI datasets (which apparently augment each dataset with a variableMeasured value).
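A rough sketch of how prefix removal could enable the merge. It assumes both sources carry the same accession after the source prefix, that the prefixes are `GEO_` and `OMICSDI_` (as in the identifiers quoted earlier in this thread), and that appended fields hold lists. The function names and the choice of `APPEND_FIELDS` are hypothetical and do not describe the actual nde-crawlers implementation.

```python
from collections import defaultdict

# Assumed prefixes, based on the identifiers quoted earlier in this thread.
SOURCE_PREFIXES = ("GEO_", "OMICSDI_")
# Hypothetical choice of list-valued fields that are appended rather than kept.
APPEND_FIELDS = ("includedInDataCatalog",)


def strip_source_prefix(record_id):
    """Drop a leading source name so both sources yield the same _id."""
    for prefix in SOURCE_PREFIXES:
        if record_id.startswith(prefix):
            return record_id[len(prefix):]
    return record_id


def merge_pair(base, other):
    """Merge `other` into a copy of `base` without overwriting existing values."""
    merged = dict(base)
    for key, value in other.items():
        if key in APPEND_FIELDS:
            # Append new list items, skipping ones already present.
            combined = list(merged.get(key, []))
            combined += [v for v in value if v not in combined]
            merged[key] = combined
        elif merged.get(key) in (None, "", [], {}):
            # Keep whichever record actually has a value for this field.
            merged[key] = value
    return merged


def merge_by_id(records):
    """Group records by prefix-free _id and fold each group into one record."""
    groups = defaultdict(list)
    for rec in records:
        groups[strip_source_prefix(rec["_id"])].append(rec)
    result = {}
    for new_id, group in groups.items():
        merged = {}
        for rec in group:
            merged = merge_pair(merged, rec)
        merged["_id"] = new_id
        result[new_id] = merged
    return result
```

In this sketch the first record in each group wins any conflict, so the order of the input decides which source takes precedence for fields such as 'description'.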

gtsueng commented 8 months ago

Implemented, but awaiting build completion in staging

gtsueng commented 6 months ago

Available on staging

gtsueng commented 6 months ago

The estimated number of duplicate records that have been de-duplicated/merged:

- 69782 between NCBI GEO and OMICS-DI
- 290 between LINCS and OMICS-DI

Once approved, we will move this to production

hartwickma commented 5 months ago

Thank you for the presentation on this issue from Scripps on 7 January 2024. The outcomes from this approach sound promising. OK to move to production.

Please note: NIAID recognizes and is appreciative of the level of detail and would suggest that requests for NIAID feedback be focused more on outcome and impact rather than methods and approaches.

gtsueng commented 5 months ago

This was moved to production on 2024.01.29. The status of the issue has been changed to pending close-out, and it will be closed after 1 week if there are no further concerns.