flaneuse closed this issue 4 months ago
OMICS-DI-ingested GEO dataset records will have identifiers that follow this format:
Merging will be challenging as there is metadata unique to each.
For example, https://data.niaid.nih.gov/resources?id=OMICSDI_PRJNA775608 and https://data.niaid.nih.gov/resources?id=GEO_GSE186705 appear to be duplicate entries. However (in this example):
A sampling survey should be performed to determine the extent and consistency of these differences. The description for GEO datasets coming from OMICS-DI appears to mirror the 'name' field; hence, 'description' will likely always be better from GEO.
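A rough sketch of how such a survey could be scored, assuming the sampled records are plain dictionaries with 'name' and 'description' keys. The helper names `description_mirrors_name` and `mirrored_fraction` are hypothetical and not part of any NDE parser:

```python
def description_mirrors_name(record):
    """Return True when 'description' is empty or merely repeats 'name'."""
    name = (record.get("name") or "").strip().lower()
    description = (record.get("description") or "").strip().lower()
    return description == "" or description == name


def mirrored_fraction(records):
    """Share of sampled records whose description adds nothing beyond the name."""
    records = list(records)
    if not records:
        return 0.0
    return sum(description_mirrors_name(r) for r in records) / len(records)
```

Running this separately on a sample of OMICS-DI-ingested records and on the matching GEO-ingested records would give a first indication of how consistent the mirroring is.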
Note that after the first-pass we can check for additional duplicates by matching metadata from the following fields:
Initially, it appeared that GEO has better values for:
In the following example of duplicates, the NDE's record of the dataset as ingested from OMICS-DI has a longer description than the NDE's record of the dataset as ingested from GEO. Further investigation reveals that the OMICS-DI record for this dataset does not match NDE's record of the dataset. Furthermore, the NDE's record as ingested from OMICS-DI aggregates additional information that is available in GEO's record of the dataset but missing from NDE's ingestion of the record from GEO.
NDE's ingestion of the record from OMICS-DI
OMICS-DI's version of the record
NDE's ingestion of the record from GEO
GEO's version of the record
Resolving the differences between the handling of the 'description' field should enable us to merge the duplicates without worrying about which repository should override for conflicting 'descriptions'. Differences in the parsing may affect other metadata fields; however, it may not be as necessary to resolve those as they are expected to be unique between the repositories and should theoretically 'merge' without overwriting.
For the merge between OMICS-DI and GEO, the following fields may have conflicting data that should be resolved by fixes to the OMICS-DI and GEO parsers themselves:
The following fields may have data unique to ingestion from one repository vs the other. This can potentially be resolved by addressing the parsers; however, the fastest way to resolve this is to simply merge as the merger should keep the field that has a value.
The following fields should be handled by appending the values:
Per discussions the week of 2023.10.16, the GEO parser will be improved to have better descriptions. Both OMICS-DI and GEO will have the source names removed from the _id prefix to enable merging of GEO datasets with OMICS-DI datasets (which apparently augment each dataset with a variableMeasured value).
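The merge rules discussed above can be sketched roughly as follows. This is an illustration only, not the NDE implementation. The prefixes `OMICSDI_` and `GEO_` are taken from the example URLs in this issue; the function names, the `append_fields` default (`includedInDataCatalog`), and the rule that the first record wins on a conflict are all assumptions made for the example.

```python
from collections import defaultdict

# Prefixes as they appear in the example record URLs in this issue.
SOURCE_PREFIXES = ("OMICSDI_", "GEO_")


def strip_source_prefix(record_id):
    """Drop the source name from an _id so duplicates share one key."""
    for prefix in SOURCE_PREFIXES:
        if record_id.startswith(prefix):
            return record_id[len(prefix):]
    return record_id


def as_list(value):
    """Normalise a missing, scalar, or list value to a list."""
    if value is None:
        return []
    return list(value) if isinstance(value, list) else [value]


def merge_records(primary, secondary, append_fields=("includedInDataCatalog",)):
    """Merge two records: append list-like fields, otherwise keep whichever
    side has a value. On a true conflict the primary value is kept (assumption)."""
    merged = dict(primary)
    for field, value in secondary.items():
        if field == "_id":
            continue
        if field in append_fields:
            combined = as_list(merged.get(field))
            for item in as_list(value):
                if item not in combined:  # avoid duplicate entries
                    combined.append(item)
            merged[field] = combined
        elif merged.get(field) in (None, "", []):
            merged[field] = value
    return merged


def merge_by_bare_id(records):
    """Group records by prefix-free _id and merge each group into one record."""
    groups = defaultdict(list)
    for record in records:
        groups[strip_source_prefix(record["_id"])].append(record)
    merged = []
    for bare_id, group in groups.items():
        combined = group[0]
        for other in group[1:]:
            combined = merge_records(combined, other)
        merged.append({**combined, "_id": bare_id})
    return merged
```

Passing the GEO-ingested record first makes its 'description' win, while a 'variableMeasured' value present only on the OMICS-DI side is carried over because the GEO record has nothing in that field.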
Implemented, but awaiting build completion in staging
Available on staging
The estimated number of duplicate records that have been de-duplicated/merged is 69,782 between NCBI GEO and OMICS-DI, and 290 between LINCS and OMICS-DI.
Once approved, we will move this to production
Thank you for the presentation on this issue from Scripps on 7 January 2024. The outcome from this approach sounds promising. OK to move to production.
Please note: NIAID recognizes and is appreciative of the level of detail and would suggest that requests for NIAID feedback be focused more on outcome and impact rather than methods and approaches.
This was moved to Production on 2024.01.29. The status of the issue has been changed to pending close out, and it will close after 1 week if there are no further concerns about this issue.
OmicsDI harvests all the records from GEO, but with less metadata. Combine relevant metadata and de-duplicate the records. See also #40
Related WBS task
https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/14