NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

[Parser Fix]: Investigate propagation of missing SRA metadata from Experiment-level records #145

Open gtsueng opened 4 weeks ago

gtsueng commented 4 weeks ago

Issue Name

Investigate propagation of missing SRA metadata from Experiment-level records

Issue Description

Currently, SRA records appear to be parsed from Study-level records. These records are missing key metadata fields that are desirable for inclusion in the NDE, including species/infectiousAgent information. This information can instead be found in the nested/associated Experiment-level records in SRA.

To do: Determine whether there is a way to propagate 'author' and 'species/infectiousAgent' metadata from the Experiment-level record up to the Study-level record, as the Study-level SRA records currently appear to be missing this information.
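As a starting point for the discussion, here is a minimal sketch of what the propagation could look like once both record levels have been parsed into dictionaries. It is not the existing parser code: the function name, the field list, and the assumption that each record is a plain dict whose fields hold either a single object or a list of objects are all hypothetical.

```python
def propagate_experiment_metadata(study, experiments,
                                  fields=("author", "species", "infectiousAgent")):
    """Return a copy of the Study-level record with missing fields
    filled in from its Experiment-level records (hypothetical helper)."""
    merged = dict(study)
    for field in fields:
        if merged.get(field):
            # The Study-level record already carries this field; leave it alone.
            continue
        collected = []
        for experiment in experiments:
            values = experiment.get(field) or []
            if isinstance(values, (dict, str)):
                # Normalize a single value to a one-item list.
                values = [values]
            for value in values:
                if value not in collected:
                    # Keep first-seen order and skip exact duplicates.
                    collected.append(value)
        if collected:
            merged[field] = collected
    return merged
```

Fields that no Experiment-level record provides are simply left absent, so the Study-level record is never padded with empty lists.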

Issue Example

Example record in NDE missing 'species'/'infectiousAgent' and 'author' information: https://data.niaid.nih.gov/resources?id=ncbi_sra_srp253552

Example Study-level record in SRA (also missing the crucial fields): https://trace.ncbi.nlm.nih.gov/Traces/?view=study&acc=SRP253552

Example Experiment-level record containing the relevant fields: https://www.ncbi.nlm.nih.gov/sra/SRX7964236[accn]

Related WBS task

For internal use only. Assignee, please select the status of this issue

Status Description

No response

gtsueng commented 4 weeks ago

@DylanWelzel It looks like the parser already has methods for pulling the species/infectiousAgent info and the author info, but they don't appear to be working.

gtsueng commented 4 days ago

@DylanWelzel from what I've seen on Staging, the standardization and delineation pipelines are working well for SRA. However, after standardization and delineation, the original term remains in the species field even if a standardized version has been moved to the infectiousAgent field. This makes it look like the term is duplicated across infectiousAgent and species.

Part of the problem may stem from how species values are formatted at ingest. It appears that NCBI SRA species info is formatted as "name": , "additionalType": {"name": "species", "url": <NCIT url for the term, species>}. This "additionalType" object might be interfering with the usual behavior of the pipeline, which would normally remove from the species field any standardized term that has been identified as an infectiousAgent.
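To make the expected behavior concrete, here is a rough sketch of a cleanup step that drops species entries already represented under infectiousAgent, comparing only on the term itself so that extra keys such as "additionalType" cannot affect the match. This is an illustration rather than the actual pipeline code: the function names are hypothetical, and it assumes that both fields are lists of dicts and that standardized entries keep the source term in an "originalName" key.

```python
def _term_keys(entry):
    """Collect the lowercase terms an entry can be matched on."""
    keys = set()
    # "originalName" is an assumed key holding the pre-standardization term.
    for field in ("name", "originalName"):
        value = entry.get(field)
        if isinstance(value, str) and value.strip():
            keys.add(value.strip().lower())
    return keys


def drop_species_listed_as_agents(record):
    """Return a copy of the record without species entries whose term
    also appears under infectiousAgent (hypothetical helper)."""
    agent_keys = set()
    for agent in record.get("infectiousAgent", []):
        agent_keys |= _term_keys(agent)

    cleaned = dict(record)
    # Match on terms only, so keys like "additionalType" are ignored.
    kept = [entry for entry in record.get("species", [])
            if not (_term_keys(entry) & agent_keys)]
    if kept:
        cleaned["species"] = kept
    else:
        # Drop the field entirely rather than leave an empty list.
        cleaned.pop("species", None)
    return cleaned
```

Matching on the term rather than on whole-object equality is the point of the sketch: two entries that describe the same organism compare as equal even when only one of them carries an "additionalType" object.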

Here are a few examples to facilitate the investigation of the issue:

DylanWelzel commented 4 days ago

The NCBI SRA fix is live on the staging api. The links above no longer include the original term in the species field.

gtsueng commented 3 days ago

Looks good. Please push the updates to Production. I am marking this issue as 'pending close out' and will close it in a week if there are no further concerns.