NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

[Normalization] Use PubTator API to normalize `species`, `infectiousAgent`, `healthCondition` #90

Closed flaneuse closed 4 months ago

flaneuse commented 1 year ago

Using the PubTator API:

  1. Standardize existing species and infectiousAgents to NCBI Taxonomy
  2. Extract species and infectiousAgents using concept=species
  3. Standardize existing healthCondition to MeSH identifiers using concept=disease
  4. Extract healthCondition using concept=disease.

Notes:

flaneuse commented 1 year ago

looks like the commonName (or whatever we decide to call it) is at least partially covered by has_broad_synonym: https://ontobee.org/ontology/NCBITaxon?iri=http://purl.obolibrary.org/obo/NCBITaxon_2697049

has_exact_synonym can then map to alternateName, split on ;
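
A minimal sketch of that split, assuming the synonyms arrive as one `;`-delimited string. The function name, the input shape, and the case-insensitive de-duplication are illustrative assumptions, not part of the crawler code:

```python
def synonyms_to_alternate_names(has_exact_synonym: str) -> list[str]:
    """Split a ';'-delimited synonym string into a list suitable for alternateName."""
    names = []
    seen = set()
    for part in has_exact_synonym.split(";"):
        name = part.strip()
        # Skip empty fragments (e.g. from a trailing ';') and
        # case-insensitive repeats (an assumption, not a stated requirement).
        if name and name.lower() not in seen:
            seen.add(name.lower())
            names.append(name)
    return names
```

The first spelling encountered is the one that is kept, so the order of the source string is preserved.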

gtsueng commented 1 year ago

Using PubTator to normalize species and infectiousAgent appears to be working well. Out of 50 mappings for species, PubTator was unable to find mappings for 2 entries. That makes sense, as those entries were for specific haplotypes of a strain of a species rather than the strain itself, so it is more of an issue with the original data. It also mapped 2 entries to more specific records than was warranted.

For healthCondition, out of 51 mappings investigated, 7 had better entries in MeSH than what was actually mapped by PubTator. Another 5 were matches that weren't great due to the use of MeSH as a disease ontology (it's really not suitable for this, but everyone does it). These 5 matches are the best PubTator can do and reflect an issue with MeSH rather than with PubTator.

That said, the number of values available for healthCondition was quite limited as the original sources likely did not provide metadata for this field. One potential way to bypass this limitation is to harvest the MeSH terms from PMID records that cite the dataset. Medline/PubMed records include MeSH terms (curated by NLM biocurators), though additional checks would be needed for filtering the terms for species, infectiousAgent, and healthCondition.
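
One way the extra filtering could look, as a rough sketch only: MeSH tree numbers begin with a category letter (B for Organisms, C for Diseases), so harvested terms can be bucketed by that prefix. The input shape (a term name paired with its tree numbers) and the function name are assumptions for illustration. Separating species from infectiousAgent within the organism bucket would still need a further check that is not shown here.

```python
def bucket_mesh_terms(terms):
    """Bucket MeSH terms by the category letter of their tree numbers.

    terms: iterable of (name, tree_numbers) pairs, where tree_numbers is a
    list of strings such as "C01.748" (hypothetical input shape).
    """
    buckets = {"organism": [], "healthCondition": []}
    for name, tree_numbers in terms:
        # The first character of a tree number is its top-level category.
        categories = {tn[0] for tn in tree_numbers if tn}
        if "B" in categories:  # Organisms: candidate species / infectiousAgent
            buckets["organism"].append(name)
        if "C" in categories:  # Diseases: candidate healthCondition
            buckets["healthCondition"].append(name)
    return buckets
```

Terms that fall in neither category (anatomy, chemicals, and so on) are dropped, and a term with tree numbers in both categories lands in both buckets.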

gtsueng commented 8 months ago

Per discussions during the week of 2023.10.16, PubTator annotations are being stored in a SQLite database as a table that links raw text to the corresponding ontology value. While manually overriding a 'dictionary entry' is possible, the ability to override, remove, or alter annotations at the record level will be valuable, especially for machine-generated annotations.

Additionally, PubTator can annotate raw text, so we can gradually annotate and store the content of records with no citation.pmid. Being able to store, maintain, and edit annotations at the record level will be useful in this circumstance, especially since users may report an incorrect annotation.
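
To make the dictionary-plus-override idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and the `resolve` helper are hypothetical and do not reflect the actual schema. A record-level row takes precedence over the shared dictionary, and a NULL `ontology_id` in the override table marks the annotation as removed for that record.

```python
import sqlite3

def build_annotation_store(conn):
    """Create the shared dictionary table and the record-level override table."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS dictionary (
            raw_text    TEXT PRIMARY KEY,
            ontology_id TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS record_override (
            record_id   TEXT NOT NULL,
            raw_text    TEXT NOT NULL,
            ontology_id TEXT,  -- NULL means the annotation is removed for this record
            PRIMARY KEY (record_id, raw_text)
        );
    """)

def resolve(conn, record_id, raw_text):
    """Return the ontology ID for raw_text, preferring a record-level override."""
    row = conn.execute(
        "SELECT ontology_id FROM record_override WHERE record_id = ? AND raw_text = ?",
        (record_id, raw_text),
    ).fetchone()
    if row is not None:
        return row[0]  # may be None when the annotation was removed
    row = conn.execute(
        "SELECT ontology_id FROM dictionary WHERE raw_text = ?",
        (raw_text,),
    ).fetchone()
    return row[0] if row else None
```

With this layout, correcting a user-reported annotation on one record is a single insert into `record_override` and leaves the dictionary entry used by every other record untouched.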

gtsueng commented 7 months ago

Note, due to outages with the PubTator raw text annotation API endpoint, we will shift away from PubTator in order to avoid delays in our builds. We have tested Text2Term and found differences of ~0.038% for mapping terms to NCBI Taxonomy.

gtsueng commented 6 months ago

The ask: The normalized species, healthCondition, and infectiousAgent values are now on staging. Can we move them to production?

hartwickma commented 5 months ago

Thank you for the presentation on this issue from Scripps on 7 January 2024. The outcome from this approach sounds promising. OK to move to production.

Please note: this issue seems to have substantial overlap with #110. If these are separate issues, please provide information clarifying how they are distinct.

gtsueng commented 5 months ago

Thanks @hartwickma, @sudavenk, @rshabman, @lisa-mml. To clarify, this issue covered the standardization of species, infectiousAgent, and healthCondition text values ingested from the original data source. The overlap is because this standardization process is then used when we Augment (issue #110) records that have citations but do not have values for these fields.

@DylanWelzel will move the standardized and pubtator-augmented data to Production.

gtsueng commented 5 months ago

This was moved to Production on 2024.01.29. The status of the issue has been changed to pending close out, and it will close after 1 week if there are no further concerns.