Closed flaneuse closed 4 months ago
Looks like the commonName (or whatever we decide to call it) is at least partially covered by has_broad_synonym: https://ontobee.org/ontology/NCBITaxon?iri=http://purl.obolibrary.org/obo/NCBITaxon_2697049
has_exact_synonym can then map to alternateName, split on semicolons (;).
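A minimal sketch of that split, assuming the has_exact_synonym value arrives as one semicolon-delimited string. The function name `split_synonyms` and the sample string are hypothetical and only illustrate the idea:

```python
def split_synonyms(raw: str) -> list[str]:
    """Split a semicolon-delimited synonym string into an alternateName list."""
    seen = set()
    names = []
    for part in raw.split(";"):
        name = part.strip()
        # Skip empty fragments (e.g. a trailing ';') and case-insensitive repeats.
        if name and name.lower() not in seen:
            seen.add(name.lower())
            names.append(name)
    return names


# Illustrative input only; the real export format may differ.
record = {"alternateName": split_synonyms("SARS-CoV-2; 2019-nCoV;SARS-CoV-2 ;")}
print(record)
```

Deduplicating case-insensitively is a design choice here, not something the ontology requires.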
Using PubTator to normalize species and infectiousAgent appears to be working well. Out of 50 mappings for species, PubTator was unable to find mappings for 2 entries (which makes sense, as the entries were for specific haplotypes of a strain of a species rather than the strain itself, so this is more of an issue with the original data). It mapped 2 entries to more specific records than was warranted.
For healthCondition, out of 51 mappings investigated, 7 had better entries in MeSH than what was actually mapped by PubTator. 5 were matches that weren't great due to the use of MeSH as a disease ontology (it's really not suitable for this, but everyone does it). These 5 matches are the best PubTator can do and reflect an issue with MeSH rather than PubTator.
That said, the number of values available for healthCondition was quite limited, as the original sources likely did not provide metadata for this field. One potential way to bypass this limitation is to harvest the MeSH terms from PMID records that cite the dataset. Medline/PubMed records include MeSH terms (curated by NLM biocurators), though additional checks would be needed for filtering the terms for species, infectiousAgent, and healthCondition.
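As a rough sketch of the harvesting step, the snippet below pulls MeSH descriptors out of a PubMed-style XML record. The sample XML is a trimmed, hand-written stand-in with a placeholder PMID, and `extract_mesh_terms` is a hypothetical helper. Fetching the XML and sorting the terms into species, infectiousAgent, and healthCondition are deliberately left out:

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like a PubMed XML record (placeholder PMID).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName UI="D000086382" MajorTopicYN="Y">COVID-19</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def extract_mesh_terms(xml_text):
    """Return one dict per MeSH descriptor found in each article."""
    root = ET.fromstring(xml_text.strip())
    terms = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        for desc in article.iter("DescriptorName"):
            terms.append({
                "pmid": pmid,
                "identifier": desc.get("UI"),
                "name": desc.text,
                "major": desc.get("MajorTopicYN") == "Y",
            })
    return terms


for term in extract_mesh_terms(SAMPLE_XML):
    print(term)
```

A descriptor such as Humans shows why the extra filtering matters: it belongs under species, not healthCondition.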
Per discussions during the week of 2023.10.16, PubTator annotations are being stored in a SQLite database as a table that links raw text to the corresponding ontology value. While manually overriding a 'dictionary entry' is possible, the ability to override, remove, or alter annotations at the record level will be valuable, especially for machine-generated annotations.
Additionally, PubTator can annotate raw text, so we can gradually annotate and store the content of records with no citation.pmid. Being able to store, maintain, and edit annotations at the record level will be useful in this circumstance, especially since users may report an incorrect annotation.
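One way the dictionary table and record-level overrides could fit together is sketched below. This is not the actual schema: the table names, column names, record IDs, and the `resolve` helper are all assumptions made for illustration, and the replacement ID is a placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dictionary (
    field       TEXT NOT NULL,   -- e.g. species, healthCondition
    raw_text    TEXT NOT NULL,   -- text as it appears in the source
    ontology_id TEXT NOT NULL,
    PRIMARY KEY (field, raw_text)
);
CREATE TABLE record_override (
    record_id   TEXT NOT NULL,
    field       TEXT NOT NULL,
    raw_text    TEXT NOT NULL,
    ontology_id TEXT,            -- NULL means: drop the annotation for this record
    PRIMARY KEY (record_id, field, raw_text)
);
""")

conn.execute("INSERT INTO dictionary VALUES ('species', 'mouse', 'NCBITaxon:10090')")
# Hypothetical record-level edits: one removal, one replacement (placeholder ID).
conn.execute("INSERT INTO record_override VALUES ('rec_2', 'species', 'mouse', NULL)")
conn.execute(
    "INSERT INTO record_override VALUES ('rec_3', 'species', 'mouse', 'NCBITaxon:PLACEHOLDER')"
)


def resolve(conn, record_id, field, raw_text):
    """Prefer a record-level override; otherwise fall back to the dictionary."""
    row = conn.execute(
        "SELECT ontology_id FROM record_override "
        "WHERE record_id = ? AND field = ? AND raw_text = ?",
        (record_id, field, raw_text),
    ).fetchone()
    if row is not None:
        return row[0]  # may be None, meaning the annotation was removed
    row = conn.execute(
        "SELECT ontology_id FROM dictionary WHERE field = ? AND raw_text = ?",
        (field, raw_text),
    ).fetchone()
    return row[0] if row else None


for rec in ("rec_1", "rec_2", "rec_3"):
    print(rec, resolve(conn, rec, "species", "mouse"))
```

Keeping overrides in a separate table means a user-reported fix on one record never changes the shared dictionary entry.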
Note, due to outages with the PubTator raw text annotation API endpoint, we will shift away from PubTator in order to avoid delays in our builds. We have tested Text2Term and found differences of ~0.038% for mapping terms to NCBI Taxonomy.
The ask: The normalized species, healthCondition, and infectiousAgent values are now on staging. Can we move them to production?
Thank you for the presentation on this issue from Scripps on 7 January 2024. The outcome from this approach sounds promising. OK to move to production.
Please note: this issue seems to have substantial overlap with #110. If these are separate issues, please provide information clarifying how they are distinct.
Thanks @hartwickma, @sudavenk, @rshabman, @lisa-mml. To clarify, this issue covered the standardization of species, infectiousAgent, and healthCondition text values ingested from the original data source. The overlap is because this standardization process is then used when we Augment (issue #110) records that have citations but do not have values for these fields.
@DylanWelzel will move the standardized and PubTator-augmented data to Production.
This has been moved to Production on 2024.01.29. The status of the issue has been changed to pending close out and will close after 1 week if there are no further concerns about this issue.
Using the PubTator API:
species and infectiousAgents to NCBI Taxonomy
species and infectiousAgents using concept=species
healthCondition to MeSH identifiers using concept=disease
healthCondition using concept=disease.
Notes:
For species, we'll want to store name, alternateName (synonyms from NCBI Taxonomy), url (link to ontology), identifier (numeric code).
Related WBS Issue
https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/13
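The four species fields listed in the notes above (name, alternateName, url, identifier) could be held in a small structure like the one below. This is only a sketch: the class name `SpeciesAnnotation` is hypothetical, the synonyms are illustrative values, and the URL prefix is taken from the OBO PURL that appears in the Ontobee link earlier in this thread.

```python
from dataclasses import asdict, dataclass, field

# Prefix taken from the NCBITaxon PURL quoted earlier in the thread.
OBO_PREFIX = "http://purl.obolibrary.org/obo/NCBITaxon_"


@dataclass
class SpeciesAnnotation:
    name: str
    identifier: str                      # numeric NCBI Taxonomy code, kept as text
    alternateName: list[str] = field(default_factory=list)
    url: str = ""

    def __post_init__(self):
        if not self.identifier.isdigit():
            raise ValueError(f"identifier must be numeric, got {self.identifier!r}")
        # Derive the ontology link from the identifier when none is supplied.
        if not self.url:
            self.url = OBO_PREFIX + self.identifier


annotation = SpeciesAnnotation(
    name="Severe acute respiratory syndrome coronavirus 2",
    identifier="2697049",
    alternateName=["SARS-CoV-2", "2019-nCoV"],  # illustrative synonyms
)
print(asdict(annotation))
```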