gtsueng closed this issue 4 months ago
The ask: These metadata enhancements are now on staging. May we move them to Production?
Thank you for the presentation on this issue from Scripps on 7 January 2024. The outcome from this approach sounds promising. OK to move to production.
Please note: this issue seems to have substantial overlap with #90. If these are separate issues, please provide information clarifying how they are distinct.
Thanks @hartwickma, @sudavenk, @rshabman, @lisa-mml. To clarify, this issue builds upon the standardization pipeline for species, infectiousAgent, and healthCondition text values ingested from the original data source (from #90). This ticket covers the pipeline which augments metadata records that have a citation PMID but otherwise lack one of the three fields: species, infectiousAgent, or healthCondition. Note that PubTator only provides the species field, and we rely on the pipeline described in the delineation ticket to delineate between species and infectiousAgent. The delineation pipeline is also currently on staging and awaiting approval for Production. While it is technically possible to move this ticket and #90 forward to production without delineation, it would require a new build of the data.
Thank you for this clarification. NIAID has provided responses to #90. Please clarify if there are additional outstanding issues to address prior to resolving this item.
Thank you @hartwickma @rshabman @sudavenk @lisa-mml -- There are no more outstanding issues to resolve with regard to the metadata. We will move the metadata improvements to production.
This was moved to Production on 2024.01.29. The status of the issue has been changed to pending close out, and the issue will close after 1 week if there are no further concerns.
Background: NCBI’s PubTator tool is the best-in-class tool for extracting and normalizing taxonomy information from free text. This tool is regularly applied to all text in PubMed Central, and downloads of the results are available via FTP. These exports include PMIDs, the actual text that was extracted (the extracted name of the species), and the corresponding NCBI Taxonomy IDs.
Problem: Many repositories do not include or require a ‘species’ field.
Solution: Many records may lack a 'species' field, but include a ‘citation’ field. For records where a citation is available, but a species is not, we can use the PMID to pull species data from the PubTator/PMC taxonomy dumps.
Notes: To ensure that we are not getting extraneous species (for example, from reagents in the Methodology section), we will limit our inclusion criteria to only those species whose extracted name appears in the ‘name’ or ‘description’ fields of the corresponding metadata record. We can then apply our delineation heuristic to further categorize the extracted taxonomic IDs as 'species' vs 'infectiousAgent'.
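The lookup and filtering steps described above could be sketched roughly as follows. This is an illustration only, not the actual pipeline code. It assumes a simplified three-column, tab-separated export (PMID, taxonomy ID, extracted name), a record shaped like a dict with `name`, `description`, `species`, and a `citation` object holding a `pmid`, and whole-word, case-insensitive matching. The function names `build_index` and `augment_species`, the PMID, and the sample record are all hypothetical.

```python
import re
from collections import defaultdict


def build_index(rows):
    """Group (taxon_id, extracted_name) pairs by PMID.

    Assumes simplified tab-separated rows: PMID, taxonomy ID, extracted name.
    The real export layout may differ.
    """
    index = defaultdict(set)
    for row in rows:
        pmid, taxon_id, mention = row.rstrip("\n").split("\t")
        index[pmid].add((taxon_id, mention))
    return index


def augment_species(record, index):
    """Return taxonomy IDs to add to a record, or [] if it does not qualify."""
    # Only augment records that lack a species value.
    if record.get("species"):
        return []
    # A citation PMID is required to look anything up (assumed record shape).
    pmid = (record.get("citation") or {}).get("pmid")
    if not pmid:
        return []
    # Only keep species whose extracted name appears in name or description.
    text = " ".join([record.get("name", ""), record.get("description", "")])
    matches = set()
    for taxon_id, mention in index.get(pmid, ()):
        # Whole-word match (an assumption) so "rat" does not match "demonstrate".
        pattern = r"(?<!\w)" + re.escape(mention) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            matches.add(taxon_id)
    return sorted(matches)


# Hypothetical export rows for a made-up PMID; "rabbit" stands in for a
# reagent mentioned only in the paper's methods.
rows = [
    "12345\t10090\tmouse",
    "12345\t11320\tinfluenza A virus",
    "12345\t9986\trabbit",
]
index = build_index(rows)

record = {
    "name": "Transcriptome of mouse lung after Influenza A virus infection",
    "description": "RNA-seq of lung tissue collected at day 3 and day 7.",
    "citation": {"pmid": "12345"},
}
candidate_ids = augment_species(record, index)
```

In this sketch the rabbit entry is dropped because its extracted name never appears in the record's name or description. The surviving taxonomy IDs would then be handed to the delineation heuristic, which is outside the scope of this example, to be sorted into 'species' and 'infectiousAgent'.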
Related WBS task
https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/13