NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

[Augmentation] Use citation to Pubtator mappings to pull species and infectiousAgent metadata #110

Closed gtsueng closed 4 months ago

gtsueng commented 9 months ago

Background: NCBI’s PubTator tool is the best-in-class tool for extracting and normalizing taxonomy information from free text. This tool is regularly applied to all text in PubMedCentral and downloads of the results are available via FTP. These exports include PMIDs, the actual text that was extracted (the extracted name of the species), and the corresponding NCBI Taxonomy IDs.

Problem: Many repositories do not include or require a ‘species’ field.

Solution: Many records may lack a 'species' field, but include a ‘citation’ field. For records where a citation is available, but a species is not, we can use the PMID to pull species data from the PubTator/PMC taxonomy dumps.
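A minimal sketch of the lookup step, for illustration only. It assumes the PubTator species export is tab-separated with columns PMID, type, concept ID, pipe-separated mentions, and resource, and that a record stores its PMID under a hypothetical `citation.pmid` key; neither the function names nor the record layout come from the actual crawler code.

```python
import csv
from collections import defaultdict


def load_species_by_pmid(handle):
    """Build {pmid: [(taxon_id, [mentions]), ...]} from a PubTator-style TSV.

    Assumed column order: PMID, type, concept ID, mentions, resource.
    """
    species = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        pmid, _type, taxon_id, mentions = row[:4]
        species[pmid].append((taxon_id, mentions.split("|")))
    return dict(species)


def candidate_species(record, species_by_pmid):
    """Return PubTator candidates only for records with a PMID and no species."""
    if record.get("species"):
        return []  # the source already supplied a species; nothing to augment
    pmid = (record.get("citation") or {}).get("pmid")  # hypothetical field layout
    if not pmid:
        return []
    return species_by_pmid.get(str(pmid), [])
```

Records that already carry a species value are left untouched, so the lookup only fills gaps rather than overriding what the source repository provided.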

Notes: To ensure that we are not getting extraneous species (from reagents in the Methodology section), we will limit our inclusion criteria to only the species where the extracted name appears in the ‘name’ or ‘description’ fields for the corresponding metadata record. We can then apply our delineation heuristic to further categorize the extracted taxonomic IDs as 'species' vs 'infectiousAgent'.
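The inclusion criterion could be sketched roughly as follows. This is an assumption-laden illustration, not the pipeline's code: `filter_by_record_text` is a hypothetical name, candidates are assumed to be `(taxon_id, [mentions])` pairs, and case-insensitive whole-word matching is one possible reading of "appears in".

```python
import re


def filter_by_record_text(record, candidates):
    """Keep taxon IDs whose extracted mention occurs in the record's name or description.

    `candidates` is a list of (taxon_id, [mentions]) pairs (assumed shape).
    """
    text = " ".join(
        str(record.get(field) or "") for field in ("name", "description")
    ).lower()
    kept = []
    for taxon_id, mentions in candidates:
        for mention in mentions:
            # Whole-word, case-insensitive match (an assumption about "appears in").
            pattern = r"(?<!\w)" + re.escape(mention.lower()) + r"(?!\w)"
            if re.search(pattern, text):
                kept.append(taxon_id)
                break  # one matching mention is enough for this taxon
    return kept
```

A species that PubTator found only elsewhere in the paper (for example, a reagent host named in the methods) is dropped because its mention never shows up in the record's own name or description. The surviving IDs would then go to the delineation step, which is not sketched here.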

Related WBS task

https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/13

gtsueng commented 5 months ago

The ask: These metadata enhancements are now on staging. May we move them to Production?

hartwickma commented 5 months ago

Thank you for the presentation on this issue from Scripps on 7 January 2024. The outcome from this approach sounds promising. OK to move to production.

Please note: this issue seems to have substantial overlap with #90. If these are separate issues, please provide information clarifying how they are distinct.

gtsueng commented 5 months ago

Thanks @hartwickma, @sudavenk, @rshabman, @lisa-mml. To clarify, this issue builds upon the standardization pipeline for species, infectiousAgent, and healthCondition text values ingested from the original data source (from #90). This ticket covers the pipeline which augments metadata records that have a citation PMID, but otherwise lack one of the three fields: species, infectiousAgent, or healthCondition. Note that PubTator only provides a species field, and we are reliant on the pipeline described in the delineation ticket in order to delineate between species and infectiousAgent. The delineation pipeline is also currently on staging and awaiting approval for Production. While it is technically possible to move this ticket and #90 forward to production without delineation, it would require a new build of the data.

hartwickma commented 5 months ago

Thank you for this clarification. NIAID has provided responses to #90. Please clarify if there are additional outstanding issues to address prior to resolving this item.

gtsueng commented 5 months ago

Thank you @hartwickma @rshabman @sudavenk @lisa-mml -- There are no more outstanding issues to resolve with regard to the metadata. We will move the metadata improvements to production.

gtsueng commented 5 months ago

This was moved to Production on 2024.01.29. The status of the issue has been changed to pending close-out, and the issue will close after 1 week if there are no further concerns.