NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

[Augmentation] Use EXTRACT to perform species and infectiousAgent NER #120

Open gtsueng opened 7 months ago

gtsueng commented 7 months ago

The PubTator raw text annotation API would likely provide better results than the EXTRACT API. However, it is subject to frequent outages, which hampers our ability to move things forward in a timely fashion.

Rather than attempt to push the names and descriptions of roughly 2.8 million records through the PubTator raw text annotation endpoint to identify species and health conditions, we will use the EXTRACT API from the Jensen lab in conjunction with Text2Term. Low-scoring matches from Text2Term will be helpful for reducing errors introduced by EXTRACT.
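To make the cross-check concrete, here is a minimal sketch of how EXTRACT hits could be screened with Text2Term scores. The `ExtractHit` class, the `flag_suspect_hits` helper, the lowercase-keyed score dictionary, and the `0.8` cutoff are all hypothetical names and values chosen for illustration; they are not taken from the notebooks and the real threshold would need to be tuned against the results.

```python
from dataclasses import dataclass


@dataclass
class ExtractHit:
    """One entity tagged by EXTRACT (hypothetical container for illustration)."""
    text: str        # the span EXTRACT matched in the name/description
    identifier: str  # the identifier EXTRACT assigned to that span


def flag_suspect_hits(hits, text2term_scores, min_score=0.8):
    """Split EXTRACT hits into (kept, suspect) using Text2Term scores.

    text2term_scores maps the lowercased matched text to the best Text2Term
    mapping score for it. min_score is an assumed cutoff, not a recommended
    value. Hits with no Text2Term score at all are treated as suspect.
    """
    kept, suspect = [], []
    for hit in hits:
        score = text2term_scores.get(hit.text.lower(), 0.0)
        if score >= min_score:
            kept.append(hit)
        else:
            suspect.append(hit)
    return kept, suspect
```

Suspect hits would then be reviewed or dropped rather than written to `species` or `infectiousAgent` directly.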

Pulling data from EXTRACT: https://github.com/gtsueng/nde_misc/blob/main/EXTRACT_check/JensenExtractTest.ipynb
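For readers who do not want to open the notebook, the request and parsing steps might look roughly like the sketch below. Everything specific here is an assumption to verify against the notebook and the Jensen lab documentation: the `GetEntities` endpoint URL, the `document`, `entity_types`, and `format` parameters, the type codes `-2` (organisms) and `-26` (diseases), and the three-column TSV layout (matched text, entity type, identifier). The helper names are hypothetical, and the network call itself is left out so the sketch stays self-contained.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Jensen lab tagger; confirm before use.
EXTRACT_URL = "https://tagger.jensenlab.org/GetEntities"


def build_extract_url(text, entity_types=("-2", "-26")):
    """Build a GET URL asking EXTRACT to tag `text` and return TSV.

    The entity type codes are assumptions: -2 for organisms, -26 for diseases.
    """
    params = {
        "document": text,
        "entity_types": " ".join(entity_types),  # encoded as '+'-separated
        "format": "tsv",
    }
    return f"{EXTRACT_URL}?{urlencode(params)}"


def parse_extract_tsv(body):
    """Parse a TSV response into dicts, assuming text/type/identifier columns."""
    hits = []
    for line in body.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        hits.append({"text": parts[0], "entity_type": parts[1], "identifier": parts[2]})
    return hits
```

With roughly 2.8 million records to process, requests would also need batching, retries, and rate limiting, none of which is shown here.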

Inspecting the quality of the results: https://github.com/gtsueng/nde_misc/blob/main/text2term_test/Text2Term%20Test.ipynb

Heuristics for improving accuracy based on the results of the EXTRACT text mapping

Text2Term is also weaker at mapping terms written with an abbreviated genus (g. species, e.g. E. coli): a candidate with the correct species name but the wrong genus can score higher than the correct term. To address this: