[Augmentation] Use EXTRACT to perform species and infectiousAgent NER

The PubTator raw text annotation API would likely provide better results than the EXTRACT API. However, it is subject to frequent outages, which impact our ability to move things forward in a timely fashion.

Rather than push the names and descriptions of roughly 2.8 million records through the PubTator raw text annotation endpoint to identify species and health conditions, we will use the EXTRACT API from the Jensen lab in conjunction with Text2Term. Low-scoring matches from Text2Term will help reduce errors introduced by EXTRACT.
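A minimal sketch of how the two tools could be combined: keep an EXTRACT hit only when Text2Term maps it with a high enough score, and set the rest aside for review. The function name, the plain-dict input shape, and the default cutoff are assumptions for illustration. They are not taken from the linked notebooks, and the 0.95 default only echoes the score used in the evaluation further down.

```python
from typing import Dict, List, Tuple


def split_by_mapping_score(
    extract_terms: List[str],
    text2term_scores: Dict[str, float],
    min_score: float = 0.95,  # assumed cutoff, tune against real results
) -> Tuple[List[str], List[str]]:
    """Split EXTRACT terms into (kept, flagged) using Text2Term scores.

    `text2term_scores` is a hypothetical lookup of term -> best mapping
    score. Terms that Text2Term did not map at all are treated as score 0.
    """
    kept: List[str] = []
    flagged: List[str] = []
    for term in extract_terms:
        score = text2term_scores.get(term, 0.0)
        if score >= min_score:
            kept.append(term)
        else:
            flagged.append(term)
    return kept, flagged
```

Flagged terms are returned rather than dropped so that low-scoring matches can still be inspected by hand.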

Pulling data from EXTRACT: https://github.com/gtsueng/nde_misc/blob/main/EXTRACT_check/JensenExtractTest.ipynb

Inspecting the quality of the results: https://github.com/gtsueng/nde_misc/blob/main/text2term_test/Text2Term%20Test.ipynb

Heuristics for improving accuracy based on the results of the EXTRACT text mapping

EXTRACT has high sensitivity but low precision when extracting terms with fewer than 5 characters, due to the abundance of abbreviations.
For 5-letter terms:
- At a score of >0.95, the number of true positives is 91/127
- Of those 91 true positive terms, about 41 are part of a taxonomic phrase (i.e., only the genus part or only the species part of a taxonomic name)
- This means that only about 41 of these terms would be missed if the length threshold for EXTRACT were >5 characters; the remaining true positives would likely be captured through the whole term
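The length threshold discussed above can be sketched as a simple filter. The helper name and the list-of-strings input are assumptions for illustration; the default keeps only terms longer than 5 characters, matching the ">5" threshold considered here.

```python
from typing import Iterable, List


def drop_short_terms(terms: Iterable[str], min_chars: int = 5) -> List[str]:
    """Keep only terms strictly longer than `min_chars` characters.

    Surrounding whitespace is ignored when measuring length, so that
    padding does not let a short abbreviation through.
    """
    return [term for term in terms if len(term.strip()) > min_chars]


# 5-letter terms such as "mouse" are dropped at the default threshold,
# but a full taxonomic name containing a short word is kept whole.
print(drop_short_terms(["HIV", "mouse", "Escherichia coli"]))
```

Lowering `min_chars` to 4 would keep the 5-letter terms, at the cost of the lower precision reported for them.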

Text2Term is also weaker at mapping terms written in the abbreviated form "G. species", because a mapping with the correct species name but the wrong genus can score higher. To address this:

Identify such terms using regex (r"\b[A-Z]\.\s[^\s]+\b")
- For these terms, only take the result if the genus letter matches the first letter of the mapped result (i.e., split the term on '.' and compare the first part)
- There will be plenty of exceptions, but this should address the majority of problematic matches
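The genus-initial check could look roughly like the sketch below. The function name is hypothetical, and the pattern is a variant of the regex above with the period escaped (so it matches a literal '.') and a capture group added for the genus letter.

```python
import re

# Abbreviated binomial such as "E. coli": one capital letter, a literal
# period, whitespace, then the species part.
ABBREVIATED_BINOMIAL = re.compile(r"\b([A-Z])\.\s[^\s]+\b")


def accept_mapping(term: str, mapped_label: str) -> bool:
    """Return True if a Text2Term mapping should be kept for `term`.

    Terms that are not abbreviated binomials pass through unchanged.
    For abbreviated binomials, the mapped label must start with the
    same letter as the abbreviated genus.
    """
    match = ABBREVIATED_BINOMIAL.search(term)
    if match is None:
        return True  # heuristic does not apply
    genus_initial = match.group(1)
    return mapped_label.strip()[:1].upper() == genus_initial


print(accept_mapping("E. coli", "Escherichia coli"))  # genus initial matches
print(accept_mapping("E. coli", "Balantidium coli"))  # wrong genus, rejected
```

As noted above, this will not catch every case: a wrong genus that happens to share the initial (for example "Entamoeba coli" for "E. coli") still passes.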

NIAID-Data-Ecosystem / nde-crawlers, issue #120
