NCATSTranslator / Text-Mining-Provider-Roadmap

Roadmap and issue tracking for the NCATS Translator Text Mining Provider

Mappings to abstractions in the gene/gene-product hierarchy #81

Open bill-baumgartner opened 3 years ago

bill-baumgartner commented 3 years ago

Protein mentions that the Text Mining Provider infrastructure automatically identifies in text are typically annotated to a species-non-specific class from the Protein Ontology (when one is available). Mapping to a more abstract concept has been shown to greatly improve inter-annotator agreement in the manual annotation task, because determining the correct species for a protein mention is often difficult (even for humans). In practice, however, e.g. in the text-mined assertion KG provided by the Text Mining Provider, it has become evident that these abstractions create a disconnect between the contents of the text-mined assertion KG and the rest of the Translator ecosystem, which uses species-specific identifiers. This is a problem that needs to be addressed.

A related problem involves mapping from a gene in a query to the protein encoded by that gene. Distinguishing between gene and protein mentions in text is also difficult (even for humans), since it is often unclear whether the author is referring to the gene or the protein. The text-mined assertion KG conflates the two concepts, and although it uses identifiers from the Protein Ontology, the mentions should be considered as representing the biolink:GeneOrGeneProduct class. Note: This issue may be addressed by a fix-it session in the upcoming May relay.

Both of the issues described above contribute to the zero hits returned for the query described in https://github.com/NCATSTranslator/testing/issues/28. In order to successfully mine assertions from the text-mined assertion KG for that query (Chemical substances that "down regulate" STK11), the following mappings are required:

In short, replacing HGNC:11389 with PR:000015740 in the query should result in a non-empty result set from the text-mined assertion KG.
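As a rough illustration of that substitution, here is a minimal Python sketch that rewrites node identifiers in a query before it is sent to the text-mined assertion KG. It is not the Text Mining Provider's implementation. The query layout (a dict with `nodes`, each carrying an `ids` list), the function name `rewrite_query_ids`, and the lookup table `GENE_TO_ABSTRACT_PROTEIN` are assumptions made for the example; the only mapping taken from this issue is HGNC:11389 to PR:000015740. The function also returns the substitutions it made, so the caller can report what was changed.

```python
from copy import deepcopy

# Hypothetical lookup table (assumption). The single entry is the pair named
# in this issue; a real table would come from a mapping resource.
GENE_TO_ABSTRACT_PROTEIN = {"HGNC:11389": "PR:000015740"}


def rewrite_query_ids(query_graph, mapping=GENE_TO_ABSTRACT_PROTEIN):
    """Return a copy of query_graph with mapped identifiers replaced.

    Assumes (hypothetically) that query_graph looks like
    {"nodes": {key: {"ids": [...], ...}, ...}}. Nodes without "ids" are
    left untouched. Also returns a list of (node_key, original, replacement)
    tuples describing every substitution that was applied.
    """
    rewritten = deepcopy(query_graph)  # never mutate the caller's query
    substitutions = []
    for node_key, node in rewritten.get("nodes", {}).items():
        ids = node.get("ids")
        if not ids:
            continue
        new_ids = []
        for curie in ids:
            replacement = mapping.get(curie, curie)
            if replacement != curie:
                substitutions.append((node_key, curie, replacement))
            if replacement not in new_ids:  # avoid duplicate identifiers
                new_ids.append(replacement)
        node["ids"] = new_ids
    return rewritten, substitutions


# Example query: one pinned gene node and one unpinned chemical node.
query = {
    "nodes": {
        "n0": {"ids": ["HGNC:11389"]},
        "n1": {"categories": ["biolink:ChemicalSubstance"]},
    }
}
rewritten, substitutions = rewrite_query_ids(query)
```

Returning the substitution list alongside the rewritten query is a design choice for the sketch: it keeps the identifier swap visible rather than silent.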

mikebada commented 3 years ago

We need to somehow make sure that the queriers are aware that this is being done, i.e., that we'll be returning results for both genes and gene products corresponding to the inputted entity and also for potentially all 1:1 orthologs of the inputted species-specific entity.