NCATSTranslator / Text-Mining-Provider-Roadmap

Roadmap and issue tracking for the NCATS Translator Text Mining Provider

Add attribute to indicate agreement with SemMedDB for targeted assertions #96

Open bill-baumgartner opened 1 year ago

bill-baumgartner commented 1 year ago

Background

At the June 2022 Relay, consortium members expressed interest in having the Text Mining Provider KP indicate when its targeted assertions match those provided by SemMedDB. Agreement between the two resources can be considered at three levels. The two resources might agree at the sentence level, i.e., both resources mined the same assertion from the same sentence of a given document. They might agree at the document level, i.e., both resources mined the same assertion from the same document, but not necessarily from the same sentence in that document. Finally, the resources might agree only at the assertion level, i.e., one resource mined an assertion from one document while the other resource mined the same assertion from a different document altogether.
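To make the three levels concrete, here is a minimal sketch of how the strongest level of agreement for a single TMKP assertion might be computed. The `Evidence` record, the `Agreement` enum, and the `agreement_level` function are hypothetical names used only for illustration. The sketch also assumes that both resources can be reduced to comparable subject/predicate/object triples and a shared sentence identifier, which is not guaranteed in practice.

```python
from enum import Enum
from typing import Iterable, NamedTuple


class Evidence(NamedTuple):
    """One mined assertion plus where it was found (hypothetical layout)."""
    subject_curie: str
    predicate: str
    object_curie: str
    pmid: str          # document the assertion was mined from
    sentence_id: str   # assumed to be comparable across both resources


class Agreement(Enum):
    NONE = 0       # SemMedDB does not contain the assertion
    ASSERTION = 1  # same assertion, different document
    DOCUMENT = 2   # same assertion, same document, different sentence
    SENTENCE = 3   # same assertion, same document, same sentence


def agreement_level(tmkp: Evidence, semmed: Iterable[Evidence]) -> Agreement:
    """Return the strongest level at which any SemMedDB record agrees."""
    triple = (tmkp.subject_curie, tmkp.predicate, tmkp.object_curie)
    best = Agreement.NONE
    for ev in semmed:
        if (ev.subject_curie, ev.predicate, ev.object_curie) != triple:
            continue  # a different assertion cannot agree at any level
        if ev.pmid != tmkp.pmid:
            level = Agreement.ASSERTION
        elif ev.sentence_id != tmkp.sentence_id:
            level = Agreement.DOCUMENT
        else:
            level = Agreement.SENTENCE
        if level.value > best.value:
            best = level
    return best
```

Returning only the strongest level keeps the result to a single value per assertion. If counts per level are wanted instead, the same loop could tally them.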

Some relevant links:

Proposed EPC metadata

We will return agreement information as an attribute in the EPC metadata for each targeted assertion. The attribute will be at the assertion level, at least initially, and will have nested fields to indicate the following:

Note that the proposed EPC fields above are PubMed-centric because SemMedDB consists of assertions mined from PubMed. TMKP contains assertions mined from PubMed as well as other sources.

Processing SemMedDB

To populate the proposed EPC metadata above, we will develop a pipeline to process SemMedDB. Expected challenges include mapping from the entity namespace used by SemMedDB (UMLS, I believe) to the OBO namespace used by TMKP. Previous efforts within the Translator community to align SemMedDB with Biolink may be of help. This notebook illustrates many of the modeling decisions that need to be made in order to make use of SemMedDB within the Translator ecosystem.

The output of the pipeline will be a database table with the following fields: | PMID | sentence_id | subject CURIE | predicate | object CURIE | where,

We note that SemMedDB has periodic releases, so the pipeline will be run whenever a new release is made available.
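As a rough illustration of the pipeline output described above, the sketch below loads SemMedDB-style rows into a table with the five listed fields, using SQLite purely for convenience. The table name `semmed_assertions`, the function `build_table`, and the `cui_to_curie` lookup are all hypothetical. The lookup stands in for whatever UMLS-to-OBO mapping is eventually chosen. Predicate alignment with Biolink is not addressed here.

```python
import sqlite3
from typing import Dict, Iterable, Tuple

# Hypothetical raw row: (pmid, sentence_id, subject_cui, predicate, object_cui)
RawRow = Tuple[str, str, str, str, str]


def build_table(conn: sqlite3.Connection,
                rows: Iterable[RawRow],
                cui_to_curie: Dict[str, str]) -> int:
    """Map UMLS CUIs to CURIEs and store the rows; return the number stored."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS semmed_assertions (
               pmid          TEXT NOT NULL,
               sentence_id   TEXT NOT NULL,
               subject_curie TEXT NOT NULL,
               predicate     TEXT NOT NULL,
               object_curie  TEXT NOT NULL
           )"""
    )
    mapped = []
    for pmid, sentence_id, subj_cui, predicate, obj_cui in rows:
        subj = cui_to_curie.get(subj_cui)
        obj = cui_to_curie.get(obj_cui)
        if subj is None or obj is None:
            continue  # skip rows whose entities have no mapping (an assumption)
        mapped.append((pmid, sentence_id, subj, predicate, obj))
    conn.executemany(
        "INSERT INTO semmed_assertions VALUES (?, ?, ?, ?, ?)", mapped
    )
    conn.commit()
    return len(mapped)
```

Because the function takes the rows and the mapping as arguments, it could be rerun against each new SemMedDB release, writing to a fresh database each time.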