Add attribute to indicate agreement with SemMedDB for targeted assertions

Background

At the June 2022 Relay, consortium members expressed interest in having the Text Mining Provider KP indicate when its targeted assertions match those provided by SemMedDB. Agreement between the two resources can be considered at several levels. The two resources might agree at the sentence level, i.e., both resources mined an assertion from the same sentence of a given document. They might agree at the document level, i.e., both resources mined an assertion from the same document, but not necessarily from the same sentence in that document. Finally, the resources might agree only at the assertion level, i.e., one resource mined an assertion from one document while the other resource mined the same assertion from a different document altogether.
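The three levels described above can be sketched as a small classifier. This is only an illustration: the `Agreement` enum, the `agreement_level` function, and the representation of evidence as `(document_id, sentence_id)` pairs are assumptions made for the example, not part of TMKP or SemMedDB.

```python
from enum import Enum


class Agreement(Enum):
    SENTENCE = "sentence"    # same sentence of the same document
    DOCUMENT = "document"    # same document, different sentences
    ASSERTION = "assertion"  # same assertion, different documents
    NONE = "none"            # one resource has no evidence for the assertion


def agreement_level(tmkp_evidence, semmed_evidence):
    """Return the strongest level at which two evidence sets agree.

    Both arguments are sets of (document_id, sentence_id) pairs that
    support the same assertion (hypothetical representation).
    """
    if not tmkp_evidence or not semmed_evidence:
        return Agreement.NONE
    # A shared (document, sentence) pair means sentence-level agreement.
    if tmkp_evidence & semmed_evidence:
        return Agreement.SENTENCE
    # Otherwise check whether any document is shared.
    tmkp_docs = {doc for doc, _ in tmkp_evidence}
    semmed_docs = {doc for doc, _ in semmed_evidence}
    if tmkp_docs & semmed_docs:
        return Agreement.DOCUMENT
    # Both resources assert it, but from different documents.
    return Agreement.ASSERTION
```

Sentence-level agreement implies document-level agreement, which in turn implies assertion-level agreement, so the function reports only the strongest level that holds.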

Some relevant links:

Kilicoglu et al. 2012
SemMedDB download (requires a UMLS Terminology Services (UTS) account)

Proposed EPC metadata

We will return agreement information in the form of an attribute in the EPC metadata for each targeted assertion. The attribute will be at the assertion level, at least initially, and will have nested fields to indicate the following:

[boolean] true if SemMedDB contains this type of assertion, i.e., if one might expect this kind of assertion to also appear in SemMedDB; false otherwise. If false, the SemMedDB and shared counts below will be zero.
[integer] count of this assertion reported by TMKP (will include PubMed and other sources)
[integer] count of this assertion reported by SemMedDB
[integer] count of PubMed records from which TMKP mined this assertion
[integer] count of PubMed records from which SemMedDB mined this assertion
[integer or %] number of PubMed records from which both TMKP and SemMedDB mined this assertion
[integer] count of sentences in PubMed records from which TMKP mined this assertion
[integer] count of sentences in PubMed records from which SemMedDB mined this assertion
[integer or %] number of sentences in PubMed records from which both TMKP and SemMedDB mined this assertion

Note that the proposed EPC fields above are PubMed-centric because SemMedDB consists of assertions mined from PubMed. TMKP contains assertions mined from PubMed as well as other sources.
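As a rough sketch of how the nested fields might be computed for one assertion, the example below derives the counts from two evidence sets. Everything here is an assumption made for illustration: the function name, the field names, the `PMID:` prefix used to recognize PubMed records, and the reading of "count of this assertion" as the number of distinct supporting sentences.

```python
def build_agreement_attribute(tmkp_evidence, semmed_evidence, semmed_covers_type):
    """Build the nested agreement fields for a single assertion.

    Evidence is an iterable of (document_id, sentence_id) pairs.
    PubMed records are assumed to have ids starting with "PMID:".
    All field names are hypothetical.
    """
    tmkp = set(tmkp_evidence)
    semmed = set(semmed_evidence)

    # Restrict to PubMed evidence for the PubMed-centric fields.
    tmkp_pubmed = {(d, s) for d, s in tmkp if d.startswith("PMID:")}
    semmed_pubmed = {(d, s) for d, s in semmed if d.startswith("PMID:")}

    tmkp_docs = {d for d, _ in tmkp_pubmed}
    semmed_docs = {d for d, _ in semmed_pubmed}

    return {
        "semmeddb_covers_assertion_type": semmed_covers_type,
        "tmkp_evidence_count": len(tmkp),
        "semmeddb_evidence_count": len(semmed),
        "tmkp_pubmed_document_count": len(tmkp_docs),
        "semmeddb_pubmed_document_count": len(semmed_docs),
        "shared_pubmed_document_count": len(tmkp_docs & semmed_docs),
        "tmkp_pubmed_sentence_count": len(tmkp_pubmed),
        "semmeddb_pubmed_sentence_count": len(semmed_pubmed),
        "shared_pubmed_sentence_count": len(tmkp_pubmed & semmed_pubmed),
    }


# Example: TMKP has three PubMed sentences and one from another source.
tmkp_example = [("PMID:1", "a"), ("PMID:1", "b"), ("PMID:2", "c"), ("PMC:9", "d")]
semmed_example = [("PMID:1", "a"), ("PMID:3", "e")]
attribute = build_agreement_attribute(tmkp_example, semmed_example, True)
```

The shared counts are returned as integers here; a percentage, if preferred, could be derived from them and the per-resource counts.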

Processing SemMedDB

To populate the proposed EPC metadata above, we will develop a pipeline to process SemMedDB. Expected challenges include mapping from the entity namespace used by SemMedDB (UMLS, I believe) to the OBO namespace used by TMKP. Previous efforts within the Translator community to align SemMedDB with Biolink may be of help. This notebook illustrates many of the modeling decisions that need to be made in order to use SemMedDB within the Translator ecosystem.
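A minimal sketch of the namespace mapping step is shown here, assuming SemMedDB entities are UMLS CUIs and that lookup tables from CUI to OBO CURIE and from SemMedDB predicate to Biolink predicate are available. The table contents, the function name, and the record layout are illustrative assumptions; a real pipeline would load mappings from curated resources rather than hard-code them.

```python
# Illustrative lookup tables (hypothetical; not taken from any release).
UMLS_TO_OBO = {
    "C0004057": "CHEBI:15365",  # aspirin (illustrative entry)
    "C0018681": "HP:0002315",   # headache (illustrative entry)
}
SEMMED_TO_BIOLINK = {
    "TREATS": "biolink:treats",
    "CAUSES": "biolink:causes",
}


def normalize_predication(pmid, subject_cui, predicate, object_cui):
    """Map one SemMedDB-style assertion into OBO and Biolink identifiers.

    Returns None when any component has no mapping, so that unmapped
    assertions can be counted or logged instead of silently guessed.
    """
    subject_curie = UMLS_TO_OBO.get(subject_cui)
    object_curie = UMLS_TO_OBO.get(object_cui)
    biolink_predicate = SEMMED_TO_BIOLINK.get(predicate)
    if None in (subject_curie, object_curie, biolink_predicate):
        return None
    return {
        "pmid": f"PMID:{pmid}",
        "subject_curie": subject_curie,
        "predicate": biolink_predicate,
        "object_curie": object_curie,
    }


record = normalize_predication("12345", "C0004057", "TREATS", "C0018681")
```

Returning `None` for unmapped components is one possible design choice; it keeps the comparison with TMKP restricted to assertions whose identifiers could be aligned.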

The pipeline is expected to reduce each SemMedDB assertion to the following fields:

PMID = PubMed ID
sentence_id = a hash of the sentence. Currently we use a SHA256 hash of documentId + documentZone + entityId1 + entitySpan1 + entityId2 + entitySpan2 + sentenceText. We may need to reconsider this based on the information available in SemMedDB.
subject CURIE = the CURIE of the subject entity in the OBO namespace
predicate = the predicate in the Biolink namespace
object CURIE = the CURIE of the object entity in the OBO namespace
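The sentence_id described above could be computed along these lines. This is a sketch under stated assumptions: the parts are concatenated directly in the listed order, encoded as UTF-8, and returned as a hex digest. The function name, the argument names, and the span format used in the example are hypothetical, and the actual TMKP implementation may differ.

```python
import hashlib


def sentence_id(document_id, document_zone, entity_id_1, entity_span_1,
                entity_id_2, entity_span_2, sentence_text):
    """Return a SHA256 hex digest over the concatenated sentence fields.

    Assumes plain string concatenation with no separator and UTF-8
    encoding; both are assumptions, not confirmed details.
    """
    parts = [
        document_id,
        document_zone,
        entity_id_1,
        entity_span_1,
        entity_id_2,
        entity_span_2,
        sentence_text,
    ]
    joined = "".join(str(part) for part in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Example with made-up values; the "start|end" span format is hypothetical.
example_id = sentence_id(
    "PMID:12345", "abstract",
    "CHEBI:15365", "0|7",
    "HP:0002315", "15|23",
    "Aspirin treats headache.",
)
```

Because the hash includes entity identifiers and spans, two resources would produce the same sentence_id only if they agree on those values as well as on the sentence text, which is one reason the definition may need revisiting for SemMedDB.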

We note that SemMedDB has periodic releases, so the pipeline will be run whenever a new release is made available.

NCATSTranslator / Text-Mining-Provider-Roadmap