Standardize how metadata supporting text mined results is represented

Translator uses two main sources for text-mined knowledge: TMKP and SemMedDB.

These sources want to report metadata supporting a text-mined edge, including the sentence(s) mined, metrics/scores reflecting confidence in accurate extraction of concepts and relationships from each sentence, and information about the context in which the sentence is found (e.g. what section of an article).

Often, a given edge is supported by mining of multiple sentences/spans of text - each of which comes with its own set of such metadata.

Precise representation of this information requires a way to group metadata for each NLP-based sentence analysis together in a TRAPI message.

The modeling team worked with TMKP to define a way to do this using Biolink StudyResult objects, and leveraging nested Attributes in the TRAPI structure. Details and examples of this model are here.
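To make the grouping idea concrete, here is a minimal Python sketch of an edge carrying one nested attribute per mined sentence. It is an illustration under stated assumptions, not the agreed model: the helper names (`make_attribute`, `make_study_result`), the `example:` identifiers, the sample sentences and PMIDs, and the specific `attribute_type_id` values are placeholders chosen for this example. The linked details and examples remain the authority on the actual slots.

```python
import json


def make_attribute(type_id, value, nested=None):
    """Build a TRAPI-style Attribute dict; nested attributes are optional."""
    attr = {"attribute_type_id": type_id, "value": value}
    if nested:
        attr["attributes"] = nested
    return attr


def make_study_result(result_id, sentence, section, score, pmid):
    """Group all metadata for one mined sentence under one organizing attribute.

    The attribute_type_id values below are assumptions for illustration.
    """
    return make_attribute(
        "biolink:has_supporting_study_result",
        result_id,
        nested=[
            make_attribute("biolink:supporting_text", sentence),
            make_attribute("biolink:supporting_text_located_in", section),
            make_attribute("biolink:extraction_confidence_score", score),
            make_attribute("biolink:publications", pmid),
        ],
    )


# Two sentences supporting the same edge, each with its own metadata set.
# All values here are made up for the example.
edge_attributes = [
    make_study_result("example:sr1", "Drug X improved symptoms of disease Y.",
                      "abstract", 0.97, "PMID:0000001"),
    make_study_result("example:sr2", "Disease Y was treated with drug X.",
                      "title", 0.88, "PMID:0000002"),
]

print(json.dumps(edge_attributes, indent=2))
```

Because each sentence's score, location and publication sit inside their own nested list, a consumer never has to guess which score belongs to which sentence.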

This modeling structure is reflected in how edge metadata is returned in the ARAX-ARS interface. Below I show a subset of the metadata on a 'is treated_by' edge from TMKP, which shows up in the KG supporting ARAGORN's 'Netarsudil' result for this query:

However, other KPs that provide text-mined edges from SemMedDB (BTE, RTX-KG2) return less detailed metadata . . . (from https://arax.ncats.io/?r=623df483-e0c8-45b5-80bb-38f15627c93c, specifically a 'treated by' edge in ARAGORN's TOFACITINIB result)

. . . and when more detail is provided, a very different structure is used. In the rtx2-semmed example below, sentence text and pub date are stuffed next to the PMID in the publications attribute for convenience, and then duplicated in a richer JSON format alongside score and date info in a separate bts:sentence attribute: (from https://arax.transltr.io/?r=9360c5c9-cb10-47d2-9910-535fc4cbbf05, specifically a 'treats' edge in ARAGORN's biguanide result)

In summary, SemMedDB edge metadata describing source publications and sentences, along with metadata about these (dates, scores, etc.), is inconsistently provided and represented across KPs, and does not use the same detailed structure as TMKP.

The TMKP model provides rich metadata using StudyResult objects as organizing nodes in a two-level structure.
In the bte-semmeddb example, the KP doesn't include sentence text or other metadata at all.
In the rtx2-semmed example, sentence text and pub date are stuffed next to the PMID in the publications attribute for convenience, and then duplicated in a richer JSON format alongside score and date info in a separate bts:sentence attribute. (Each object in this blob is analogous to a StudyResult in the TMKP model.)

We should try to use a similar structure in all cases, aligning where possible with that defined by TMKP.
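As a rough sketch of what that alignment could look like for a SemMedDB-based KP, the Python below converts a per-PMID sentence blob into one nested, StudyResult-style attribute per sentence. Everything specific here is an assumption for illustration: the input shape (a dict keyed by PMID with `sentence`, `publication date`, `subject score` and `object score` keys), the `example:` type ids and identifiers, and the function name `sentences_to_study_results`. The real bts:sentence payload and the agreed Biolink slots may differ.

```python
# Hypothetical mapping from assumed blob keys to attribute type ids.
# The "example:" ids are placeholders, not real Biolink slots.
FIELD_MAP = {
    "sentence": "biolink:supporting_text",
    "publication date": "example:publication_date",
    "subject score": "example:subject_score",
    "object score": "example:object_score",
}


def sentences_to_study_results(sentence_blob):
    """Turn a {pmid: {...}} blob into a list of nested attributes, one per PMID."""
    results = []
    for pmid, info in sorted(sentence_blob.items()):
        # Start each group with the publication the sentence came from.
        nested = [{"attribute_type_id": "biolink:publications", "value": pmid}]
        for key, type_id in FIELD_MAP.items():
            if key in info:  # skip fields this KP did not provide
                nested.append({"attribute_type_id": type_id, "value": info[key]})
        results.append({
            "attribute_type_id": "biolink:has_supporting_study_result",
            # Made-up identifier scheme for the organizing node.
            "value": "example:study_result_" + pmid.split(":")[-1],
            "attributes": nested,
        })
    return results


# Made-up input in the assumed shape.
example_blob = {
    "PMID:0000001": {
        "sentence": "Drug X is used in the treatment of disease Y.",
        "publication date": "2015 Mar",
        "subject score": 1000,
        "object score": 888,
    },
}

for study_result in sentences_to_study_results(example_blob):
    print(study_result["value"], len(study_result["attributes"]))
```

A KP that has no sentence text at all would still emit the publication inside the same grouping, so consumers could parse every text-mined edge with one code path.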

NCATSTranslator / ReasonerAPI

Standardize how metadata supporting text mined results is represented #399