NCATSTranslator / ReasonerAPI

NCATS Biomedical Translator Reasoners Standard API

Standardize how metadata supporting text mined results is represented #399

Open mbrush opened 1 year ago

mbrush commented 1 year ago

Translator uses two main sources for text-mined knowledge: TMKP and SemMedDB.

These sources want to report metadata supporting a text-mined edge, including the sentence(s) mined, metrics/scores reflecting confidence in accurate extraction of concepts and relationships from each sentence, and information about the context in which the sentence is found (e.g. what section of an article).

Often, a given edge is supported by mining of multiple sentences or spans of text, each of which comes with its own set of such metadata.

Precise representation of this information requires a way to group metadata for each NLP-based sentence analysis together in a TRAPI message.

The modeling team worked with TMKP to define a way to do this using Biolink StudyResult objects, and leveraging nested Attributes in the TRAPI structure. Details and examples of this model are here.
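As a rough illustration of the grouping idea (a sketch, not the actual TMKP output), the snippet below builds one parent attribute per mined sentence and nests that sentence's metadata under it via the TRAPI `attributes` field. The helper names, the example IDs and sentences, and the specific `biolink:` type ids are assumptions chosen for illustration; the linked model documentation is the authority on the real identifiers.

```python
from typing import Any, Dict, List, Optional


def make_attribute(type_id: str, value: Any,
                   nested: Optional[List[Dict[str, Any]]] = None) -> Dict[str, Any]:
    """Build a TRAPI-style Attribute dict, optionally with nested sub-attributes."""
    attribute: Dict[str, Any] = {"attribute_type_id": type_id, "value": value}
    if nested:
        attribute["attributes"] = nested
    return attribute


def study_result_for_sentence(result_id: str, pmid: str, sentence: str,
                              score: float, section: str) -> Dict[str, Any]:
    """Group all metadata for one mined sentence under a single parent attribute."""
    return make_attribute(
        "biolink:has_supporting_study_result",  # assumed type id for the parent
        result_id,
        nested=[
            # The nested type ids below are illustrative assumptions.
            make_attribute("biolink:publications", pmid),
            make_attribute("biolink:supporting_text", sentence),
            make_attribute("biolink:extraction_confidence_score", score),
            make_attribute("biolink:supporting_text_located_in", section),
        ],
    )


# Two sentences supporting the same edge: one grouped attribute per sentence,
# so each score and section stays tied to the sentence it describes.
edge_attributes = [
    study_result_for_sentence("example:result-1", "PMID:0000001",
                              "Drug X reduced symptoms of disease Y.", 0.97, "abstract"),
    study_result_for_sentence("example:result-2", "PMID:0000002",
                              "Disease Y was treated with drug X.", 0.88, "title"),
]
```

The point of the nesting is that a consumer never has to guess which score or section belongs to which sentence, which a flat list of edge attributes cannot guarantee.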

This modeling structure is reflected in how edge metadata is returned in the ARAX-ARS interface. Below I show a subset of the metadata on an 'is treated_by' edge from TMKP, which shows up in the KG supporting ARAGORN's 'Nutarsudil' result for this query: (three screenshots)

However, other KPs that provide text-mined edges from SemMedDB (BTE, RTX-KG2) return less detailed metadata . . . (screenshot, from https://arax.ncats.io/?r=623df483-e0c8-45b5-80bb-38f15627c93c, specifically a 'treated by' edge in ARAGORN's TOFACITIMIB result)

. . . and when more detail is provided, a very different structure is used. In the rtx2-semmed example below, sentence text and pub date are stuffed next to the pmid in the publications attribute for convenience, and then duplicated in a richer JSON format alongside score and date info in a separate bts:sentence attribute: (two screenshots, from https://arax.transltr.io/?r=9360c5c9-cb10-47d2-9910-535fc4cbbf05, specifically a 'treats' edge in ARAGORN's biguaniide result)


In summary, SemMedDB edge metadata describing source publications, the sentences mined from them, and details about each (dates, scores, etc.) are inconsistently provided and represented across KPs, and do not use the same detailed structure as TMKP.

We should try to use a similar structure in all cases, aligning where possible with that defined by TMKP.
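To make the alignment concrete, here is a minimal sketch of how a KP might regroup flat per-publication sentence metadata into one nested attribute per publication. The input shape (a dict keyed by PMID with `sentence`, `publication date`, and score fields), the function name, and every `attribute_type_id` used here are hypothetical stand-ins, not the actual BTE or RTX-KG2 output or the agreed Biolink identifiers.

```python
from typing import Any, Dict, List

# Hypothetical mapping from flat field names to attribute type ids.
# The "example:" ids are placeholders; real ids would come from the agreed model.
FIELD_TO_TYPE_ID = {
    "sentence": "biolink:supporting_text",
    "publication date": "example:publication_date",
    "subject score": "example:subject_score",
    "object score": "example:object_score",
}


def regroup_by_publication(publications_info: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn {pmid: {field: value}} into one nested attribute per publication."""
    grouped = []
    for pmid in sorted(publications_info):
        info = publications_info[pmid]
        nested = [{"attribute_type_id": "biolink:publications", "value": pmid}]
        for field, type_id in FIELD_TO_TYPE_ID.items():
            if field in info:  # skip metadata the source did not provide
                nested.append({"attribute_type_id": type_id, "value": info[field]})
        grouped.append({
            "attribute_type_id": "biolink:has_supporting_study_result",  # assumed id
            "value": pmid,  # placeholder; a real study result would carry its own id
            "attributes": nested,
        })
    return grouped


# Made-up input for illustration only.
flat = {
    "PMID:0000002": {"sentence": "Disease Y was treated with drug X."},
    "PMID:0000001": {"sentence": "Drug X reduced symptoms of disease Y.",
                     "publication date": "2001 Jan", "subject score": 1000},
}
nested_attributes = regroup_by_publication(flat)
```

Missing fields are simply omitted rather than filled with nulls, so sources that report less detail still produce the same overall shape.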

edeutsch commented 1 year ago

@mbrush do we need to keep this issue open still or is it being tracked elsewhere?