how to figure out text mining riddles?

gglusman commented 10 months ago

It frequently happens that a paper is given as support for an assertion, but it's hard to find entities in the paper since they use aliases / alternate names. What's the best way to figure this out? Here I show an example using the Translator UI (which users are expected to use) and then the ARAX UI (which presumably users won't be digging into).

In a 2024/1/11 test querying for drugs that may upregulate NGLY1, the top result was S-adenosylmethionine, which is claimed to increase the activity or abundance of the MYT1L gene and 7 proteins, which in turn are claimed to upregulate NGLY1.

Digging into the support for 'MYT1L upregulates NGLY1'.

Looking into the paper, I recognize Png1 as an alias of NGLY1, but I can't find MYT1L (by any other name) in there.

The assertion comes via BTE, ultimately from TMKP.

This seems to indicate that MYT1L is mentioned in the sentence "Png1 preferentially deglycosylates misfolded proteins in vitro (Hirsch et al., 2004b; Joshi et al., 2005) and in cell extracts upon the overexpression of Png1 or glycoproteins (Hirsch et al., 2003)." I just don't see how. Is there any way to tell?

sierra-moxon commented 10 months ago

Hi @bill-baumgartner - is there some metadata here that we should be looking for to justify the TMKP answer? :) thanks in advance.

bill-baumgartner commented 9 months ago

In this specific case, Png1 is associated with the Protein Ontology records for both NGLY1 and MYT1L, and both are subsequently used to form an assertion. This is clearly a disambiguation error on the part of TMKP.

Regarding the question about solving these kinds of riddles: yes, there is metadata to help the user understand which parts of the text were tagged as the subject and object, but the UI does not yet make use of it. You can see it in the screenshot above. The biolink:subject_location_in_text and biolink:object_location_in_text attributes store character offsets into the sentence that supports the assertion. These offsets could be used to highlight the parts of the sentence that correspond to the subject and object of the assertion. If you use the ARAX interface, you can click on the value_url field for a given assertion, and it will open a page in the browser that shows the sentence with the entities highlighted.
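To illustrate how those offsets could drive highlighting, here is a minimal Python sketch. It is not how the Translator UI or ARAX does it. The helper names (parse_offsets, highlight) are hypothetical, the "start|end" string format for the attribute value is an assumption, and the offsets are computed from the sentence quoted earlier in this thread rather than copied from the actual TMKP record.

```python
from collections import defaultdict


def parse_offsets(value: str) -> tuple[int, int]:
    """Parse an offset string into (start, end).

    ASSUMPTION: the attribute value looks like "start|end"; check the
    real TMKP attribute before relying on this format.
    """
    start, end = value.split("|")
    return int(start), int(end)


def highlight(sentence: str, spans: list[tuple[int, int, str]]) -> str:
    """Wrap each (start, end, label) span of the sentence in [label: text].

    Labels that share an identical span are merged (e.g. "subject+object").
    Partially overlapping spans are not handled in this sketch.
    """
    by_span = defaultdict(list)
    for start, end, label in spans:
        by_span[(start, end)].append(label)

    out = sentence
    # Work right to left so earlier offsets stay valid after each insertion.
    for (start, end), labels in sorted(by_span.items(), reverse=True):
        tag = "+".join(labels)
        out = out[:start] + f"[{tag}: {sentence[start:end]}]" + out[end:]
    return out


sentence = (
    "Png1 preferentially deglycosylates misfolded proteins in vitro "
    "(Hirsch et al., 2004b; Joshi et al., 2005) and in cell extracts upon "
    "the overexpression of Png1 or glycoproteins (Hirsch et al., 2003)."
)

# Illustrative offsets only: the two mentions of "Png1" in the sentence.
subject_span = parse_offsets("0|4")
second = sentence.index("Png1", 4)
object_span = (second, second + 4)

print(highlight(sentence, [(*subject_span, "subject"), (*object_span, "object")]))
```

With markers like these, a case such as the one above becomes visible at a glance: both the subject and the object point at the text "Png1", even though the assertion names MYT1L and NGLY1.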

gglusman commented 9 months ago

@bill-baumgartner That interface showing the details on the evidence is awesome. Thanks!!

NCATSTranslator / Feedback

how to figure out text mining riddles? #684