RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Is the answer to the default question sensible? #1131

Closed edeutsch closed 4 months ago

edeutsch commented 3 years ago

Our default question for the JSON example has not changed, but the QueryGraphInterpreter now sends it to KG2 by default instead of KG1, and I forcibly limit the results to the top 50 to avoid bloat.

The question is: is the set of responses and rankings that are returned for this default question sensible? Or are there glaring errors?

https://arax.ncats.io/beta/?m=3234

edeutsch commented 3 years ago

I see lots of SemMedDB edges. But I don't see any ChEMBL edges. Shouldn't we expect to see ChEMBL edges?

dkoslicki commented 3 years ago

I see PTGS1 and PTGS2 in there, and that’s the extent of my bio knowledge about what the answer should be (without looking into it further).

Re: ChEMBL edges, @saramsey would know, but I do see provided_by: identifiers_org_registry:chembl.compound, so maybe ChEMBL is being ingested from identifiers.org?

edeutsch commented 3 years ago

ah, yes, I see it now, thanks. identifiers_org_registry:chembl.compound is the CURIE that is being used to mean "provided by ChEMBL". Looks odd, but makes sense. There's no single namespace handle for ChEMBL as a whole I guess.
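
A quick way to check which sources contributed edges is to tally the provided_by values. This is a minimal sketch, assuming each edge is a dict with a provided_by string; the sample edges, the helper name count_edges_by_source, and the "example:semmeddb" value are hypothetical and not taken from the actual response:

```python
from collections import Counter


def count_edges_by_source(edges):
    """Tally edges by their provided_by value; missing values count as 'unknown'."""
    return Counter(edge.get("provided_by", "unknown") for edge in edges)


# Hypothetical sample edges; real ARAX edges carry many more fields.
sample_edges = [
    {"id": "e1", "provided_by": "identifiers_org_registry:chembl.compound"},
    {"id": "e2", "provided_by": "example:semmeddb"},
    {"id": "e3", "provided_by": "example:semmeddb"},
    {"id": "e4"},
]

counts = count_edges_by_source(sample_edges)

# Any source whose CURIE mentions "chembl" is treated as ChEMBL-derived here.
chembl_edge_count = sum(n for source, n in counts.items() if "chembl" in source.lower())
print(counts)
print("ChEMBL-derived edges:", chembl_edge_count)
```

Matching on the substring "chembl" is only a convenience for eyeballing a response; a real check would compare against the exact CURIEs the knowledge graph uses.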

jaredroach commented 3 years ago

Alright, I'll bite. What's the question? When I click on https://arax.ncats.io/beta/?m=3234 I get this screen:

You have requested ARAX message id = 3234 Retrieving ARAX message id = 3234 Normal completion Rendering message...done.

So although the answer to the question might be sensible, I am not sure the question itself is sensible.

dkoslicki commented 3 years ago

The question is: which proteins is acetaminophen connected to / associated with / most associated with in the literature?

jaredroach commented 3 years ago

Poking a bit further, it looks like the question might be, "What proteins does acetaminophen interact with?"

I think the answer list is reasonable. I have to wonder whether #3 refers to the sense of "Ache" that is synonymous with dull pain, because if AChE means acetylcholinesterase, it is harder to track down the relationship. I wonder if the main knowledge driving this link is that if one overdoses on acetaminophen, then a treatment for the resultant liver injury is acetylcholinesterase inhibitors.

Some of the interactions are with CYP genes that detoxify acetaminophen, so these are very reasonable.

The #1 hit to ALT is almost certainly due to the literature on acetaminophen toxicity: "While acetaminophen overdose has been recognized as a cause of alanine aminotransferase (ALT) elevations for over 40 years..." So this is a pretty indirect mechanism. Tylenol kills the liver cells, which then leak a whole bunch of proteins, including ALT. So maybe it makes a ton of sense to return this as a #1 hit from Translator, but if one is expecting something along the lines of an interaction related to drug development thought processes, this seems distracting, not #1 important.

The above logic is also the source of the AST (aspartate aminotransferase) link, #3.

I don't think #7 is a protein. How did it end up as a protein node? gamma-glutamylcysteinylacetaminophen: maybe it is a dipeptide. Perhaps that counts as a protein. So my bad; let's call this a really good hit. But it comes back to the gray areas that were discussed in the node classification call we had a few weeks ago with the Data Representation committee.

And if we can call a dipeptide a protein, we might as well call individual amino acids proteins. L-cysteine zwitterion: I am not sure how this ends up on the list. Maybe as a treatment for acetaminophen overdose? Yup, a Google search confirms.

TL;DR: It sure would be nice to divide the results into two categories:

  1. proteins that are somehow conceptually related to acetaminophen, in the sense that when the concept of acetaminophen is invoked in a clinician's mind, the concept of that protein is also invoked, e.g., because that protein is used as a treatment for or a diagnostic of Tylenol overdose.
  2. proteins that directly interact with acetaminophen, particularly when acetaminophen is present at normal medical doses / concentrations. This would allow one to predict what proteins and therefore pathways and therefore symptoms one might use acetaminophen to treat, e.g., repurposing a drug. Or to predict side effects of normal usage. Because one usually doesn't have toxic levels of acetaminophen in them unless one has tried to commit suicide (unfortunately, all too common, and only a good way to destroy one's liver, not a good way to die).

dkoslicki commented 3 years ago

@jaredroach Re: the first part of your assessment: the “interacts with” flavor to this question is probably an artifact of the ranking. The question doesn’t actually specify an edge type (see the query graph below). I assume if you don’t filter to the top 50, at the bottom of the results you might find other kinds of relationships.

{
  "edges": [
    {
      "id": "qg2",
      "source_id": "qg1",
      "target_id": "qg0"
    },
    {
      "id": "N1",
      "relation": "N1",
      "source_id": "qg0",
      "target_id": "qg1",
      "type": "has_normalized_google_distance_with"
    }
  ],
  "nodes": [
    {
      "curie": "CHEMBL.COMPOUND:CHEMBL112",
      "id": "qg0"
    },
    {
      "id": "qg1",
      "type": "protein"
    }
  ]
}
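
To make the point concrete, here is a small sketch that lists which query graph edges carry no type constraint. It is not part of ARAX; the helper name edges_without_type is made up for illustration, and the JSON is the query graph shown above:

```python
import json

# The query graph from this comment, embedded as a string.
QUERY_GRAPH_JSON = """
{
  "edges": [
    {"id": "qg2", "source_id": "qg1", "target_id": "qg0"},
    {"id": "N1", "relation": "N1", "source_id": "qg0", "target_id": "qg1",
     "type": "has_normalized_google_distance_with"}
  ],
  "nodes": [
    {"curie": "CHEMBL.COMPOUND:CHEMBL112", "id": "qg0"},
    {"id": "qg1", "type": "protein"}
  ]
}
"""


def edges_without_type(query_graph):
    """Return the ids of query edges that do not constrain the edge type."""
    return [edge["id"] for edge in query_graph.get("edges", []) if "type" not in edge]


query_graph = json.loads(QUERY_GRAPH_JSON)
untyped_edge_ids = edges_without_type(query_graph)
print(untyped_edge_ids)  # ['qg2']
```

The only typed edge is N1, the normalized Google distance edge, so the qg2 edge between acetaminophen and the protein node can match any kind of relationship.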

jaredroach commented 3 years ago

So I think the code is working brilliantly, as intended by us programmers. Not to say the result is wrong, but just to give the user's perspective. It is sort of like asking for literature relationships to "frog". If you are a biologist, you are anticipating results like "amphibian". So you get a little thrown when the best hits in English literature are to "Louis XIV".

edeutsch commented 3 years ago

I think we could improve in two ways initially. 1) Fix the node synonymizer so that it does not conflate ache and AChE. I can do that pretty easily. 2) Rank known knowledge base interactions higher than SemMedDB associations so that they appear at the top of the list. I am not sure how best to make that happen, but I think we should try to figure out a way.
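
One possible shape for the second idea is to multiply each result's score by a per-source weight before sorting. This is only a sketch under stated assumptions: the result structure, the source_kind labels, the weights, and the scores are all hypothetical, and this is not how the ARAX ranker currently works:

```python
# Hypothetical weights: curated knowledge sources outrank text-mined ones.
DEFAULT_SOURCE_WEIGHTS = {"curated": 1.0, "text_mined": 0.5}


def rerank(results, weights=None, default_weight=0.5):
    """Sort results by score scaled by the weight of their knowledge source."""
    weights = DEFAULT_SOURCE_WEIGHTS if weights is None else weights

    def adjusted_score(result):
        return result["score"] * weights.get(result["source_kind"], default_weight)

    return sorted(results, key=adjusted_score, reverse=True)


# Made-up scores for illustration only; the names come from this thread.
results = [
    {"name": "ALT", "score": 0.9, "source_kind": "text_mined"},
    {"name": "PTGS2", "score": 0.7, "source_kind": "curated"},
    {"name": "PTGS1", "score": 0.6, "source_kind": "curated"},
]

ranked_names = [r["name"] for r in rerank(results)]
print(ranked_names)  # ['PTGS2', 'PTGS1', 'ALT']

# Passing different weights flips the emphasis, e.g. for someone studying overdose literature.
overdose_names = [r["name"] for r in rerank(results, {"curated": 0.5, "text_mined": 1.0})]
print(overdose_names)  # ['ALT', 'PTGS2', 'PTGS1']
```

Keeping the weights in a plain dictionary means they could be exposed as a user-tunable setting rather than hard-coded.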

jaredroach commented 3 years ago

I was (I think) kidding a bit when I suggested AChE might be conflated with ache. I don't think that actually happened. It doesn't hurt to check the code though.

Changing rankings based on knowledge source makes sense, and it has the advantage of being tunable/configurable. If someone were studying acetaminophen overdosing, they could in theory elevate those results.

edeutsch commented 3 years ago

I understood you were kidding, but you were right! The node synonymizer did erroneously merge these. There are probably other cases.
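
One way this kind of merge can arise is case-insensitive name matching. That is an assumption for illustration, not a description of what the node synonymizer actually does; group_synonyms is a hypothetical helper:

```python
from collections import defaultdict


def group_synonyms(names, case_sensitive):
    """Group names that share a lookup key; lowercasing the key merges case variants."""
    groups = defaultdict(set)
    for name in names:
        key = name if case_sensitive else name.lower()
        groups[key].add(name)
    return dict(groups)


names = ["ache", "AChE", "Ache"]

merged = group_synonyms(names, case_sensitive=False)
separate = group_synonyms(names, case_sensitive=True)

print(len(merged))    # 1: the symptom and the enzyme collapse into one concept
print(len(separate))  # 3: distinct, but "ache" and "Ache" are now split as well
```

Neither setting is right for every name, which suggests other cases like this one would need more than a case rule to find.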

edeutsch commented 4 months ago

closing ancient history.