RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
33 stars 21 forks source link

ARAX returning results with predicates that are not supported by Biolink Model 3.1.2 #1967

Closed edeutsch closed 4 months ago

edeutsch commented 1 year ago

I looked into this very briefly so I thought I'd start an issue locally.

Related to this: https://github.com/NCATSTranslator/Feedback/issues/118

It seems like there are two possibilities: 1) KG2.8.0 contains "gene_associated_with_condition", which apparently it should not 2) They are querying production KG2.7.6 instead of CI KG2.8.0

The query in question appears to be: 827077 2023-02-15 19:42:47 40 ars.ci.transltr.io 52.4.10.150, 10.11.0.190 arax.ci.transltr.io arax-0 ARAX 5094 129227 ✓ Completed OK Normal completion with 24 results.

Are the result is https://arax.ncats.io/?r=129227 https://arax.ncats.io/api/arax/v1.3/response/129227

As far as I can tell they are querying CI with KG2.8.0, so I think that rules out option 2. Although I am not really certain.

There is also potentially option 3: 3) As far as we were aware, the ask was for gene-chemical associations: image not for gene-disease associations. So this is not yet done.

That's about all I know. Tossing this out for others more knowledgeable..

saramsey commented 1 year ago

Hi @edeutsch thank you for reporting this issue in the RTX project area so we can track it locally. I guess my first order of business is to try and ascertain, when a query is posted on ui.ci.transltr.io, which ARAX endpoint is queried? How would we know?

edeutsch commented 1 year ago

I'm uncertain that I understand the question correctly, but the endpoint is likely to be /async_query. But could be /query. I think an easy way to know is to look at the second log entry, and if it is "asynchronous Query launching on incoming Query", then you know that the endpoint was /async_query

saramsey commented 1 year ago

Thanks @edeutsch. I didn't phrase my question well. I meant, what ARAX API base URL is being hit? How do we know that it is an ARAX installation that configured to query RTX-KG2.8.0c?

edeutsch commented 1 year ago

Well, you can see in the OP that arax.ci.transltr.io is being hit. And arax.ci.transltr.io should be hitting KG2.8.0c. But it is not trivial to be sure. I sifted through the logs to find a tell and did not come up with one. I thought it listed in the logs exactly which KP endpoint URLs are being bit, but I didn't see it there. Maybe I just missed it.

I just went to arax.ci.transltr.io and issued our example query and only 17 results came back, which I think is the surest sign that arax.ci.transltr.io is using KG2.8.0c right now (KG2.7.6c returns >100). But was it 2.8.0c when the initial query was sent? I'm not 100% certain. I think so, but there is room for uncertainty here. Maybe we need to consider augmenting the log messages to make it more clear.

saramsey commented 1 year ago

Well, you can see in the OP that arax.ci.transltr.io is being hit. And arax.ci.transltr.io should be hitting KG2.8.0c. But it is not trivial to be sure. I sifted through the logs to find a tell and did not come up with one. I thought it listed in the logs exactly which KP endpoint URLs are being bit, but I didn't see it there. Maybe I just missed it.

I just went to arax.ci.transltr.io and issued our example query and only 17 results came back, which I think is the surest sign that arax.ci.transltr.io is using KG2.8.0c right now (KG2.7.6c returns >100). But was it 2.8.0c when the initial query was sent? I'm not 100% certain. I think so, but there is room for uncertainty here. Maybe we need to consider augmenting the log messages to make it more clear.

Thanks for your reply. Because I cannot remember any of these details, I took a look at this Google Sheet which seems to aim to disambiguate the various ARAX instances:

https://docs.google.com/spreadsheets/d/1eC3GrRW6gw5zn7XKjvaHD9KCulO-GqEaTJ3mG6PgOOY/edit#gid=0

Screen Shot 2023-02-17 at 1 13 00 PM

Isn't KG2.8.0c only served up in development maturity level ARAX installations?

saramsey commented 1 year ago

I would imagine that arax.ci.transltr.io would be querying the RTX-KG2 service on kg2.ci.transltr.io. As far as I know (but maybe I am out-of-date), that instance is running code from the master branch of the RTX project area, whereas I thought that, in the RTX project area, all the KG2.8.0c stuff was in the kg2-integration branch. Paging @amykglen and @acevedol for a sanity check on what I just wrote.

saramsey commented 1 year ago

Ah, wait, I see that the master branch of RTX has file code/config_dbs.json which is clearly referencing KG2.8.0c,

Screen Shot 2023-02-17 at 1 18 41 PM
edeutsch commented 1 year ago

It is my understanding:

I would imagine that arax.ci.transltr.io would be querying the RTX-KG2 service on kg2.ci.transltr.io.

correct.

As far as I know (but maybe I am out-of-date), that instance is running code from the master branch of the RTX project area,

correct.

whereas I thought that, in the RTX project area, all the KG2.8.0c stuff was in the kg2-integration branch.

This is true, but it is also now in master. So anything running master should have KG2.8.0.

Paging @amykglen and @acevedol for a sanity check on what I just wrote.

saramsey commented 1 year ago

Thanks @edeutsch. So, the view from the synonymizer supports your expectation that arax.ci.transltr.io is indeed backed by the RTX-KG2.8.0c KP:

Screen Shot 2023-02-17 at 1 26 30 PM
saramsey commented 1 year ago
  1. KG2.8.0 contains "gene_associated_with_condition", which apparently it should not

Well, gene associated with condition is in the Biolink 3.0 spec (the version of Biolink against which RTX-KG2.8.0pre was built). Here is the permalink: https://github.com/biolink/biolink-model/blob/1efe2ed5a738f9cb4c32566f8cb7e713f62fa1ab/biolink-model.yaml#L4420

That predicate is also not deprecated:

Screen Shot 2023-02-17 at 1 33 29 PM
saramsey commented 1 year ago

Note, the predicate gene associated with condition is also in Biolink 3.1.2, and also (in that release) not deprecated:

Screen Shot 2023-02-17 at 1 36 18 PM

Also, I note that in NCATSTranslator/Feedback issue 118, they didn't boldface gene_associated_with_condition, so I am not sure that is the predicate they were raising an issue about?

Screen Shot 2023-02-17 at 1 36 53 PM
saramsey commented 1 year ago

Now, as to KCNH3 [increases_activity_of] Gentamycin, that's another story. That edge does indeed seem to be coming from "RTX-KG2",

Screen Shot 2023-02-17 at 1 38 23 PM
edeutsch commented 1 year ago

ah, sorry, my error, I mis-inferred which predicate they were unhappy about

saramsey commented 1 year ago

But RTX-KG2.8.0pre doesn't have that predicate!

Screen Shot 2023-02-17 at 1 40 30 PM

And yes, in the above query, I am talking to RTX-KG2.8.0pre:

Screen Shot 2023-02-17 at 1 41 06 PM
saramsey commented 1 year ago

So, my friends, we have a bit of a mystery on our hands. We have an ARAX result-set that supposedly comes from a query being executed via ui.ci.transltr.io,

https://arax.ncats.io/?r=cccb6699-29f4-492b-9733-ab77bd1a8261

that pulls up a collection of edges that includes a Biolink 2.X.X-era predicate (i.e., biolink:increases_activity_of). Which seems to imply that somewhere under the hood, that query is being serviced by an RTX-KG2 KP that is backed by KG2.7.6c and not KG2.8.0c. But how?

saramsey commented 1 year ago

Here is the JSON query:

{
  "edges": {
    "N1": {
      "attribute_constraints": [],
      "object": "sn",
      "predicates": [
        "biolink:has_normalized_google_distance_with"
      ],
      "qualifier_constraints": [],
      "subject": "on"
    },
    "creative_DTD_qedge_0": {
      "attribute_constraints": [],
      "exclude": false,
      "object": "creative_DTD_qnode_0",
      "option_group_id": "creative_DTD_option_group_0",
      "qualifier_constraints": [],
      "subject": "sn"
    },
    "creative_DTD_qedge_1": {
      "attribute_constraints": [],
      "exclude": false,
      "object": "creative_DTD_qnode_1",
      "option_group_id": "creative_DTD_option_group_0",
      "qualifier_constraints": [],
      "subject": "creative_DTD_qnode_0"
    },
    "creative_DTD_qedge_2": {
      "attribute_constraints": [],
      "exclude": false,
      "object": "on",
      "option_group_id": "creative_DTD_option_group_0",
      "qualifier_constraints": [],
      "subject": "creative_DTD_qnode_1"
    },
    "t_edge": {
      "attribute_constraints": [],
      "knowledge_type": "inferred",
      "object": "on",
      "predicates": [
        "biolink:treats"
      ],
      "qualifier_constraints": [],
      "subject": "sn"
    }
  },
  "nodes": {
    "creative_DTD_qnode_0": {
      "constraints": [],
      "is_set": true,
      "option_group_id": "creative_DTD_option_group_0"
    },
    "creative_DTD_qnode_1": {
      "constraints": [],
      "is_set": true,
      "option_group_id": "creative_DTD_option_group_0"
    },
    "on": {
      "categories": [
        "biolink:Disease"
      ],
      "constraints": [],
      "ids": [
        "MONDO:0007972"
      ],
      "is_set": false
    },
    "sn": {
      "categories": [
        "biolink:NamedThing"
      ],
      "constraints": [],
      "is_set": false
    }
  }
}
saramsey commented 1 year ago

I am posting this DSL query to arax.ci.transltr.io right now, to see what we get:

add_qnode(ids=CHEMBL.COMPOUND:CHEMBL643, key=n0)
add_qnode(categories=biolink:Protein, key=n1)
add_qedge(subject=n0, object=n1, key=e0)
expand(kp=infores:rtx-kg2)
resultify()
filter_results(action=limit_number_of_results, max_results=100)

I am hoping to pull up the CHEMBL.COMPOUND:CHEMBL643--UniProtKB:Q9ULD8 edge so I can examine the predicate. Fingers crossed....

saramsey commented 1 year ago

So, here is the edge in question. Note the Biolink-3.0-compatible predicate:

Screen Shot 2023-02-17 at 2 10 25 PM
saramsey commented 1 year ago

And if we click on the edge, we see:

Screen Shot 2023-02-17 at 2 11 01 PM
saramsey commented 1 year ago

This latest evidence motivates me to ask, what exactly tells us that ui.ci.transltr.io is querying arax.ci.transltr.io specifically (and not some other ARAX instance) via the ARS?

edeutsch commented 1 year ago

aha, wait a sec. The query_graph you show above has lots of "creative_DTD" stuff in it (that I don't fully understand).. what if this edge is coming out of DTD results? and not from KG2.8.0c itself?

saramsey commented 1 year ago

Would DTD be giving an edge with 'biolink:increases_activity_of` as the predicate?

saramsey commented 1 year ago

Wouldn't DTD leave some trace in the edge attributes (like EPC type stuff?), that it was a predicted edge?

edeutsch commented 1 year ago

I don't know. Perhaps my idea is preposterous. But sometimes that's all I've got. Quite often, actually.. I think we may need @amykglen and @chunyuma to come to the rescue..

saramsey commented 1 year ago

So, I have confirmed that in KG2.7.6c, the old (Biolink-2.X.X-era) predicate is there:

Screen Shot 2023-02-17 at 3 36 57 PM
saramsey commented 1 year ago

At this point, I am fairly confident that, somehow, the cached result posted in NCATSTranslator/Feedback issue 118 is showing an edge from RTX-KG2.7.6c.

amykglen commented 1 year ago

so I think because the original query was a knowledge_type: inferred query, I don't think RTX-KG2 is actually being queried directly as a KP (at least in the usual sense).

I think it's XDTD that added all those KG2 edges, as @edeutsch suggested - and I don't know the details of how XDTD does that. not sure if it queries KG2 at runtime? but certainly those edges are from an older KG2 version. @chunyuma or @dkoslicki would know the details about where those edges are coming from.

of note, the edges added by XDTD seem to be lacking attributes (except for one attribute each, indicating they came from KG2); that probably isn't ideal. at the very least we should be 'tagging' those edges to make it clear they were added by XDTD, I'd think?

saramsey commented 1 year ago

so I think because the original query was a knowledge_type: inferred query, I don't think RTX-KG2 is actually being queried directly as a KP (at least in the usual sense).

I think it's XDTD that added all those KG2 edges, as @edeutsch suggested - and I don't know the details of how XDTD does that. not sure if it queries KG2 at runtime? but certainly those edges are from an older KG2 version. @chunyuma or @dkoslicki would know the details about where those edges are coming from.

of note, the edges added by XDTD seem to be lacking attributes (except for one attribute each, indicating they came from KG2); that probably isn't ideal. at the very least we should be 'tagging' those edges to make it clear they were added by XDTD, I'd think?

Thank you, @amykglen. Agreed on all points.

saramsey commented 1 year ago

Hi @chunyuma when you have a moment, could you please weigh in on this? We want to know if the DTD module could be adding edges to the query-specific KG, with predicates like biolink:increases_activity_of.

dkoslicki commented 1 year ago

Sorry for the delay @saramsey and @amykglen. The xDTD model does not query KG2 at run-time. This is likely a problem due to training the xDTD model on a previous version of KG2 and predicates have since changed. @chunyuma said he can look into how to address that. Any idea what the priority level is for this?

chunyuma commented 1 year ago

Hi @saramsey @edeutsch @amykglen, I'm sorry for the late response. We did introduce some non-biolink predicates into the system. I mentioned this issue previously but it seems not a big issue. However, xDTD (it uses biolink:treats) or xCRG (it uses bioink:regulates) doesn't use the predicates (like biolink:increases_activity_of or biolink:decreases_activity_of). They seem to be in the KG2c for a while because I saw them in many scripts (see here, here, and here).

Here are a few non-predicates introduced by DTD, and some Overlay modules:

DTD (e.g. predict_drug_treats_disease.py) model introduces biolink:probably_treats FET (e.g. fisher_exact_test.py) introduces biolink:has_fisher_exact_test_p_value_with compute_jaccard introduces biolink:has_jaccard_index_with overlay_clinical_info.py introduces biolink:has_real_world_evidence_of_association_with

I don't know if these predicates are problems or not, please let me know if we need to change them. Thanks!

saramsey commented 1 year ago

Sorry for the delay @saramsey and @amykglen. The xDTD model does not query KG2 at run-time. This is likely a problem due to training the xDTD model on a previous version of KG2 and predicates have since changed. @chunyuma said he can look into how to address that. Any idea what the priority level is for this?

Thanks, David. Are you positing that DTD, having been built against KG2.7.6c, is putting KG2.7.6c edges into the query-specific KG? What we are seeing is edges with KG2.7.6c predicates, like biolink:increases_activity_of, showing up in results. My suspicion all along has been that this is likely happening because, somehow, the ARAX on arax.ci.transltr.io is querying a RTX-KG2 KP instance that is backed by KG2.7.6c. But @edeutsch is wondering if these edges are coming from DTD.

chunyuma commented 1 year ago

@saramsey, the xDTD is trained based on KG2.7.3c which is basically the KG2 paper used. So the edges it returned are based on KG2.7.3c. We can filter out some "bad" edges (e.g. biolink:increases_activity_of) within the ARAX_infer module if you want.

edeutsch commented 1 year ago

Right, @saramsey. The xDTD is trained based on KG2.7.3c which is basically the KG2 paper used. So the edges it returned are based on KG2.7.3c. We can filter out some "bad" edges (e.g. biolink:increases_activity_of) within the ARAX_infer module if you want.

or perhaps transform them on the fly to the correct Biolink 3.1 style?

chunyuma commented 1 year ago

@edeutsch, it is always painful and involves lots of work in re-training the whole xDTD model to keep pace with the KG2 update. The whole process might need more than 2-3 weeks. I think the easiest way is to filter out those "bad" edges within ARAX_infer.

dkoslicki commented 1 year ago

@chunyuma Eric wasn't suggesting re-training the model, rather just have rules in place that convert predicates to their new wording after xDTD has returned its results

dkoslicki commented 1 year ago

So don't filter out the edges, rather update the predicate

chunyuma commented 1 year ago

@chunyuma Eric wasn't suggesting re-training the model, rather just have rules in place that convert predicates to their new wording after xDTD has returned its results

@edeutsch @dkoslicki, is there a predicate mapping that I can use to covert predicates to their new wording after xDTD has returned its results?

chunyuma commented 1 year ago

For example, what predicate I should covert biolink:increases_activity_of to in the Biolink 3.1 style? How about increases_amount_or_activity_of? I saw biolink 3.1 has this new predicate from here.

edeutsch commented 1 year ago

The RTX-KG2 team @saramsey @acevedol have been working on this very problem, so they would know and maybe have a nice mapping

saramsey commented 1 year ago

@saramsey, the xDTD is trained based on KG2.7.3c which is basically the KG2 paper used. So the edges it returned are based on KG2.7.3c. We can filter out some "bad" edges (e.g. biolink:increases_activity_of) within the ARAX_infer module if you want.

Thank you! Yes, I think a quick-and-dirty predicate-mapping patch might be the solution here, while we wait for the new build of the DTD database on Beast. @acevedol can you help provide a mapping between KG2.7.3 predicates andKG2.8.0 predicates, for just the types of predicates that have been flagged at issue here, which I think includes biolink:increases_activity_of and (I presume, by symmetry), biolink:decreases_activity_of?

saramsey commented 1 year ago

Thank you @edeutsch for your insightful diagnosing of this issue.

edeutsch commented 1 year ago

just musing aloud, I wonder this this translation layer might be used a bit more generically even for KG2.8.0. As work to remap other old predicates to their new qualified form continues, I wonder if some general translation table could be used to update other predicates on the fly in advance of updates to KG2.8.0. i.e. as soon as you determine that A should be remapped to B, as you start work on transforming the KG load process, you could also update the on-the-fly translation layer, so that the new form is available immediately, not just at the next KG2 build.

Of course, if it were that easy, you probably would have just done that to all of KG2 already..

saramsey commented 1 year ago

@edeutsch currently, DTD doesn't query the RTX-KG2 KP and it doesn't query the PloverDB. So, I think we have three choices:

  1. ARAX-infer performs the translation on-the-fly, based on rules that are maintained in an artifact/file that resides in (and is maintained in) the ARAX code-base and thus available to be read when ARAX starts up
  2. ARAX-infer performs the translation on-the-fly, based on rules that are obtained dynamically from the RTX-KG2 API
  3. ARAX-infer performs the translation on-the-fly, based on rules that are obtained dynamically from the PloverDB API

Are those three basically the options?

I am not wild about making ARAX do the translation, on behalf of ARAX-infer. Ideally, Biolink-compliance should be expected at the boundaries between modules within ARAX. Right?

edeutsch commented 1 year ago

I am thinking that for options 2 and 3, a whole API would need to be designed and it doesn't seem worth it. Option 1 seems very reasonable and easier than 2 or 3. In theory this mechanism would be eventually obsoleted when all the various ARAX/KG2 components all caught up. But of course, by then Biolink 3.5 will add a new twist for the mechanism to consider.

Ideally, Biolink-compliance should be expected at the boundaries between modules within ARAX. Right?

I don't know. They are, in fact, all Biolink compliant. They're just compliant with different versions of Biolink, which is a fast moving target. Since it seems very difficult for all of our components to move in lockstep, it seems reasonable to have some mechanism for temporary translation.

Putting it in ARAX-infer seems reasonable if only ARAX-infer is going to be behind.

Putting it more in ARAX proper may also make sense if other things might be a little behind. Consider the possibility that next week Biolink 3.2 is released. KG2 is behind. MolePro is even further behind. But there are relatively few changes in Biolink 3.2 (perhaps just a few more predicates turned into qualifiers). The ARAX translation layer can take results from BL 3.1 KG2 and BL 3.0 Molepro and tranform it all into a tidy BL 3.2 world just applying a few relatively simple-to-encode rules. this means we don't fret as as much when KG2 seems behind and we can always claim to support the latest release? Maybe Biolink 3.2 renames ChemicalEntity to ChemicalThingy. If this is a 5-minute fix in the translation layer, then we don't need to fret about KG2 being behind and creating a new release quite so urgently. I'm probably making it seem easier than it actually would be...

saramsey commented 1 year ago

Thank you @edeutsch. Yes, to me, that's a very convincing argument for a dynamic translation layer. You also raise an interesting argument (which I hadn't considered) for centralizing the translation layer in ARAX. I guess there is not a one-to-one correspondence between TRAPI releases and Biolink releases, such that we could find a situation where an ARA and a KP both claim to be speaking "TRAPI version X", but the KP is using Biolink version Y and the ARA is expecting/speaking Biolink version Z, where Z <> Y. And such a situation is not prohibited by Translator Reasoners Standard API specification, correct?

edeutsch commented 1 year ago

Indeed there is not a one-to-one correspondence between TRAPI releases and Biolink releases. Some big changes in Biolink require a change to TRAPI (e.g. qualifiers) but most of the changes in Biolink that we've been wrestling with over the past few years (adding and changing categories and predicates) have been largely TRAPI neutral.

While 95% of all agents are solidly on TRAPI 1.3 today, I surmise that if anyone actually tried to determine this, there are probably 5 or more different Biolink versions reflected in KPs available today. And even if you tested claims of Biolink version support from various agents at face value, I bet you would find the truth is quite a bit more varied.

Even perfect translation is not necessary, I contend there's a lot of low-hanging fruit for easy 90-10 translation.

saramsey commented 1 year ago

Eric, this issue was discussed at today's AHM. I believe we decided to try to rebuild DTD against KG2.8.0c and to try to automate some of the data processing steps that currently make rebuilding DTD labor-intensive. As for a translation layer, are you thinking in terms of a web service for that?

edeutsch commented 1 year ago

Great! Sorry I could not make it today. I really would have wanted to be there, it sounds like a missed a lot, but the conference here also was crucial for me.

I was thinking of an internal Python class for the translation, not a web service. But perhaps a web service could make sense. It might help others. But would be less efficient for us if that is the only interface. I would suggest internal class/method that can be called near the end of processing.

saramsey commented 1 year ago

As the remedy has been determined to be a rebuild of the DTD database, I am taking myself off the assignee list for this issue.