RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
33 stars 21 forks source link

Adjust ARAX to use curie synonyms for KG2 queries #698

Closed amykglen closed 4 years ago

amykglen commented 4 years ago

Creating this task to track our implementation of the (pre-relay meeting) plan we came up with for #664.

Note: We decided to first try to tackle this problem of synonyms, then worry about allowing multiple node types (e.g., 'gene' and 'protein') and 'subclass of' powers.

(Others should feel free to edit this description.)

amykglen commented 4 years ago

An update on where I'm at with this for expand:

I made the desired changes and it is currently functioning (KG2 expands now automatically use synonyms), though with speed issues in certain situations.

Namely, for the KG2 version of the Parkinson's example we used for the demo, i.e.:

"add_qnode(curie=DOID:14330, id=n00)",
"add_qnode(type=protein, is_set=true, id=n01)",
"add_qnode(type=chemical_substance, id=n02)",
"add_qedge(source_id=n00, target_id=n01, id=e00)",
"add_qedge(source_id=n01, target_id=n02, id=e01, type=molecularly_interacts_with)",
"expand(edge_id=[e00,e01], kp=ARAX/KG2)",

the expansion is now really slow - takes almost 5 minutes.

But, strangely, it's way faster when the query graph edges are expanded individually, i.e.:

"add_qnode(curie=DOID:14330, id=n00)",
"add_qnode(type=protein, is_set=true, id=n01)",
"add_qnode(type=chemical_substance, id=n02)",
"add_qedge(source_id=n00, target_id=n01, id=e00)",
"add_qedge(source_id=n01, target_id=n02, id=e01, type=molecularly_interacts_with)",
"expand(edge_id=e00, kp=ARAX/KG2)",
"expand(edge_id=e01, kp=ARAX/KG2)",

these two expand calls take about 20 seconds combined.

I believe this is due to the way the cypher generation code works (I think it is creating a not so ideal cypher query from the query graph in the first case described above).

I'm working on fixing that (planning to move away from the existing cypher generation code and instead spin our own), but thought I'd post an update in case others want to get started on their parts. (The changes I'm currently working on shouldn't change expand's output - they'll just (hopefully) speed it up.)

amykglen commented 4 years ago

Ok, I managed to improve the speed a bit - expansion for the Parkinson's example now takes about 15 seconds using KG2 (using equivalent curies), even when expand(edge_id=[e00,e01]) is used. (Changes are in the expander branch.)

And some nice news is that the idiopathic pulmonary fibrosis query that expand failed to find results for in the demo:

"add_qnode(id=n00, curie=DOID:0050156)",
"add_qnode(id=n01, type=chemical_substance, is_set=true)",
"add_qnode(id=n02, type=protein)",
"add_qedge(id=e00, source_id=n00, target_id=n01)",
"add_qedge(id=e01, source_id=n01, target_id=n02)",
"expand(edge_id=[e00,e01], kp=ARAX/KG2)",

now turns up 554 proteins (connected to idiopathic pulmonary fibrosis through 45 chemical substances). Interestingly, out of all the equivalent curies for idiopathic pulmonary fibrosis, only the MESH term (MESH:D054990) is connected to any chemical substances (and it has a type of "named_thing").

amykglen commented 4 years ago

@dkoslicki @finnagin - Ok, I added three examples to ARAX_query.py for testing KG2 expand using synonyms: 201, 202, and 203. (They're roughly equivalent to the queries for the three demo examples.)

dkoslicki commented 4 years ago

@edeutsch @amykglen previously message.knowledge_graph.nodes[0].type was a list (i.e. the node types were a list). However, now it appears they are not lists. Was this an API change? Or did expander forget to populate this field with a list?

amykglen commented 4 years ago

That is my mistake I think - fixing it

amykglen commented 4 years ago

Ok - pushed the fix to expander branch. Sorry about that!

dkoslicki commented 4 years ago

Ok - pushed the fix to expander branch. Sorry about that!

No problem! Simple fix on both ends :)

dkoslicki commented 4 years ago

@dkoslicki @finnagin We'll need to fiddle with filter_kg thresholds (using something besides a raw value), since now the proper thresholds to apply drastically change when using KG2 vs KG1. For example, with ARAX_query.py 202 using KG2, after Jaccard index, we get:

number of edges with attribute jaccard_index: 1091
Mean of attribute jaccard_index: 0.008405327924800591
Median of attribute jaccard_index: 0.005319148936170213
Max of attribute jaccard_index: 0.031914893617021274
Min of attribute jaccard_index: 0.005319148936170213

Whereas for KG1, we get

number of edges with attribute jaccard_index: 1119
Mean of attribute jaccard_index: 0.04287252812721981
Median of attribute jaccard_index: 0.02564102564102564
Max of attribute jaccard_index: 0.15384615384615385
Min of attribute jaccard_index: 0.02564102564102564

So as a heads up @finnagin, we will probably need to add something like threshold_by_std_dev=x or something like that to filter_kg() where you can remove edges that have an attribute with a value x standard deviations above/below the mean. Or something to that effect. It will take some thinking about how best to do this, but this was a known "thing to address" in #626

dkoslicki commented 4 years ago

Since KG2 now has a bunch of different edge types, (eg. ARAX_query.py 202 results in 'molecularly_interacts_with': 2153, 'J1': 444, 'related_to': 115, 'is_substance_that_treats': 76, 'causes_or_contributes_to': 32, 'predisposes': 23, 'associated_with': 23, 'affects': 18, 'capable_of_inhibiting_or_preventing_pathological_process': 9, 'increase': 6, 'disrupts': 5, 'produces': 5}), might not want to compute jaccard treating all these edges as "the same". Also speaks to #699 where a hint about what sorts of connections are of interest would be helpful to then indicate to jaccard distance which edge type(s) to concentrate on.

dkoslicki commented 4 years ago

@finnagin @saramsey ARAX_query.py 202 (demo example 2, Parkinsons without the ML component) appears to be working just fine even with filter_kg, resultify and filter_results. However, of course the list of drugs we get back do not contain cildinipine (sp?) due to different filtering thresholds mentioned above.

amykglen commented 4 years ago

However, of course the list of drugs we get back do not contain cildinipine (sp?) due to different filtering thresholds mentioned above.

A potential confounding factor here is that KG2 actually currently contains fewer proteins between Parkinson's and cilnidipine, because KG2 ingests the version of KG1 at kg1endpoint.rtx.ai, which is apparently slightly different than the KG1 at arax.rtx.ai... see #591

match p=(n00 {id:'DOID:14330'})--(n01:protein)--(n02:chemical_substance) where toLower(n02.name)="cilnidipine" return p limit 30

kg2endpoint2.rtx.ai: Screen Shot 2020-04-09 at 11 17 37 AM

arax.rtx.ai: Screen Shot 2020-04-09 at 11 17 49 AM

dkoslicki commented 4 years ago

Yeah, that would be a problem. I was under the assumption that KG1 was a proper subset of kg2endpoint2. @saramsey can you confirm that KG1/arax.rtx.ai is not a proper subset/graph of kg2endpoint2? Also, how difficult would to be to ensure that it is? Seems like a rather unfortunate thing that we're losing nodes and edges in kg2endpoint2 (i.e. some reasoning appears to be better on KG1)

dkoslicki commented 4 years ago

@amykglen @finnagin Please see #702. Pulled this out as a separate issue as it might require a bit of coordination. See 5a2ec3c4 on overlay branch for the example of the problem

finnagin commented 4 years ago

One issue I've run into testing filter_kg is that the provided_by field is now populated by a string of a list. (e.g. "Pharos" is now "['https://pharos.nih.gov']")

However, actions_parser automatically detects if the argument is a list and converts it to a list rather than a string. Thus, I'm not entirely sure how to pass a string of a list. If I pass without quotes it reads it as a list while if I pass it with the quotes it reads those as part of the string. (i.e. as "\"['https://pharos.nih.gov']\"").

I can add some logic to filter_kg_by_property that that will check if a list is passed as an parameter for the provided_by field and change it to a string but I'm not sure if that is the best way to go about fixing this or if something should be changed in expander/actions_parser. Is the provided_by field supposed to be a string of a list now or is that an error?

dkoslicki commented 4 years ago

@finnagin I would ask @edeutsch if the API dictates that the provided_by must be a list and if it must be URL's and not names (and see if @saramsey made them URLs by some design choice). If so, you might be able to change your code (to make it KG1 and KG2 compatible) by checking the type of the provided_by in the KG and then either seeing if it matches the DSL filter option, or if the DSL provided string (eg "pharos") is in any of the strings in the list given by KG provided_by list.

Also, if they must be URL's, it will be handy for us to have a list somewhere of what URL's are actually there for our easy reference. And your code should fail gracefully (I think it already does) if the DSL filter name doesn't exist, maybe reporting what provided_by's actually exist in the KG in some Warning).

edeutsch commented 4 years ago

According to the current spec: https://github.com/NCATS-Tangerine/NCATS-ReasonerStdAPI/blob/master/API/TranslatorReasonersAPI_optional.yaml is_defined_by: type: string example: reasoner description: A CURIE/URI for the translator group that made the KG provided_by: type: string example: OMIM description: A CURIE/URI for the knowledge source that defined this edge neither are lists. These are scalars But I certainly see a reason for making them a list. We can propose it. Do we actually have lists in the KG? Their form is a bit vague. The docs say they should be CURIEs or URIs. But until we (NCATS groups) all step back and compare results and decide on the one way to do it, it's just much ado and no conclusion: https://github.com/biolink/biolink-model/issues/175

So... IMO, put whatever you want there and be prepared to change it when Translator collectively realizes that that practice (everyone does as they please) is not sustainable.

amykglen commented 4 years ago

Nice - yes, edge provided_by is a list in KG2 (and I confirmed there are at least some instances of edges having more than one item in their provided_by list).

Though the property shouldn't be appearing as a string of a list on Edges output by expand, as Finn reported - that's definitely an issue in expand I can fix. If we want to go ahead and switch to the list paradigm for this property, I can make it an actual list. Or I guess our only other option at the moment is for expand to just grab the first item in provided_by for edges coming from KG2.

edeutsch commented 4 years ago

I would suggest making it a list and see if the system explodes. If it does, then back off to list[0]. We should bring it up to the standards committee in any case.

edeutsch commented 4 years ago

Just so I'm sure I understand, is the issue that during the building of KG2 if two different KSs make exactly the same assertion (edge) with matching CURIEs and everything, then the edges are not duplicated but rather the provided_by list is extended? Is that right?

amykglen commented 4 years ago

That's also my understanding, though I don't know for sure. @saramsey would. :)

amykglen commented 4 years ago

So right now, expand fills out Edge.provided_by as a list for edges from KG2, but still as a string for edges from KG1. Do we want to switch over fully to a list? (And propose changing this in the API standard?)

I believe the only thing that has broken so far when provided_by is a list is filter_kg(action=remove_edges_by_property, edge_property=provided_by, which was being discussed above. (Assuming we do want to move to a list, I can create a separate issue for that.)

finnagin commented 4 years ago

@amykglen This is actually fixed now in the filter branch. filter_kg should work on both the string and list versions of provided_by. I currently have it set up so that it if a property is a list remove_edges_by_property will remove any edge that contains the provided string within that list.

edeutsch commented 4 years ago

Note that provided_by is not really in the core part of the standard (although I will push again for it to be so). It is in the "experimental" part.

One thing to consider is there are other attributes of an Edge that might vary from provider to provider, e.g. confidence, publications, qualifiers, edge_attributes. And therefore edge coalescence may only be appropriate if don't want to track such information "per provider". If we do, then they have to be separate edges.

amykglen commented 4 years ago

@finnagin - Oh awesome! Thanks!

@edeutsch - Ah, got it. Yeah, that's a great point about edge coalescence... Seems like an interesting philosophical discussion.

saramsey commented 4 years ago

Just so I'm sure I understand, is the issue that during the building of KG2 if two different KSs make exactly the same assertion (edge) with matching CURIEs and everything, then the edges are not duplicated but rather the provided_by list is extended? Is that right?

@amykglen is correct -- yes

saramsey commented 4 years ago

Note that provided_by is not really in the core part of the standard (although I will push again for it to be so). It is in the "experimental" part.

One thing to consider is there are other attributes of an Edge that might vary from provider to provider, e.g. confidence, publications, qualifiers, edge_attributes. And therefore edge coalescence may only be appropriate if don't want to track such information "per provider". If we do, then they have to be separate edges.

For publications, I felt that it is unlikely to matter which provider provided which publication data for a given assertion, so combining the publications is what I currently do. But perhaps there is a way to encode the source along with the publication info.

saramsey commented 4 years ago

Note that provided_by is not really in the core part of the standard (although I will push again for it to be so). It is in the "experimental" part.

I included it in KG2 because it is in the Biolink model. I have opened an issue to suggest that a list type be allowed for provided_by

https://github.com/biolink/biolink-model/issues/336

amykglen commented 4 years ago

I think this issue can be closed; functioning as desired, and remaining synonym-related work is happening in #707/#823.