Closed amykglen closed 4 years ago
An update on where I'm at with this for expand
:
I made the desired changes and it is currently functioning (KG2 expands now automatically use synonyms), though with speed issues in certain situations.
Namely, for the KG2 version of the Parkinson's example we used for the demo, i.e.:
"add_qnode(curie=DOID:14330, id=n00)",
"add_qnode(type=protein, is_set=true, id=n01)",
"add_qnode(type=chemical_substance, id=n02)",
"add_qedge(source_id=n00, target_id=n01, id=e00)",
"add_qedge(source_id=n01, target_id=n02, id=e01, type=molecularly_interacts_with)",
"expand(edge_id=[e00,e01], kp=ARAX/KG2)",
the expansion is now really slow - takes almost 5 minutes.
But, strangely, it's way faster when the query graph edges are expanded individually, i.e.:
"add_qnode(curie=DOID:14330, id=n00)",
"add_qnode(type=protein, is_set=true, id=n01)",
"add_qnode(type=chemical_substance, id=n02)",
"add_qedge(source_id=n00, target_id=n01, id=e00)",
"add_qedge(source_id=n01, target_id=n02, id=e01, type=molecularly_interacts_with)",
"expand(edge_id=e00, kp=ARAX/KG2)",
"expand(edge_id=e01, kp=ARAX/KG2)",
these two expand calls take about 20 seconds combined.
I believe this is due to the way the cypher generation code works (I think it is creating a not so ideal cypher query from the query graph in the first case described above).
I'm working on fixing that (planning to move away from the existing cypher generation code and instead spin our own), but thought I'd post an update in case others want to get started on their parts. (The changes I'm currently working on shouldn't change expand
's output - they'll just (hopefully) speed it up.)
Ok, I managed to improve the speed a bit - expansion for the Parkinson's example now takes about 15 seconds using KG2 (using equivalent curies), even when expand(edge_id=[e00,e01])
is used. (Changes are in the expander
branch.)
And some nice news is that the idiopathic pulmonary fibrosis query that expand
failed to find results for in the demo:
"add_qnode(id=n00, curie=DOID:0050156)",
"add_qnode(id=n01, type=chemical_substance, is_set=true)",
"add_qnode(id=n02, type=protein)",
"add_qedge(id=e00, source_id=n00, target_id=n01)",
"add_qedge(id=e01, source_id=n01, target_id=n02)",
"expand(edge_id=[e00,e01], kp=ARAX/KG2)",
now turns up 554 proteins (connected to idiopathic pulmonary fibrosis through 45 chemical substances). Interestingly, out of all the equivalent curies for idiopathic pulmonary fibrosis, only the MESH term (MESH:D054990) is connected to any chemical substances (and it has a type of "named_thing").
@dkoslicki @finnagin - Ok, I added three examples to ARAX_query.py
for testing KG2 expand using synonyms: 201, 202, and 203. (They're roughly equivalent to the queries for the three demo examples.)
@edeutsch @amykglen previously message.knowledge_graph.nodes[0].type
was a list (i.e. the node types were a list). However, now it appears they are not lists. Was this an API change? Or did expander
forget to populate this field with a list?
That is my mistake I think - fixing it
Ok - pushed the fix to expander
branch. Sorry about that!
Ok - pushed the fix to
expander
branch. Sorry about that!
No problem! Simple fix on both ends :)
@dkoslicki @finnagin We'll need to fiddle with filter_kg
thresholds (using something besides a raw value), since now the proper thresholds to apply drastically change when using KG2 vs KG1. For example, with ARAX_query.py 202
using KG2, after Jaccard index, we get:
number of edges with attribute jaccard_index: 1091
Mean of attribute jaccard_index: 0.008405327924800591
Median of attribute jaccard_index: 0.005319148936170213
Max of attribute jaccard_index: 0.031914893617021274
Min of attribute jaccard_index: 0.005319148936170213
Whereas for KG1, we get
number of edges with attribute jaccard_index: 1119
Mean of attribute jaccard_index: 0.04287252812721981
Median of attribute jaccard_index: 0.02564102564102564
Max of attribute jaccard_index: 0.15384615384615385
Min of attribute jaccard_index: 0.02564102564102564
So as a heads up @finnagin, we will probably need to add something like threshold_by_std_dev=x
or something like that to filter_kg()
where you can remove edges that have an attribute with a value x
standard deviations above/below the mean. Or something to that effect. It will take some thinking about how best to do this, but this was a known "thing to address" in #626
Since KG2 now has a bunch of different edge types, (eg. ARAX_query.py 202
results in 'molecularly_interacts_with': 2153, 'J1': 444, 'related_to': 115, 'is_substance_that_treats': 76, 'causes_or_contributes_to': 32, 'predisposes': 23, 'associated_with': 23, 'affects': 18, 'capable_of_inhibiting_or_preventing_pathological_process': 9, 'increase': 6, 'disrupts': 5, 'produces': 5}
), might not want to compute jaccard treating all these edges as "the same". Also speaks to #699 where a hint about what sorts of connections are of interest would be helpful to then indicate to jaccard distance which edge type(s) to concentrate on.
@finnagin @saramsey ARAX_query.py 202
(demo example 2, Parkinsons without the ML component) appears to be working just fine even with filter_kg
, resultify
and filter_results
. However, of course the list of drugs we get back do not contain cildinipine (sp?) due to different filtering thresholds mentioned above.
However, of course the list of drugs we get back do not contain cildinipine (sp?) due to different filtering thresholds mentioned above.
A potential confounding factor here is that KG2 actually currently contains fewer proteins between Parkinson's and cilnidipine, because KG2 ingests the version of KG1 at kg1endpoint.rtx.ai, which is apparently slightly different than the KG1 at arax.rtx.ai... see #591
match p=(n00 {id:'DOID:14330'})--(n01:protein)--(n02:chemical_substance) where toLower(n02.name)="cilnidipine" return p limit 30
kg2endpoint2.rtx.ai:
arax.rtx.ai:
Yeah, that would be a problem. I was under the assumption that KG1 was a proper subset of kg2endpoint2. @saramsey can you confirm that KG1/arax.rtx.ai is not a proper subset/graph of kg2endpoint2? Also, how difficult would to be to ensure that it is? Seems like a rather unfortunate thing that we're losing nodes and edges in kg2endpoint2 (i.e. some reasoning appears to be better on KG1)
@amykglen @finnagin Please see #702. Pulled this out as a separate issue as it might require a bit of coordination. See 5a2ec3c4 on overlay
branch for the example of the problem
One issue I've run into testing filter_kg is that the provided_by field is now populated by a string of a list. (e.g. "Pharos"
is now "['https://pharos.nih.gov']"
)
However, actions_parser automatically detects if the argument is a list and converts it to a list rather than a string. Thus, I'm not entirely sure how to pass a string of a list. If I pass without quotes it reads it as a list while if I pass it with the quotes it reads those as part of the string. (i.e. as "\"['https://pharos.nih.gov']\""
).
I can add some logic to filter_kg_by_property that that will check if a list is passed as an parameter for the provided_by field and change it to a string but I'm not sure if that is the best way to go about fixing this or if something should be changed in expander/actions_parser. Is the provided_by field supposed to be a string of a list now or is that an error?
@finnagin I would ask @edeutsch if the API dictates that the provided_by
must be a list and if it must be URL's and not names (and see if @saramsey made them URLs by some design choice). If so, you might be able to change your code (to make it KG1 and KG2 compatible) by checking the type of the provided_by
in the KG and then either seeing if it matches the DSL filter option, or if the DSL provided string (eg "pharos") is in any of the strings in the list given by KG provided_by
list.
Also, if they must be URL's, it will be handy for us to have a list somewhere of what URL's are actually there for our easy reference. And your code should fail gracefully (I think it already does) if the DSL filter name doesn't exist, maybe reporting what provided_by
's actually exist in the KG in some Warning
).
According to the current spec: https://github.com/NCATS-Tangerine/NCATS-ReasonerStdAPI/blob/master/API/TranslatorReasonersAPI_optional.yaml is_defined_by: type: string example: reasoner description: A CURIE/URI for the translator group that made the KG provided_by: type: string example: OMIM description: A CURIE/URI for the knowledge source that defined this edge neither are lists. These are scalars But I certainly see a reason for making them a list. We can propose it. Do we actually have lists in the KG? Their form is a bit vague. The docs say they should be CURIEs or URIs. But until we (NCATS groups) all step back and compare results and decide on the one way to do it, it's just much ado and no conclusion: https://github.com/biolink/biolink-model/issues/175
So... IMO, put whatever you want there and be prepared to change it when Translator collectively realizes that that practice (everyone does as they please) is not sustainable.
Nice - yes, edge provided_by
is a list in KG2 (and I confirmed there are at least some instances of edges having more than one item in their provided_by
list).
Though the property shouldn't be appearing as a string of a list on Edge
s output by expand
, as Finn reported - that's definitely an issue in expand
I can fix. If we want to go ahead and switch to the list paradigm for this property, I can make it an actual list. Or I guess our only other option at the moment is for expand
to just grab the first item in provided_by
for edges coming from KG2.
I would suggest making it a list and see if the system explodes. If it does, then back off to list[0]. We should bring it up to the standards committee in any case.
Just so I'm sure I understand, is the issue that during the building of KG2 if two different KSs make exactly the same assertion (edge) with matching CURIEs and everything, then the edges are not duplicated but rather the provided_by list is extended? Is that right?
That's also my understanding, though I don't know for sure. @saramsey would. :)
So right now, expand fills out Edge.provided_by
as a list for edges from KG2, but still as a string for edges from KG1. Do we want to switch over fully to a list? (And propose changing this in the API standard?)
I believe the only thing that has broken so far when provided_by
is a list is filter_kg(action=remove_edges_by_property, edge_property=provided_by
, which was being discussed above. (Assuming we do want to move to a list, I can create a separate issue for that.)
@amykglen This is actually fixed now in the filter branch. filter_kg
should work on both the string and list versions of provided_by
. I currently have it set up so that it if a property is a list remove_edges_by_property
will remove any edge that contains the provided string within that list.
Note that provided_by is not really in the core part of the standard (although I will push again for it to be so). It is in the "experimental" part.
One thing to consider is there are other attributes of an Edge that might vary from provider to provider, e.g. confidence, publications, qualifiers, edge_attributes. And therefore edge coalescence may only be appropriate if don't want to track such information "per provider". If we do, then they have to be separate edges.
@finnagin - Oh awesome! Thanks!
@edeutsch - Ah, got it. Yeah, that's a great point about edge coalescence... Seems like an interesting philosophical discussion.
Just so I'm sure I understand, is the issue that during the building of KG2 if two different KSs make exactly the same assertion (edge) with matching CURIEs and everything, then the edges are not duplicated but rather the provided_by list is extended? Is that right?
@amykglen is correct -- yes
Note that provided_by is not really in the core part of the standard (although I will push again for it to be so). It is in the "experimental" part.
One thing to consider is there are other attributes of an Edge that might vary from provider to provider, e.g. confidence, publications, qualifiers, edge_attributes. And therefore edge coalescence may only be appropriate if don't want to track such information "per provider". If we do, then they have to be separate edges.
For publications, I felt that it is unlikely to matter which provider provided which publication data for a given assertion, so combining the publications is what I currently do. But perhaps there is a way to encode the source along with the publication info.
Note that provided_by is not really in the core part of the standard (although I will push again for it to be so). It is in the "experimental" part.
I included it in KG2 because it is in the Biolink model. I have opened an issue to suggest that a list type be allowed for provided_by
I think this issue can be closed; functioning as desired, and remaining synonym-related work is happening in #707/#823.
Creating this task to track our implementation of the (pre-relay meeting) plan we came up with for #664.
Note: We decided to first try to tackle this problem of synonyms, then worry about allowing multiple node types (e.g., 'gene' and 'protein') and 'subclass of' powers.
(Others should feel free to edit this description.)