Closed dkoslicki closed 1 year ago
I think this is because this is an "inferred"/creative query, meaning all those edges were added by ARAXInfer. I'm guessing ARAXInfer must not do any 'decorating' of edges? @chunyuma
(note this issue came up in the last paragraph of this comment too)
there is an ARAXDecorator class that expand uses to add attributes to bare KG2 edges, though it looks like ARAXInfer returns before reaching that step in Expand. this is where Expand uses it: https://github.com/RTXteam/RTX/blob/9bf05a5cc7cbb731ebbaba32a786b8f821780c45/code/ARAX/ARAXQuery/ARAX_expander.py#L679-L683
but ARAXInfer could call ARAXDecorator itself. I'm guessing it won't have 100% coverage of edges if ARAXInfer is using an older version of KG2, but it'd probably get a good portion at least(?)
in addition to using ARAXDecorator, I think ARAXInfer would ideally also add some sort of attribute to each edge indicating that that edge was added by ARAXInfer - otherwise it looks like the edge simply came from KG2, which is confusing for debugging/understanding results...
@amykglen, thanks for the suggestions and thoughts for this issue. I will take a look and try to fix this issue within the next few days.
ok, I've added more attributes to the edges generated by ARAXInfer:
infores:rtx-kg2:DRUGBANK:DB11164-biolink:physically_interacts_with-FMA:67238
{'attributes': [{'attribute_source': 'infores:arax',
'attribute_type_id': 'biolink:aggregator_knowledge_source',
'attributes': None,
'description': None,
'original_attribute_name': 'provided_by',
'value': 'infores:arax',
'value_type_id': 'biolink:InformationResource',
'value_url': None},
{'attribute_source': 'infores:arax',
'attribute_type_id': 'biolink:computed_value',
'attributes': None,
'description': 'This edge is inferred by ARAXInfer based on '
'RTX-KG2c.',
'original_attribute_name': None,
'value': True,
'value_type_id': 'metatype:Boolean',
'value_url': None}],
'object': 'FMA:67238',
'predicate': 'biolink:physically_interacts_with',
'qualifiers': None,
'subject': 'DRUGBANK:DB11164'}
@dkoslicki @amykglen, are these attributes enough or we need more?
I think they specifically wanted primary_knowledge_source
. @amykglen Any idea where to get this information?
I didn't see the primary_knowledge_source
attribute appears in our ARAXOverlay edges (see an example below). I'm wondering if biolink:aggregator_knowledge_source
is the same as primary_knowledge_source
because they are the computed edges provided by ARAX.
edge_attribute_list = [
EdgeAttribute(original_attribute_name="virtual_relation_label", value=relation, attribute_type_id="biolink:Unknown"),
EdgeAttribute(original_attribute_name="defined_datetime", value=defined_datetime, attribute_type_id="metatype:Datetime"),
EdgeAttribute(original_attribute_name="provided_by", value=provided_by, attribute_type_id="biolink:aggregator_knowledge_source", attribute_source=provided_by, value_type_id="biolink:InformationResource"),
EdgeAttribute(original_attribute_name=None, value=True, attribute_type_id="biolink:computed_value", attribute_source="infores:arax", value_type_id="metatype:Boolean", value_url=None, description="This edge is a container for a computed value between two nodes that is not directly attachable to other edges.")
]
I think they were looking for knowledge_source
, sorry, as seen below:
indeed primary_knowledge_source is the most crucial: https://biolink.github.io/biolink-model/docs/primary_knowledge_source.html
Tyler has declared that every edge must have a primary_knowledge_source and this is very important.
For the machine learning model-inferred edges, we need to come up with a way to acknowledge that the inferred edge is a product of this algorithm. If neither "arax" nor "ktx-kg2" seem like the appropriate source to acknowledge, then perhaps we need to register a new knowledge source that captures this.
If we don't know have model-inferred edges yet, and just the supporting edges, then those should have their primary_knowledge_provider information from RTX-KG2
ok, thanks @edeutsch. I've added the knowledge_source
attributes to all Overlay predicates and ARAXInfer edges. Temporarily, I used "infores:rtx-kg2" as its value, but we can change it if we can have a more appropriate source.
note: right now what is called knowledge_source
on KG2 edges is actually the primary_knowledge_source
- we just need to update the name of that attribute. (that should be a very quick fix; I'll try to do it today).
actually, I take that back - it sounds like the definition of primary_knowledge_source
is more complex, and often we don't know what that is for KG2 edges:
https://github.com/biolink/biolink-model/blob/24efce8b69c8210c0d98f30534b7d2736c19a57b/biolink-model.yaml#L5017-L5032
primary knowledge source: is_a: knowledge source description: >- The most upstream source of the knowledge expressed in an Association that an implementer can identify. Performing a rigorous analysis of upstream data providers is expected; every effort is made to catalog the most upstream source of data in this property. Only one data source should be declared primary in any association. "aggregator knowledge source" can be used to caputre non-primary sources. range: information resource notes: >- For example: a single ChemicalToGene Edge originally curated by ClinicalTrials.org, is aggregated by ChEMBL, then incorporated into the MolePro KP, then sent via TRAPI message to the ARAGORN ARA, and finally sent to the NCATS ARS. The retrieval path for this Edge is as follows: ARS--retrieved_from--> ARAGORN --retrieved_from--> MolePro --retrieved_from--> ChEMBL --retrieved_from--> ClinicalTrials.gov The "primary knowledge source" for this edge is "infores:clinical-trials-gov". "infores:chembl" and "infores:molecular_data_provider" are listed in the "aggregator knowledge source" property. multivalued: false
Great. indeed "primary_knowledge_source" is what Tyler is looking for. A "primary_knowledge_source" value for every edge is important to him.
If we are missing "primary_knowledge_source" for any edges in KG2, it should be a top priority to create them. Tyler is very passionate about this.
@saramsey @acevedol - do we know any of the primary_knowledge_source
s for KG2 edges, according to this definition? would we need to sit down and determine for each of KG2's knowledge sources, what its primary_knowledge_source
is?
if we came up with a map of knowledge_source
(as stored on KG2pre edges) --> primary_knowledge_source
, that could easily be incorporated into Expand (not necessitating a KG2 rebuild). but if edges from the same knowledge_source
can come from different primary_knowledge_source
s, that may be a bit more involved...
Must it be a single string, or a list? If the latter, a union seems to make sense. If the former, then I guess we can't do much better than to pick an arbitrary one when there are multiple primaries
according to the definition it has to be a string; only one data source can be declared primary per edge. that might be a little bit of a problem for KG2 in its current design, since it merges edges from different knowledge sources.
re: multiple primary_knowledge_source
s per knowledge source: I meant like, say some Chembl edges originate from clinicaltrials.gov, but other Chembl edges come from some other primary data source. In that case there's still only one primary_knowledge_source
for each Chembl edge, but that primary_knowledge_source
isn't the same for all Chembl edges.
Ah, I see now. Given how much it's being emphasized, maybe the simple map would be a 1st order approach (though will be wrong for some edges), and then do it correctly the next KG2 build?
yeah, that seems reasonable to me. though I know nothing currently about what the primary_knowledge_source
s are for KG2pre's sources - seems like it will take some research. could @acevedol or @saramsey possibly help with this?
I believe KG2pre's knowledge sources are stored in a list with the key "knowledge_source". When we find an edge with an existing key, but a different knowledge source, it is appended to the list. For example, "knowledge_source": [ "infores:fma-obo", "infores:fma-umls" ],
@chunyuma - have you looked into using ARAXDecorator for Infer queries yet? (as I noted here). on today's TAQA call, there's a lot of concern over our creative mode edges (the ones Infer pulls out of KG2) lacking attributes. the edges show that they came from ARAX, but they don't have any of the other EPC attributes that are present on those edges in KG2 (like publications, etc.):
https://arax.ncats.io/?r=134431
if you just run the KG2 edges that Infer uses for explanation graphs through ARAXDecorator, that should add the missing attributes.
some of the edges still won't be decorated properly currently because they're from an old KG2 version, but I would expect that many edges would
also, I'm not sure if using ARAXDecorator would add KG2's underlying primary knowledge source (e.g., chembl) to the edge, which would be a problem. maybe we should do some brainstorming to figure out how we can get all these attributes onto creative mode edges
fyi, this is the summary of today's discussion where this issue came up: https://github.com/NCATSTranslator/Feedback/issues/152#issuecomment-1518102547
also, I don't think this message on the above example edge is quite right:
This edge is inferred by ARAXInfer based on RTX-KG2c.
that particular edge is not an inferred edge, but instead an actual edge in KG2 that ARAXInfer is using to help explain one of its inferred edges, right?
I think most everyone in TAQA assumed that that edge was an inferred edge - i.e., one that ARAX 'made up' - because of that message.
I do think its ideal to indicate somehow that it was ARAXInfer that added that KG2 edge to the answer, but we should make it clear that that edge wasn't inferred.
Here are my thoughts to solve this issue:
status from minihackathon: @chunyuma expects to address this by tomorrow
re: the message This edge is inferred by ARAXInfer based on RTX-KG2c
- I think we should use that message for the treats
edge that ARAXInfer creates (infers), but maybe we should use something like This edge was extracted from RTX-KG2c by ARAXInfer
for the KG2 edges that it uses to explain the predicted drug
it would also be awesome if it included the KG2 version number! so, like:
This edge was extracted from RTX-KG2.7.3c by ARAXInfer
but if that's complex we can skip that
also, right now that message is in a 'computed value' attribute, but I wouldn't think supporting KG2 edges should be considered computed values..
I'd think the source-related attributes for supporting KG2 edges should look something like this:
attribute_source: infores:arax
attribute_type_id: biolink:aggregator_knowledge_source
value_type_id: biolink:InformationResource
value: infores:arax
attribute_source: infores:arax
attribute_type_id: biolink:aggregator_knowledge_source
value_type_id: biolink:InformationResource
value: infores:rtx-kg2
description: "This edge was extracted from RTX-KG2.7.3c by ARAXInfer"
attribute_source: infores:arax
attribute_type_id: biolink:primary_knowledge_source
value_type_id: biolink:InformationResource
value: infores:drugbank
also note that the way that edge sources are recorded has changed in TRAPI 1.4, so this information is stored under Edge.sources
in the NewFmt
branch, rather than under attributes
...
@amykglen, regarding the source-related attributes for supporting KG2 edges, I saw the Edge
format under the NewFmt
branch looks like this:
'creative_DTD_prediction_24': {'attributes': [{'attribute_source': None,
'attribute_type_id': 'metatype:Datetime',
'attributes': None,
'description': None,
'original_attribute_name': 'defined_datetime',
'value': '2023-04-26 16:53:18',
'value_type_id': None,
'value_url': None},
{'attribute_source': 'infores:arax',
'attribute_type_id': 'EDAM-DATA:1772',
'attributes': None,
'description': 'This edge is a container for a computed value '
'between two nodes that is not directly '
'attachable to other edges.',
'original_attribute_name': None,
'value': True,
'value_type_id': 'metatype:Boolean',
'value_url': None},
{'attribute_source': None,
'attribute_type_id': 'EDAM-DATA:0951',
'attributes': None,
'description': None,
'original_attribute_name': 'probability_treats',
'value': '0.8208053384790936',
'value_type_id': None,
'value_url': None}],
'object': 'MONDO:0004975',
'predicate': 'biolink:treats',
'qualifiers': None,
'sources': [{'resource_id': 'infores:arax',
'resource_role': 'biolink:primary_knowledge_source',
'source_record_urls': None,
'upstream_resource_ids': None},
{'resource_id': 'infores:arax',
'resource_role': 'biolink:aggregator_knowledge_source',
'source_record_urls': None,
'upstream_resource_ids': None}],
'subject': 'RXNORM:25696'}
Is there correct? It seem not like what you told me in this comment.
The information under Edge.sources
doesn't have attribute_source
, attribute_type_id
, value_type_id
, value
and description
. These attributes are under Edge.attributes
. Should I still put them under Edge.sources
?
@chunyuma - yes, the example I gave you was in TRAPI 1.3 format. for NewFmt
/TRAPI 1.4 - yes, that information has been moved to Edge.sources
, which is a bit different. would look something like this I suppose, for a supporting KG2 edge:
'sources': [{'resource_id': 'infores:drugbank',
'resource_role': 'biolink:primary_knowledge_source',
'source_record_urls': None,
'upstream_resource_ids': None},
{'resource_id': 'infores:rtx-kg2',
'resource_role': 'biolink:aggregator_knowledge_source',
'source_record_urls': None,
'upstream_resource_ids': ['infores:drugbank']},
{'resource_id': 'infores:arax',
'resource_role': 'biolink:aggregator_knowledge_source',
'source_record_urls': None,
'upstream_resource_ids': ['infores:rtx-kg2']}],
and I suppose there is no longer a description
property, so perhaps we can't include a message about ARAXInfer extracting the edge from KG2 anyway, so we can't make it obvious that the edge was extracted by ARAXInfer. unless ARAXInfer had its own infores curie, so we could add it as another source (between kg2 and arax)...
ok, thanks @amykglen! Considering there are still lots of changes in NewFmt
/TRAPI 1.4, I will not do this fix in NewFmt
/TRAPI 1.4 but will do it in the master
branch.
Good idea to fix/develop as much as possible that is not related to TRAPI 1.4 breaking changes in master
. Although I have not done it in a while, it is the goal to keep merging master
into NewFmt
regularly.
and I suppose there is no longer a description property, so perhaps we can't include a message about ARAXInfer extracting the edge from KG2 anyway, so we can't make it obvious that the edge was extracted by ARAXInfer. unless ARAXInfer had its own infores curie, so we could add it as another source (between kg2 and arax)...
I would say that the main inferred "treats" edge should have a primary_knowledge_source of arax. But all the supporting edges in the auxiliary graph should have their primary_knowledge_sources originating with drugbank, etc.
I would say that the main inferred "treats" edge should have a primary_knowledge_source of arax. But all the supporting edges in the auxiliary graph should have their primary_knowledge_sources originating with drugbank, etc.
yes, I completely agree. I was trying to think of a way that allows us to, in addition, continue to show on a supporting KG2 edge that that edge was extracted from KG2 by ARAXInfer vs. retrieved via the KG2 API, because that's very helpful for debugging reasons. but it's ok if we no longer have a way of indicating that with TRAPI 1.4
Hi @amykglen and @edeutsch, I have added the missing EPC attributes to the ARAXInfer-generated KG2 edges and also changed the message to This edge was extracted from RTX-KG2.7.3c by ARAXInfer
. I tested it locally and it works.
Eric, could you please roll out the master branch to the server so that we can test it on /beta? Thank you! By the way, I've updated the database ExplainableDTD_v1.0_KG2.8.0.db
under /home/rtxconfig/KG2.8.0/ on the arax-databases.rtx.ai
server for this goal.
Hi @chunyuma I have deployed master
to all endpoints that follow master
including /beta. Please go ahead and test.
code/ARAX/ARAXQuery/ARAX_database_manager.py | 6 +-
code/ARAX/ARAXQuery/ARAX_decorator.py | 133 ++++++++++++++++++---------
code/ARAX/ARAXQuery/ARAX_infer.py | 1 +
code/ARAX/ARAXQuery/Infer/scripts/ExplianableDTD_db.py | 2 +-
code/ARAX/ARAXQuery/Infer/scripts/build_mapping_db.py | 304 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
code/ARAX/ARAXQuery/Infer/scripts/infer_utilities.py | 260 +++++++++++++++++++++++++---------------------------
code/ARAX/ARAXQuery/operation_to_ARAXi.py | 17 +++-
code/ARAX/test/test_ARAX_infer.py | 2 +-
code/RTXConfiguration.py | 2 +-
I tried running the ARAX_database_manager but it did not seem to think anything needed to be updated. I don't know if that is a problem or expected.
thanks @chunyuma! since you changed the database, you have to change the database name somehow (i.e., bump v1.0 --> v1.1) in order for the database manager to know to download the new database.
I'll go ahead and make this change now, and also upload the revised database to the ITRB SFTP server (after which I'll update config_dbs.json in master); at that point I think running the database manager should do the trick
ok - I made those changes (renamed the updated db from v1.0 --> v1.1, uploaded it to the ITRB SFTP server, and changed config_dbs.json
in master)
(I confirmed before doing those things that the ARAXInfer tests were failing in master, likely because the new code was using the old database; hopefully after this they pass)
so @edeutsch at this point if you roll out master
and run the database manager, I think @chunyuma's fix should be deployed
@chunyuma - using the new database, this test is failing in master:
FAILED test_ARAX_expand.py::test_xdtd_different_predicates - AssertionError: assert ('ERROR' == 'OK'
Hi @amykglen @chunyuma I was about to roll out master
and manage databases. But should I wait until the above test is addressed? Or full steam ahead?
I would definitely say roll it out, since prior to my changes a whole bunch of XDTD tests were failing. now it's just one!
okay, I have rolled out master
to all the endpoints following master
(devED, devLM, /test, /beta, kg2beta) and run the database manager on all as well. It seems to have updated to 1.1 okay:
explainable_dtd_db has a local version, '1.0_KG2.8.0', which does not match the remote version, '1.1_KG2.8.0'.
The example query seems to be working although calls to the service provider seem to be failing, although probably not our fault:
@chunyuma will look into this test:
FAILED test_ARAX_expand.py::test_xdtd_different_predicates - AssertionError: assert ('ERROR' == 'OK'
Once all tests pass in master
, then can close this issue
Confirmed that this is the only test failing now:
====================================== 1 failed, 116 passed, 125 skipped, 61904 warnings in 439.00s (0:07:18) =======================================
I've fixed this bug:
FAILED test_ARAX_expand.py::test_xdtd_different_predicates - AssertionError: assert ('ERROR' == 'OK'
I think we can close this issue. XDTD KG2 edges now look great except for their sources, which I wrote up in #2033
For ARAX results: https://arax.rtx.ai/?r=5dc8443a-f275-4954-be53-830d000d9878 It looks like a lot of edges have with only
biolink:aggregator_knowledge_source
, and nothing else (including primary knowledge source). I don't know if this is anexpand
issue (during the decorating phase), or a KG2 issue. So tagging @amykglen and @saramsey