RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
33 stars 20 forks source link

Missing EPC #1970

Closed dkoslicki closed 1 year ago

dkoslicki commented 1 year ago

For ARAX results: https://arax.rtx.ai/?r=5dc8443a-f275-4954-be53-830d000d9878 It looks like a lot of edges have with only biolink:aggregator_knowledge_source, and nothing else (including primary knowledge source). I don't know if this is an expand issue (during the decorating phase), or a KG2 issue. So tagging @amykglen and @saramsey

amykglen commented 1 year ago

I think this is because this is an "inferred"/creative query, meaning all those edges were added by ARAXInfer. I'm guessing ARAXInfer must not do any 'decorating' of edges? @chunyuma

(note this issue came up in the last paragraph of this comment too)

there is an ARAXDecorator class that expand uses to add attributes to bare KG2 edges, though it looks like ARAXInfer returns before reaching that step in Expand. this is where Expand uses it: https://github.com/RTXteam/RTX/blob/9bf05a5cc7cbb731ebbaba32a786b8f821780c45/code/ARAX/ARAXQuery/ARAX_expander.py#L679-L683

but ARAXInfer could call ARAXDecorator itself. I'm guessing it won't have 100% coverage of edges if ARAXInfer is using an older version of KG2, but it'd probably get a good portion at least(?)

in addition to using ARAXDecorator, I think ARAXInfer would ideally also add some sort of attribute to each edge indicating that that edge was added by ARAXInfer - otherwise it looks like the edge simply came from KG2, which is confusing for debugging/understanding results...

chunyuma commented 1 year ago

@amykglen, thanks for the suggestions and thoughts for this issue. I will take a look and try to fix this issue within the next few days.

chunyuma commented 1 year ago

ok, I've added more attributes to the edges generated by ARAXInfer:

infores:rtx-kg2:DRUGBANK:DB11164-biolink:physically_interacts_with-FMA:67238
{'attributes': [{'attribute_source': 'infores:arax',
                 'attribute_type_id': 'biolink:aggregator_knowledge_source',
                 'attributes': None,
                 'description': None,
                 'original_attribute_name': 'provided_by',
                 'value': 'infores:arax',
                 'value_type_id': 'biolink:InformationResource',
                 'value_url': None},
                {'attribute_source': 'infores:arax',
                 'attribute_type_id': 'biolink:computed_value',
                 'attributes': None,
                 'description': 'This edge is inferred by ARAXInfer based on '
                                'RTX-KG2c.',
                 'original_attribute_name': None,
                 'value': True,
                 'value_type_id': 'metatype:Boolean',
                 'value_url': None}],
 'object': 'FMA:67238',
 'predicate': 'biolink:physically_interacts_with',
 'qualifiers': None,
 'subject': 'DRUGBANK:DB11164'}

@dkoslicki @amykglen, are these attributes enough or we need more?

dkoslicki commented 1 year ago

I think they specifically wanted primary_knowledge_source. @amykglen Any idea where to get this information?

chunyuma commented 1 year ago

I didn't see the primary_knowledge_source attribute appears in our ARAXOverlay edges (see an example below). I'm wondering if biolink:aggregator_knowledge_source is the same as primary_knowledge_source because they are the computed edges provided by ARAX.

edge_attribute_list = [
    EdgeAttribute(original_attribute_name="virtual_relation_label", value=relation, attribute_type_id="biolink:Unknown"),
    EdgeAttribute(original_attribute_name="defined_datetime", value=defined_datetime, attribute_type_id="metatype:Datetime"),
    EdgeAttribute(original_attribute_name="provided_by", value=provided_by, attribute_type_id="biolink:aggregator_knowledge_source", attribute_source=provided_by, value_type_id="biolink:InformationResource"),
    EdgeAttribute(original_attribute_name=None, value=True, attribute_type_id="biolink:computed_value", attribute_source="infores:arax", value_type_id="metatype:Boolean", value_url=None, description="This edge is a container for a computed value between two nodes that is not directly attachable to other edges.")
]
dkoslicki commented 1 year ago

I think they were looking for knowledge_source, sorry, as seen below: image

edeutsch commented 1 year ago

indeed primary_knowledge_source is the most crucial: https://biolink.github.io/biolink-model/docs/primary_knowledge_source.html

Tyler has declared that every edge must have a primary_knowledge_source and this is very important.

For the machine learning model-inferred edges, we need to come up with a way to acknowledge that the inferred edge is a product of this algorithm. If neither "arax" nor "ktx-kg2" seem like the appropriate source to acknowledge, then perhaps we need to register a new knowledge source that captures this.

If we don't know have model-inferred edges yet, and just the supporting edges, then those should have their primary_knowledge_provider information from RTX-KG2

chunyuma commented 1 year ago

ok, thanks @edeutsch. I've added the knowledge_source attributes to all Overlay predicates and ARAXInfer edges. Temporarily, I used "infores:rtx-kg2" as its value, but we can change it if we can have a more appropriate source.

amykglen commented 1 year ago

note: right now what is called knowledge_source on KG2 edges is actually the primary_knowledge_source - we just need to update the name of that attribute. (that should be a very quick fix; I'll try to do it today).

amykglen commented 1 year ago

actually, I take that back - it sounds like the definition of primary_knowledge_source is more complex, and often we don't know what that is for KG2 edges: https://github.com/biolink/biolink-model/blob/24efce8b69c8210c0d98f30534b7d2736c19a57b/biolink-model.yaml#L5017-L5032

primary knowledge source: is_a: knowledge source description: >- The most upstream source of the knowledge expressed in an Association that an implementer can identify. Performing a rigorous analysis of upstream data providers is expected; every effort is made to catalog the most upstream source of data in this property. Only one data source should be declared primary in any association. "aggregator knowledge source" can be used to caputre non-primary sources. range: information resource notes: >- For example: a single ChemicalToGene Edge originally curated by ClinicalTrials.org, is aggregated by ChEMBL, then incorporated into the MolePro KP, then sent via TRAPI message to the ARAGORN ARA, and finally sent to the NCATS ARS. The retrieval path for this Edge is as follows: ARS--retrieved_from--> ARAGORN --retrieved_from--> MolePro --retrieved_from--> ChEMBL --retrieved_from--> ClinicalTrials.gov The "primary knowledge source" for this edge is "infores:clinical-trials-gov". "infores:chembl" and "infores:molecular_data_provider" are listed in the "aggregator knowledge source" property. multivalued: false

edeutsch commented 1 year ago

Great. indeed "primary_knowledge_source" is what Tyler is looking for. A "primary_knowledge_source" value for every edge is important to him.

edeutsch commented 1 year ago

If we are missing "primary_knowledge_source" for any edges in KG2, it should be a top priority to create them. Tyler is very passionate about this.

amykglen commented 1 year ago

@saramsey @acevedol - do we know any of the primary_knowledge_sources for KG2 edges, according to this definition? would we need to sit down and determine for each of KG2's knowledge sources, what its primary_knowledge_source is?

if we came up with a map of knowledge_source (as stored on KG2pre edges) --> primary_knowledge_source, that could easily be incorporated into Expand (not necessitating a KG2 rebuild). but if edges from the same knowledge_source can come from different primary_knowledge_sources, that may be a bit more involved...

dkoslicki commented 1 year ago

Must it be a single string, or a list? If the latter, a union seems to make sense. If the former, then I guess we can't do much better than to pick an arbitrary one when there are multiple primaries

amykglen commented 1 year ago

according to the definition it has to be a string; only one data source can be declared primary per edge. that might be a little bit of a problem for KG2 in its current design, since it merges edges from different knowledge sources.

re: multiple primary_knowledge_sources per knowledge source: I meant like, say some Chembl edges originate from clinicaltrials.gov, but other Chembl edges come from some other primary data source. In that case there's still only one primary_knowledge_source for each Chembl edge, but that primary_knowledge_source isn't the same for all Chembl edges.

dkoslicki commented 1 year ago

Ah, I see now. Given how much it's being emphasized, maybe the simple map would be a 1st order approach (though will be wrong for some edges), and then do it correctly the next KG2 build?

amykglen commented 1 year ago

yeah, that seems reasonable to me. though I know nothing currently about what the primary_knowledge_sources are for KG2pre's sources - seems like it will take some research. could @acevedol or @saramsey possibly help with this?

acevedol commented 1 year ago

I believe KG2pre's knowledge sources are stored in a list with the key "knowledge_source". When we find an edge with an existing key, but a different knowledge source, it is appended to the list. For example, "knowledge_source": [ "infores:fma-obo", "infores:fma-umls" ],

amykglen commented 1 year ago

@chunyuma - have you looked into using ARAXDecorator for Infer queries yet? (as I noted here). on today's TAQA call, there's a lot of concern over our creative mode edges (the ones Infer pulls out of KG2) lacking attributes. the edges show that they came from ARAX, but they don't have any of the other EPC attributes that are present on those edges in KG2 (like publications, etc.):

Screen Shot 2023-04-21 at 9 34 33 AM

https://arax.ncats.io/?r=134431

if you just run the KG2 edges that Infer uses for explanation graphs through ARAXDecorator, that should add the missing attributes.

amykglen commented 1 year ago

some of the edges still won't be decorated properly currently because they're from an old KG2 version, but I would expect that many edges would

amykglen commented 1 year ago

also, I'm not sure if using ARAXDecorator would add KG2's underlying primary knowledge source (e.g., chembl) to the edge, which would be a problem. maybe we should do some brainstorming to figure out how we can get all these attributes onto creative mode edges

amykglen commented 1 year ago

fyi, this is the summary of today's discussion where this issue came up: https://github.com/NCATSTranslator/Feedback/issues/152#issuecomment-1518102547

amykglen commented 1 year ago

also, I don't think this message on the above example edge is quite right:

This edge is inferred by ARAXInfer based on RTX-KG2c.

that particular edge is not an inferred edge, but instead an actual edge in KG2 that ARAXInfer is using to help explain one of its inferred edges, right?

I think most everyone in TAQA assumed that that edge was an inferred edge - i.e., one that ARAX 'made up' - because of that message.

I do think its ideal to indicate somehow that it was ARAXInfer that added that KG2 edge to the answer, but we should make it clear that that edge wasn't inferred.

chunyuma commented 1 year ago

Here are my thoughts to solve this issue:

  1. Considering these "creative mode" edges are all pulled from KG2.7.3c, I can base on the triple to extract the attribute from KG2.7.3c tsv files. How do you think @amykglen?
  2. Regarding the message "This edge is inferred by ARAXInfer based on RTX-KG2c.", it is easy to change the message. What message do you think is more appropriate?
amykglen commented 1 year ago

status from minihackathon: @chunyuma expects to address this by tomorrow

amykglen commented 1 year ago

re: the message This edge is inferred by ARAXInfer based on RTX-KG2c - I think we should use that message for the treats edge that ARAXInfer creates (infers), but maybe we should use something like This edge was extracted from RTX-KG2c by ARAXInfer for the KG2 edges that it uses to explain the predicted drug

amykglen commented 1 year ago

it would also be awesome if it included the KG2 version number! so, like:

This edge was extracted from RTX-KG2.7.3c by ARAXInfer

but if that's complex we can skip that

amykglen commented 1 year ago

also, right now that message is in a 'computed value' attribute, but I wouldn't think supporting KG2 edges should be considered computed values..

I'd think the source-related attributes for supporting KG2 edges should look something like this:

attribute_source: infores:arax
attribute_type_id: biolink:aggregator_knowledge_source
value_type_id: biolink:InformationResource
value: infores:arax

attribute_source: infores:arax
attribute_type_id: biolink:aggregator_knowledge_source
value_type_id: biolink:InformationResource
value: infores:rtx-kg2
description: "This edge was extracted from RTX-KG2.7.3c by ARAXInfer"

attribute_source: infores:arax
attribute_type_id: biolink:primary_knowledge_source
value_type_id: biolink:InformationResource
value: infores:drugbank

also note that the way that edge sources are recorded has changed in TRAPI 1.4, so this information is stored under Edge.sources in the NewFmt branch, rather than under attributes...

chunyuma commented 1 year ago

@amykglen, regarding the source-related attributes for supporting KG2 edges, I saw the Edge format under the NewFmt branch looks like this:

 'creative_DTD_prediction_24': {'attributes': [{'attribute_source': None,
                  'attribute_type_id': 'metatype:Datetime',
                  'attributes': None,
                  'description': None,
                  'original_attribute_name': 'defined_datetime',
                  'value': '2023-04-26 16:53:18',
                  'value_type_id': None,
                  'value_url': None},
                 {'attribute_source': 'infores:arax',
                  'attribute_type_id': 'EDAM-DATA:1772',
                  'attributes': None,
                  'description': 'This edge is a container for a computed value '
                                 'between two nodes that is not directly '
                                 'attachable to other edges.',
                  'original_attribute_name': None,
                  'value': True,
                  'value_type_id': 'metatype:Boolean',
                  'value_url': None},
                 {'attribute_source': None,
                  'attribute_type_id': 'EDAM-DATA:0951',
                  'attributes': None,
                  'description': None,
                  'original_attribute_name': 'probability_treats',
                  'value': '0.8208053384790936',
                  'value_type_id': None,
                  'value_url': None}],
  'object': 'MONDO:0004975',
  'predicate': 'biolink:treats',
  'qualifiers': None,
  'sources': [{'resource_id': 'infores:arax',
               'resource_role': 'biolink:primary_knowledge_source',
               'source_record_urls': None,
               'upstream_resource_ids': None},
              {'resource_id': 'infores:arax',
               'resource_role': 'biolink:aggregator_knowledge_source',
               'source_record_urls': None,
               'upstream_resource_ids': None}],
  'subject': 'RXNORM:25696'}

Is there correct? It seem not like what you told me in this comment.

chunyuma commented 1 year ago

The information under Edge.sources doesn't have attribute_source, attribute_type_id, value_type_id, value and description. These attributes are under Edge.attributes. Should I still put them under Edge.sources?

amykglen commented 1 year ago

@chunyuma - yes, the example I gave you was in TRAPI 1.3 format. for NewFmt/TRAPI 1.4 - yes, that information has been moved to Edge.sources, which is a bit different. would look something like this I suppose, for a supporting KG2 edge:

  'sources': [{'resource_id': 'infores:drugbank',
               'resource_role': 'biolink:primary_knowledge_source',
               'source_record_urls': None,
               'upstream_resource_ids': None},
              {'resource_id': 'infores:rtx-kg2',
               'resource_role': 'biolink:aggregator_knowledge_source',
               'source_record_urls': None,
               'upstream_resource_ids': ['infores:drugbank']},
              {'resource_id': 'infores:arax',
               'resource_role': 'biolink:aggregator_knowledge_source',
               'source_record_urls': None,
               'upstream_resource_ids': ['infores:rtx-kg2']}],

and I suppose there is no longer a description property, so perhaps we can't include a message about ARAXInfer extracting the edge from KG2 anyway, so we can't make it obvious that the edge was extracted by ARAXInfer. unless ARAXInfer had its own infores curie, so we could add it as another source (between kg2 and arax)...

chunyuma commented 1 year ago

ok, thanks @amykglen! Considering there are still lots of changes in NewFmt/TRAPI 1.4, I will not do this fix in NewFmt/TRAPI 1.4 but will do it in the master branch.

edeutsch commented 1 year ago

Good idea to fix/develop as much as possible that is not related to TRAPI 1.4 breaking changes in master. Although I have not done it in a while, it is the goal to keep merging master into NewFmt regularly.

edeutsch commented 1 year ago

and I suppose there is no longer a description property, so perhaps we can't include a message about ARAXInfer extracting the edge from KG2 anyway, so we can't make it obvious that the edge was extracted by ARAXInfer. unless ARAXInfer had its own infores curie, so we could add it as another source (between kg2 and arax)...

I would say that the main inferred "treats" edge should have a primary_knowledge_source of arax. But all the supporting edges in the auxiliary graph should have their primary_knowledge_sources originating with drugbank, etc.

amykglen commented 1 year ago

I would say that the main inferred "treats" edge should have a primary_knowledge_source of arax. But all the supporting edges in the auxiliary graph should have their primary_knowledge_sources originating with drugbank, etc.

yes, I completely agree. I was trying to think of a way that allows us to, in addition, continue to show on a supporting KG2 edge that that edge was extracted from KG2 by ARAXInfer vs. retrieved via the KG2 API, because that's very helpful for debugging reasons. but it's ok if we no longer have a way of indicating that with TRAPI 1.4

chunyuma commented 1 year ago

Hi @amykglen and @edeutsch, I have added the missing EPC attributes to the ARAXInfer-generated KG2 edges and also changed the message to This edge was extracted from RTX-KG2.7.3c by ARAXInfer. I tested it locally and it works.

Eric, could you please roll out the master branch to the server so that we can test it on /beta? Thank you! By the way, I've updated the database ExplainableDTD_v1.0_KG2.8.0.db under /home/rtxconfig/KG2.8.0/ on the arax-databases.rtx.ai server for this goal.

edeutsch commented 1 year ago

Hi @chunyuma I have deployed master to all endpoints that follow master including /beta. Please go ahead and test.

 code/ARAX/ARAXQuery/ARAX_database_manager.py           |   6 +-
 code/ARAX/ARAXQuery/ARAX_decorator.py                  | 133 ++++++++++++++++++---------
 code/ARAX/ARAXQuery/ARAX_infer.py                      |   1 +
 code/ARAX/ARAXQuery/Infer/scripts/ExplianableDTD_db.py |   2 +-
 code/ARAX/ARAXQuery/Infer/scripts/build_mapping_db.py  | 304 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 code/ARAX/ARAXQuery/Infer/scripts/infer_utilities.py   | 260 +++++++++++++++++++++++++---------------------------
 code/ARAX/ARAXQuery/operation_to_ARAXi.py              |  17 +++-
 code/ARAX/test/test_ARAX_infer.py                      |   2 +-
 code/RTXConfiguration.py                               |   2 +-

I tried running the ARAX_database_manager but it did not seem to think anything needed to be updated. I don't know if that is a problem or expected.

amykglen commented 1 year ago

thanks @chunyuma! since you changed the database, you have to change the database name somehow (i.e., bump v1.0 --> v1.1) in order for the database manager to know to download the new database.

I'll go ahead and make this change now, and also upload the revised database to the ITRB SFTP server (after which I'll update config_dbs.json in master); at that point I think running the database manager should do the trick

amykglen commented 1 year ago

ok - I made those changes (renamed the updated db from v1.0 --> v1.1, uploaded it to the ITRB SFTP server, and changed config_dbs.json in master)

(I confirmed before doing those things that the ARAXInfer tests were failing in master, likely because the new code was using the old database; hopefully after this they pass)

so @edeutsch at this point if you roll out master and run the database manager, I think @chunyuma's fix should be deployed

amykglen commented 1 year ago

@chunyuma - using the new database, this test is failing in master:

FAILED test_ARAX_expand.py::test_xdtd_different_predicates - AssertionError: assert ('ERROR' == 'OK'
edeutsch commented 1 year ago

Hi @amykglen @chunyuma I was about to roll out master and manage databases. But should I wait until the above test is addressed? Or full steam ahead?

amykglen commented 1 year ago

I would definitely say roll it out, since prior to my changes a whole bunch of XDTD tests were failing. now it's just one!

edeutsch commented 1 year ago

okay, I have rolled out master to all the endpoints following master (devED, devLM, /test, /beta, kg2beta) and run the database manager on all as well. It seems to have updated to 1.1 okay:

explainable_dtd_db has a local version, '1.0_KG2.8.0', which does not match the remote version, '1.1_KG2.8.0'.

The example query seems to be working although calls to the service provider seem to be failing, although probably not our fault: image

edeutsch commented 1 year ago

@chunyuma will look into this test:

FAILED test_ARAX_expand.py::test_xdtd_different_predicates - AssertionError: assert ('ERROR' == 'OK'

Once all tests pass in master, then can close this issue

edeutsch commented 1 year ago

Confirmed that this is the only test failing now:

====================================== 1 failed, 116 passed, 125 skipped, 61904 warnings in 439.00s (0:07:18) =======================================
chunyuma commented 1 year ago

I've fixed this bug:

FAILED test_ARAX_expand.py::test_xdtd_different_predicates - AssertionError: assert ('ERROR' == 'OK'
amykglen commented 1 year ago

I think we can close this issue. XDTD KG2 edges now look great except for their sources, which I wrote up in #2033