RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
33 stars 20 forks

Stand up "creative DTD" endpoint #1818

Closed dkoslicki closed 2 years ago

dkoslicki commented 2 years ago

@chunyuma will create a class/method/script/function with the following structure. Input: a single disease CURIE and two integers M and N. Output: the top M drugs predicted to treat the disease, along with N explanation paths for each drug. This will use his reinforcement learning model.

@finnagin will stand up an endpoint on arax.ncats.io (with a name something like "CreativeDTD") that will:

Take as input a TRAPI query structured like `(x)-[r]-(y)`, where x is biolink:ChemicalEntity (or any of its descendants), r is any biolink relationship (the relationship type is effectively ignored), and y is biolink:DiseaseOrPhenotypicFeature (or any of its descendants). Everything else will be ignored (including the workflow portion of TRAPI).

As output, it will give a standard TRAPI response. The only nuance here is that the paths Chunyu's method returns can have variable length: anywhere from 1 to 3 hops. As such, the query graph associated with this may need to be something like:

(x)-[r_opt1]-(y)
(x)-[r_opt2]-()-[r_opt2]-(y)
(x)-[r_opt3]-()-[r_opt3]-()-[r_opt3]-(y)

or something similar to communicate 1 to 3 hops. Note that the current constraint of Expand requiring at least one non-optional edge shouldn't matter here, as Expand will not be used.
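
For concreteness, here is a minimal sketch of what the one-hop form of such a TRAPI query might look like; the qnode/qedge keys ("n0", "n1", "e0") are illustrative choices, not mandated by anything above, and Alkaptonuria is used as an example disease:

```python
# Hypothetical one-hop creative query graph. The predicate on e0 is omitted
# since the relationship type is effectively ignored by the endpoint.
creative_query = {
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {"categories": ["biolink:ChemicalEntity"]},
                "n1": {"ids": ["MONDO:0008753"],  # Alkaptonuria, as an example
                       "categories": ["biolink:DiseaseOrPhenotypicFeature"]},
            },
            "edges": {
                "e0": {"subject": "n0", "object": "n1"},
            },
        }
    }
}
```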

Timeline: Preliminary implementation by May 3, production ready by May 31. LMK if this timeline is reasonable (of course, the earlier the better, but there are other priorities each of you have as well).

edeutsch commented 2 years ago

I'm thinking that it would be easier to implement this with a suitable "knowledge type" flag/constraint (#1815) on [r] using the standard endpoint, instead of a separate endpoint.

dkoslicki commented 2 years ago

@edeutsch I was imagining a separate endpoint since the output format (and the input that triggers it) may change, and I figured it would be nice to keep a separate endpoint to fiddle with. I do like your idea of indicating what type of result you want via flags/constraints. I figure, though, that it will be much faster to implement this one type of query via an operation like infer() than to get everyone to agree on which flags/constraints to use.

dkoslicki commented 2 years ago

Oh, and another thing @edeutsch: we're going to host some clinical data working group workflows, so I thought we could use the normal route (and query graph interpreter) for that. I was also going to ask during the next AHM if people are cool with us listing this endpoint in the manuscript Chunyu is publishing. Basically for those who want to see it in action, but don't want to learn ARAXi/full TRAPI/Translator to use it.

finnagin commented 2 years ago

@edeutsch Fyi, I made the branch issue1818 for work on this.

chunyuma commented 2 years ago

Hi @finnagin, I've written some scripts to run my model and uploaded them, along with the necessary data, to the arax.ncats.io server: /data/code_creative_DTD_endpoint. Since some datasets are large, I didn't push them to this repo. There is a template called run_template.py within the scripts folder for running the scripts. Please let me know if you run into any problems.

Here are some statistics for generating the top 20 paths for the top 50 drugs predicted to treat Alkaptonuria (MONDO:0008753). The results are here.

Memory cost: 40-50 GB
Running time (CPU): 1158.4 s
Running time (GPU): 1021.7 s

finnagin commented 2 years ago

Thanks @chunyuma! Does the model usually take that much RAM to make predictions?

I don't think arax.ncats.io will have enough spare RAM to run that, so we might need to spin up another EC2 instance instead of running it as a separate service on arax.ncats.io.

chunyuma commented 2 years ago

@finnagin, right. It generally needs that much RAM to make predictions because it needs to read in some pre-trained embeddings and the whole RTX-KG2c. Perhaps we need thoughts from @edeutsch, @saramsey, or @dkoslicki on whether it is worth setting up another EC2 instance for this model.

edeutsch commented 2 years ago

Running it on arax.ncats.io itself seems risky because the container has a limit of around 55 GB.

finnagin commented 2 years ago

From meeting with Amy:

  1. Get preferred node IDs using this: https://github.com/RTXteam/RTX/blob/2f40f2dc3e1a292d3aad62b067282f7757e0cbe2/code/ARAX/NodeSynonymizer/node_synonymizer.py#L1926-L2099
  2. Example of the above here: https://github.com/RTXteam/RTX/blob/2f40f2dc3e1a292d3aad62b067282f7757e0cbe2/code/ARAX/ARAXQuery/Expand/expand_utilities.py#L312-L332
  3. Instantiate the class from trapi_querier.py with the KP name "infores:rtx-kg2".
  4. Use the method _get_arax_edge_key from the TRAPI querier class to get the correct edge key.
  5. Add the edge attribute specified here: https://github.com/RTXteam/RTX/blob/2f40f2dc3e1a292d3aad62b067282f7757e0cbe2/code/ARAX/ARAXQuery/Expand/kg2_querier.py#L226-L229
  6. After adding all edges to the knowledge graph, instantiate the ARAX decorator class.
  7. Pass the whole response and use the methods decorate_nodes and decorate_edges to get metadata. (This requires the edge attribute that specifies it came from KG2.)

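
The steps above might be stitched together roughly as follows. This is a stubbed sketch, not the real ARAX classes: the synonymizer/decorator callables and the edge-key format are stand-ins for the linked code.

```python
def build_kg_from_paths(raw_node_ids, raw_edges, get_preferred_curie,
                        decorate, knowledge_graph):
    """Stubbed outline of the meeting notes; all interfaces are assumptions."""
    # Steps 1-2: map raw identifiers to preferred CURIEs via the synonymizer
    preferred = {nid: get_preferred_curie(nid) for nid in raw_node_ids}
    # Steps 3-5: add edges keyed the way the TRAPI querier would key them
    # (stand-in for _get_arax_edge_key), tagged as coming from infores:rtx-kg2
    for subj, obj, predicate in raw_edges:
        edge_key = f"{preferred[subj]}--{predicate}--{preferred[obj]}"
        knowledge_graph["edges"][edge_key] = {
            "subject": preferred[subj],
            "object": preferred[obj],
            "predicate": predicate,
            "attributes": [{"attribute_type_id": "biolink:knowledge_source",
                            "value": "infores:rtx-kg2"}],
        }
    # Steps 6-7: decorate nodes/edges with metadata once everything is added
    decorate(knowledge_graph)
    return knowledge_graph
```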
dkoslicki commented 2 years ago

Issue: nodes are returned by name from the model. Solution: the model should return CURIEs and names instead. Also, instead of a string format encoding the paths, have them encoded via (V, E) (vertices and edges), so that properties (like CURIEs and names) can decorate the nodes and edges. Tagging @chunyuma so he's aware.
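
One possible shape for such a (V, E) path encoding; the non-disease CURIEs and the predicates below are placeholders, only the node/edge split is the point:

```python
# Hypothetical (V, E) encoding of one 2-hop explanation path: every node
# carries both a CURIE and a name, so downstream code can decorate them.
path = {
    "nodes": [
        {"curie": "CHEMBL.COMPOUND:PLACEHOLDER", "name": "some drug"},
        {"curie": "NCBIGene:PLACEHOLDER", "name": "some gene"},
        {"curie": "MONDO:0008753", "name": "Alkaptonuria"},
    ],
    "edges": [
        # edges reference node positions in the list above
        {"subject": 0, "object": 1, "predicate": "biolink:related_to"},
        {"subject": 1, "object": 2, "predicate": "biolink:related_to"},
    ],
}
```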

dkoslicki commented 2 years ago

See brain dump file for current state and what to do moving forward

amykglen commented 2 years ago

@dkoslicki - hmm, so just now when I use the KG2.7.6 config_local.json and run the latest code in the issue1818 branch, the database manager seems to download the XDTD database every time it runs a pytest query, even though I've verified it is indeed downloaded locally in the right place.

I wonder if the problem could be that there's no entry for explainable_dtd_db here? https://github.com/RTXteam/RTX/blob/b7107c38cbe0043b92a035188d306a8f3bbe192e/code/ARAX/ARAXQuery/ARAX_database_manager.py#L106-L147

dkoslicki commented 2 years ago

Odd, it didn't do that for me, but I've updated that portion of the code now

amykglen commented 2 years ago

seems like that did the trick!

dkoslicki commented 2 years ago

@chunyuma Take a look at 6e33cc1. Should line 290 be replaced with ExplainableDTD?

chunyuma commented 2 years ago

> @chunyuma Take a look at 6e33cc1. Should line 290 be replaced with ExplainableDTD?

Oh, right. That should be replaced with ExplainableDTD. My fault, I forgot to change it.

chunyuma commented 2 years ago

> @dkoslicki - hmm, so just now when I use the KG2.7.6 config_local.json and run the latest code in the issue1818 branch, the database manager seems to download the XDTD database every time it runs a pytest query, even though I've verified it is indeed downloaded locally in the right place.
>
> I wonder if the problem could be that there's no entry for explainable_dtd_db here?
>
> https://github.com/RTXteam/RTX/blob/b7107c38cbe0043b92a035188d306a8f3bbe192e/code/ARAX/ARAXQuery/ARAX_database_manager.py#L106-L147

@amykglen and @dkoslicki, I think we can't just change the ARAX_database_manager.py. We also need to change this line and make it compatible with ARAX_database_manager.py.

amykglen commented 2 years ago

just to clarify - should I be calling the ARAXInfer class from Expand? (vs. ExplainableDTD itself?)

and are we thinking ARAXInfer should do the work of deciding which infer action to apply to a given knowledge_type: inferred query edge? (right now we only have xdtd, but it looks like ARAXInfer is set up to handle more actions down the line?)

in other words, if Expand sees a query edge with knowledge_type: inferred, should it always send that to ARAXInfer, or should it send ARAXInfer only query edges with the biolink:treats predicate?
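
The second option above could look something like this; a sketch only, and the flat qedge dict layout here is an assumption, not Expand's actual data model:

```python
def should_send_to_infer(qedge: dict) -> bool:
    # Only inferred qedges with the biolink:treats predicate go to ARAXInfer;
    # everything else follows the normal Expand path.
    return (qedge.get("knowledge_type") == "inferred"
            and "biolink:treats" in (qedge.get("predicates") or []))
```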

dkoslicki commented 2 years ago

An ARAXi command was just the first thought we had for creative modes. Likely we will continue to do both moving forward, similar to how we can query COHD for expansion, or use it as an overlay. Does that answer your question, or were you thinking something else? I figured how you do it in code (calling the module or ARAXi) should be whatever makes sense to you

amykglen commented 2 years ago

ok. does the ExplainableDTD module have a way of returning TRAPI? if not, I think ARAXInfer is the better option.. otherwise I'd wind up duplicating much of the work in ARAXInfer I think?

dkoslicki commented 2 years ago

Apologies for the late reply; iirc, only ARAXInfer has the TRAPI conversion logic

amykglen commented 2 years ago

ok, got it. so I just pushed a rudimentary way of calling XDTD from Expand to the issue1818 branch - basically Expand just hands any one-hop queries where the qedge has biolink:treats and knowledge_type == inferred to ARAXInfer, and then returns what ARAXInfer does.

however, this is erroring out at the moment - you can see this by running pytest -vs test_ARAX_expand.py -k test_xdtd_expand in the issue1818 branch.

I'm having trouble figuring out what input_parameters ARAXInfer.apply() expects. it currently throws an uncaught error, complaining about a node_curie parameter. but when I try to pass a value in for node_curie (by setting it equal to the object node's list of curies), this warning is given:

2022-07-01T18:05:28.739332 WARNING: [] Supplied value MONDO:0004975 is not permitted. In action {'drug_treatment_graph_expansion'}, allowable values to node_curie are: [None]

and it seems like the returned KG is empty.

dkoslicki commented 2 years ago

Ok, so a couple things @amykglen: it looks like the curie on the subject node of the inferred edge needs to be passed to ARAXInfer. Something like:

    infer_input_parameters = {"action": "drug_treatment_graph_expansion", 'node_curie': subject_curie}

but I'm still figuring out how to get the subject_curie in ARAX_expand. Second, even after hard-coding a CURIE in, I'm getting a weird unpacking issue:

    ERROR: [UncaughtARAXiError] An uncaught error occurred: cannot unpack non-iterable ARAXResponse object:
    Traceback (most recent call last):
      File "/home/dkoslicki/Desktop/RTX/code/ARAX/test/../ARAXQuery/ARAX_query.py", line 730, in execute_processing_plan
        expander.apply(response, action['parameters'], mode=mode)
      File "/home/dkoslicki/Desktop/RTX/code/ARAX/test/../ARAXQuery/ARAX_expander.py", line 250, in apply
        infer_response = inferer.apply(response, infer_input_parameters)
      File "/home/dkoslicki/Desktop/RTX/code/ARAX/test/../ARAXQuery/ARAX_infer.py", line 215, in apply
        getattr(self, '_' + self.__class__.__name__ + '__' + parameters['action'])()  # thank you https://stackoverflow.com/questions/11649848/call-methods-by-string
      File "/home/dkoslicki/Desktop/RTX/code/ARAX/test/../ARAXQuery/ARAX_infer.py", line 304, in __drug_treatment_graph_expansion
        self.response, self.kedge_global_iter, self.qedge_global_iter, self.qnode_global_iter, self.option_global_iter = iu.genrete_treat_subgraphs(self.response, top_drugs, top_paths, self.kedge_global_iter, self.qedge_global_iter, self.qnode_global_iter, self.option_global_iter)
    TypeError: cannot unpack non-iterable ARAXResponse object

Is expand using multiprocessing? I've seen errors like this come up before when trying to multiprocess non-pickleable items.
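
One way to pull the subject CURIE out of the query graph to build those parameters; a sketch only, assuming the inferred qedge's subject qnode is pinned with an ids list (the helper name is hypothetical):

```python
def build_infer_parameters(query_graph: dict, inferred_qedge_key: str) -> dict:
    # Look up the subject qnode of the inferred qedge and take its pinned CURIE
    qedge = query_graph["edges"][inferred_qedge_key]
    subject_qnode = query_graph["nodes"][qedge["subject"]]
    subject_curie = subject_qnode["ids"][0]  # assumes exactly one pinned id
    return {"action": "drug_treatment_graph_expansion",
            "node_curie": subject_curie}
```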

dkoslicki commented 2 years ago

And the reason why I think it might be something expand specific, is that the pytest -v test_ARAX_infer.py test passes

dkoslicki commented 2 years ago

Multiprocessing may be a red herring: it looks like Finn's code in infer_utilities.py expects the query graph to be empty, as it populates the QG here. Looks like we will need to either a) pass the query edge and nodes that have the inferred property on them to infer_utilities so it knows where to do its edits, or b) not edit the QG and just be fine with the results not matching the QG. I figure option b) may cause issues with resultify if a QG is given with more edges than the inferred one.

dkoslicki commented 2 years ago

@amykglen I think I've fixed everything, as all tests appear to be passing now. I had to do some jiggering with node categories, which nodes and edges are marked as filled, and inserting the optional_group_ids in their proper place if a QG already exists.

LMK if things look good to you

amykglen commented 2 years ago

awesome, yep, things are looking good to me!

thanks for figuring out the node_curie thing. maybe eventually I'll add a wrapper function somewhere so that the interface around ARAXInfer/XDTD is more convenient for Expand (e.g., ideally Expand could just pass in a QG, like it does for other KPs).

I'm planning to work on handling multi-qedge inferred queries over the next few days, but it may take me a little time since that makes things quite a bit more complex in Expand (having to merge answers/QGs, etc.). I'm thinking I'll do that in a branch off of issue1818 so that we can still merge issue1818 into master whenever we're ready and have a functioning creative mode, at least for simple single-qedge inferred queries.

random question: does XDTD do any subclass_of reasoning? so if you asked for treatments for Adams-Oliver syndrome it would also give you treatments for Adams-Oliver syndrome 2?

dkoslicki commented 2 years ago

Sounds good; mixed inferred and lookup edges is ahead of the curve as only the template (single inferred edge) is required by Tuesday, so there's plenty of time to do mixed knowledge types

Re: subclass reasoning, no, it only does the inference for the exact curie supplied.
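
Since XDTD handles only the exact CURIE supplied, a caller wanting subclass coverage would have to fan out beforehand. A hypothetical sketch, where both callables are stand-ins (nothing like this exists in the repo yet):

```python
def infer_with_subclasses(disease_curie, get_subclass_curies, run_xdtd):
    # Run XDTD once for the query CURIE and once per subclass CURIE,
    # since the model itself does no subclass_of reasoning.
    results = {}
    for curie in [disease_curie, *get_subclass_curies(disease_curie)]:
        results[curie] = run_xdtd(curie)
    return results
```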

edeutsch commented 2 years ago

So I am ready/trying to deploy this for dev/testing, but seeing this error:

  File "/mnt/data/orangeboard/devED/RTX/code/RTXConfiguration.py", line 162, in live
    self.explainable_dtd_db_host = self.config["Global"]["explainable_dtd_db"]["host"]
KeyError: 'explainable_dtd_db'

I'm hoping @amykglen or someone can update the central configv2? and that should fix it?

dkoslicki commented 2 years ago

Yes, an update to configv2 should fix it. I just don't know which machine has the "authoritative" version of it. Perhaps @amykglen knows (and I'd like to know too!)

amykglen commented 2 years ago

the authoritative configv2.json lives on araxconfig.rtx.ai - I'll put the new configv2.json that David shared in slack on there now

amykglen commented 2 years ago

ok, the authoritative configv2.json has been updated now - so if you delete yours to force a redownload it should work

amykglen commented 2 years ago

note that when rolling out to prod we'll have to edit the config_local.json that prod uses

edeutsch commented 2 years ago

okay, we are deployed to /test and /beta. The endpoints pass our basic test query. I have not tested creative mode. Please test if you have time. I am slammed for the rest of the day.

dkoslicki commented 2 years ago

Tested and it's looking good! I will want to make some changes later (à la #1862), but I think it's fine to roll out to all endpoints.

edeutsch commented 2 years ago

okay, will do, even production?

edeutsch commented 2 years ago

> note that when rolling out to prod we'll have to edit the config_local.json that prod uses

Can we do this together at the hackathon tomorrow?

dkoslicki commented 2 years ago

Yup, considering the UI team is expecting creative results, we should roll it out everywhere

amykglen commented 2 years ago

sure, sounds good to me!

dkoslicki commented 2 years ago

Deployed everywhere, so closing