Closed amykglen closed 2 years ago
alright, the synonymizer+KG2c build is ongoing on buildkg2c.rtx.ai (in a screen session, from the kg2integration branch).
to kick off the build, all I did was 1) update (locally) RTX/code/kg2c/kg2c_config.json to look like this:
{
  "kg2pre_version": "2.7.4",
  "kg2pre_neo4j_endpoint": "kg2endpoint3.rtx.ai",
  "biolink_version": "2.2.6",
  "upload_to_arax.ncats.io": true,
  "upload_directory": "/data/orangeboard/databases/KG2.7.4",
  "synonymizer": {
    "build": true,
    "name": "node_synonymizer_v1.0_KG2.7.4.sqlite"
  },
  "kg2c": {
    "build": true,
    "use_nlp_to_choose_descriptions": true,
    "upload_to_s3": true,
    "start_from_kg2c_json": false,
    "use_local_kg2pre_tsvs": false
  }
}
and 2) run:
python3 RTX/code/kg2c/build_kg2c.py
(note: I made sure to create an (empty) /data/orangeboard/databases/KG2.7.4 directory on arax.ncats.io before starting the build)
if all goes well the build should be done this evening (at which point I'll take care of loading it into Plover)
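A pre-flight check on the config could catch a stale version string before a long build starts. The sketch below is only an illustration: the helper name check_kg2c_config and the rules it enforces (that the upload directory and synonymizer name embed the KG2pre version) are assumptions drawn from the example config above, not something build_kg2c.py is known to do.

```python
import json

# Hypothetical helper, not part of build_kg2c.py.
REQUIRED_KEYS = ("kg2pre_version", "kg2pre_neo4j_endpoint", "biolink_version",
                 "upload_directory", "synonymizer", "kg2c")


def check_kg2c_config(config: dict) -> list:
    """Return a list of human-readable problems found in the config."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    version = config.get("kg2pre_version", "")
    # Assumption: the upload directory ends with "KG<version>", as in the example.
    if version and not config.get("upload_directory", "").endswith(f"KG{version}"):
        problems.append("upload_directory does not end with KG<kg2pre_version>")
    # Assumption: the synonymizer file name mentions "KG<version>", as in the example.
    synonymizer_name = config.get("synonymizer", {}).get("name", "")
    if version and f"KG{version}" not in synonymizer_name:
        problems.append("synonymizer name does not mention KG<kg2pre_version>")
    return problems


example = json.loads("""
{
  "kg2pre_version": "2.7.4",
  "kg2pre_neo4j_endpoint": "kg2endpoint3.rtx.ai",
  "biolink_version": "2.2.6",
  "upload_to_arax.ncats.io": true,
  "upload_directory": "/data/orangeboard/databases/KG2.7.4",
  "synonymizer": {"build": true, "name": "node_synonymizer_v1.0_KG2.7.4.sqlite"},
  "kg2c": {"build": true, "use_nlp_to_choose_descriptions": true,
           "upload_to_s3": true, "start_from_kg2c_json": false,
           "use_local_kg2pre_tsvs": false}
}
""")
print(check_kg2c_config(example))
```

An empty list means none of the assumed checks failed; it says nothing about whether the build itself will succeed.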
Thank you!
Just an FYI that in KG2pre, the edge property formerly called relation is now called original_predicate, per a change in Biolink 2.2 from Biolink 2.1. Not sure if this will break anything in the KG2c build process. Details in RTX-KG2 issue 165.
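Code that reads KG2pre edges could tolerate both property names during the transition. This is a minimal sketch, assuming edges are available as plain dicts; the function name is hypothetical, and whether the KG2c build actually reads this property is not stated in this thread.

```python
from typing import Optional


def get_original_predicate(edge: dict) -> Optional[str]:
    """Read the source predicate under the new name, falling back to the old one."""
    if "original_predicate" in edge:
        return edge["original_predicate"]
    # Older KG2pre builds (Biolink 2.1) used "relation" for the same value.
    return edge.get("relation")


# Toy edges for illustration only.
new_style_edge = {"subject": "A:1", "object": "B:2", "original_predicate": "RO:0002434"}
old_style_edge = {"subject": "A:1", "object": "B:2", "relation": "RO:0002434"}
print(get_original_predicate(new_style_edge), get_original_predicate(old_style_edge))
```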
the synonymizer build completed successfully and things seem fine so far with that, but the KG2c build errored out while using BiolinkHelper due to some strange mixin predicates in 2.2.6. fixed that issue and resumed the build.
alright, the new KG2c is ready in Neo4j: http://kg2-7-4c.rtx.ai:7474/browser/
everything looks fine so far on spot checking. upload to PloverDB is in progress.
KG2c has been loaded into Plover and all necessary downstream databases have been rebuilt. will test everything together tomorrow morning.
actually, ran the ARAX test suite tonight and all fast tests passed on the first try! I'm impressed. I'll do some deeper testing (e.g., Expand's slow tests) tomorrow morning.
This is great! Thank you @amykglen !!
Hi @finnagin, do we need slim databases for Travis at this point? Because of limited time, I have only built the refreshed database so far; the full databases might take longer. I just want to check whether you also need a slim version of the refreshed database. Thanks!
@chunyuma We do still need those, but since they are only used for testing and not by the actual system, I don't think the slim databases need to make the deadline. Though @amykglen, we will also need to come up with a way to generate slim KG2c and node synonymizer versions if we want Travis to run.
ah, yeah, I dropped the ball on the slim database thing. I added an agenda item for this week's AHM to touch base on that! (not a blocker for this KG2 rollout)
everything still looks good on further testing - one slow DTD expand test is failing (test_dtd_expand_2), though perhaps that would be fixed once the full DTD rebuild is done? maybe @chunyuma could take a look, but I don't think it's critical for the rollout...
Hi @amykglen, sorry for the late response. For test_dtd_expand_2, it seems the error comes from KG2c. This test case generates a neo4j query from its query graph, but that query returns nothing from the KG2c neo4j. Could you please help take a look?
Here is the neo4j query for this test:
MATCH (n0:`biolink:SmallMolecule` {id:'CHEMBL.COMPOUND:CHEMBL112'})-[e0:`['biolink:related_to']`]-(n1) WHERE (n1:`biolink:Disease` OR n1:`biolink:DiseaseOrPhenotypicFeature` OR n1:`biolink:PhenotypicFeature`) WITH collect(distinct n0) as nodes_n0, collect(distinct n1) as nodes_n1, collect(distinct e0{.*, id:ID(e0), n0:n0.id, n1:n1.id}) as edges_e0 RETURN nodes_n0, nodes_n1, edges_e0
It returns nothing.
@amykglen, I think I figured out the problem. It seems that the function _get_cypher_for_query_edge is now deprecated. It may be an old function that Expand used to build the neo4j query, and perhaps other functions in Expand now handle this. I'm modifying this function as a temporary fix. Could you please let me know where I can find the function that replaces it, so that we can keep everything consistent? Thanks!
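One observation about the query text itself: the relationship type on e0 reads as a Python list rendered into the string (['biolink:related_to']) rather than as a predicate name, and a type with that literal name would match no relationships. That is a reading of the query, not a confirmed diagnosis. A minimal sketch of rendering a predicate list into a Cypher relationship pattern follows; the function name is hypothetical and this is not the actual _get_cypher_for_query_edge.

```python
def cypher_relationship_pattern(edge_key: str, predicates: list) -> str:
    """Render e.g. [e0:`biolink:treats`|`biolink:related_to`] from a predicate list."""
    if not predicates:
        # No predicate constraint: match relationships of any type.
        return f"[{edge_key}]"
    # Backticks are needed because Biolink predicates contain a colon.
    types = "|".join(f"`{predicate}`" for predicate in predicates)
    return f"[{edge_key}:{types}]"


print(cypher_relationship_pattern("e0", ["biolink:related_to"]))
```

Whether biolink:related_to should instead be treated as "any predicate" (that is, left untyped) depends on how edges are typed in the KG2c neo4j, which this sketch does not settle.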
Should we add "slim database" to the KG2c template checklist? (maybe it's already on there, I didn't check).
yep, we have an item for slim databases already
the rest of Expand doesn't use neo4j at all anymore, so there is no current _get_cypher_for_query_edge function. what do you use neo4j for in DTD? would it maybe be possible to query Plover instead?
@amykglen, the DTD querier has two modes, "fast mode" and "slow mode". "Fast mode" queries the DTD database directly, while "slow mode" calls the DTD model and computes the drug repurposing probability on the fly. So in "slow mode" we need the _get_cypher_for_query_edge function to query the possible subject nodes or the possible object nodes based on the query_graph. Is it possible to query Plover for this?
Take test_dtd_expand_2 as an example:
"add_qnode(name=acetaminophen, key=n0)",
"add_qnode(categories=biolink:Disease, key=n1)",
"add_qedge(subject=n0, object=n1, key=e0)",
"expand(edge_key=e0, kp=DTD, DTD_threshold=0, DTD_slow_mode=True)",
"return(message=true, store=false)"
the "slow mode" needs to know what "n1" nodes should be paired with acetaminophen to compute their probabilities by using the model.
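The pairing step described above can be sketched roughly as follows. Everything here is a stand-in: score_candidates and toy_predict are hypothetical names, the probabilities are made up, and the real DTD model interface is not shown in this thread. Whether the threshold comparison is strict is also an assumption.

```python
from typing import Callable, Dict, Iterable


def score_candidates(drug_id: str,
                     candidate_ids: Iterable[str],
                     predict: Callable[[str, str], float],
                     threshold: float = 0.0) -> Dict[str, float]:
    """Pair one drug with each candidate node and keep scores at or above threshold."""
    scores = {}
    for candidate_id in candidate_ids:
        probability = predict(drug_id, candidate_id)
        if probability >= threshold:  # assumption: non-strict comparison
            scores[candidate_id] = probability
    return scores


# Made-up probabilities standing in for the DTD model.
TOY_SCORES = {("CHEMBL.COMPOUND:CHEMBL112", "MONDO:0005010"): 0.75,
              ("CHEMBL.COMPOUND:CHEMBL112", "MONDO:0001246"): 0.25}


def toy_predict(drug_id: str, disease_id: str) -> float:
    return TOY_SCORES.get((drug_id, disease_id), 0.0)


result = score_candidates("CHEMBL.COMPOUND:CHEMBL112",
                          ["MONDO:0005010", "MONDO:0001246"],
                          toy_predict, threshold=0.5)
print(result)
```

The open question in the thread is where candidate_ids comes from, which is what the one-hop lookup is for.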
so you mean you need to run the one-hop query on KG2 to get diseases connected to acetaminophen?
you can do that with Plover like so:
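import requests
from RTXConfiguration import RTXConfiguration  # lives in RTX/code, which must be on sys.path

# One-hop query graph: acetaminophen (n00) to any disease (n01)
trapi_qg = {
    "edges": {
        "e00": {
            "subject": "n00",
            "object": "n01",
        }
    },
    "nodes": {
        "n00": {
            "ids": ["CHEMBL.COMPOUND:CHEMBL112"]
        },
        "n01": {
            "categories": ["biolink:Disease"]
        }
    }
}
rtxc = RTXConfiguration()
response = requests.post(f"{rtxc.plover_url}/query", json=trapi_qg, headers={'accept': 'application/json'})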
by default it will return answers in this format (only including node/edge IDs):
{
  "edges": {
    "e00": [
      19308544,
      26624039,
      11296815,
      12484663,
      15564856,
      9568317,
      12222530,
      23814212,
      12222534,
      11395143,
      11214936,
      16932955,
      ...
    ]
  },
  "nodes": {
    "n00": [
      "CHEMBL.COMPOUND:CHEMBL112"
    ],
    "n01": [
      "MESH:D014886",
      "MONDO:0009323",
      "MONDO:0020722",
      "MONDO:0001384",
      "MONDO:0003406",
      "UMLS:C0429001",
      "MESH:D010539",
      "MONDO:0007254",
      "MESH:D020078",
      "CHEMBL.COMPOUND:CHEMBL326958",
      "UMLS:C0375314",
      "MONDO:0100053",
      "MONDO:0005812",
      "MONDO:0005010",
      "MONDO:0001246",
      "MONDO:0001046",
      "MESH:D048949",
      "MONDO:0002334",
      "MONDO:0004553",
      "MONDO:0007186",
      "MESH:D014950",
      "MONDO:0100192",
      "UMLS:C0442797",
      "UMLS:C0231225",
      "MONDO:0005101",
      "MONDO:0010667",
      "MONDO:0001156",
      ...
    ]
  }
}
but if you want more info included in the results you can add "include_metadata": True to your query graph
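For the DTD use case, only the n01 IDs matter. A minimal sketch of switching on metadata and pulling node IDs out of a response is below. The helper names are hypothetical, and the shape of the response when include_metadata is set is an assumption (a mapping keyed by node ID); the IDs-only shape is the one shown above.

```python
import copy


def with_metadata(trapi_qg: dict) -> dict:
    """Return a copy of the query graph with include_metadata switched on."""
    query_graph = copy.deepcopy(trapi_qg)
    query_graph["include_metadata"] = True
    return query_graph


def node_ids_for_qnode(plover_response: dict, qnode_key: str) -> list:
    """Return the node IDs answering one qnode, or an empty list if absent."""
    answers = plover_response.get("nodes", {}).get(qnode_key, [])
    # list() yields the IDs for a list of IDs, and the keys for a mapping
    # keyed by node ID (the assumed shape when metadata is included).
    return list(answers)


# Truncated copy of the IDs-only response shown above.
sample_response = {
    "edges": {"e00": [19308544, 26624039]},
    "nodes": {"n00": ["CHEMBL.COMPOUND:CHEMBL112"],
              "n01": ["MESH:D014886", "MONDO:0009323"]},
}
print(node_ids_for_qnode(sample_response, "n01"))
```

In practice sample_response would be response.json() from the requests.post call.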
Thanks @amykglen. If I only want all nodes with categories 'biolink:Disease', 'biolink:DiseaseOrPhenotypicFeature', or 'biolink:PhenotypicFeature', I think Plover can also do this if I modify the trapi_qg like:
trapi_qg = {
    "nodes": {
        "n00": {
            "categories": ['biolink:Disease', 'biolink:DiseaseOrPhenotypicFeature', 'biolink:PhenotypicFeature']
        }
    }
}
rtxc = RTXConfiguration()
response = requests.post(f"{rtxc.plover_url}/query", json=trapi_qg, headers={'accept': 'application/json'})
Is it right?
yep!
or wait, so you're trying to get all disease-like nodes in KG2? (not just connected to acetaminophen?) not sure whether that would work...
@amykglen, yes, I'm thinking that the DTD expand should be independent of RTX-KG2c, right? This means that some edges generated by DTD expand, based on the DTD model with probability > a certain threshold, might exist in RTX-KG2c. So back to the acetaminophen case, the DTD expand should consider all disease-like nodes in KG2, then use the DTD model to calculate the probabilities, and then expand the edges, right?
ah, ok, I didn't realize you're looking up all disease-like nodes. yeah, that won't work with Plover.
so you really only need to get the list of all disease-like node IDs once, right? (for each KG2 version.) not on every query?
could you do that during building of DTD? (and then just store the list of IDs in one of your DTD databases, or a separate database if you preferred, which could be added to the database manager)
Actually, not just all disease-like node IDs. The reason it is a list of all disease-like node IDs is because in this query, we try to expand n0:'acetaminophen' to n1:'disease-like' nodes:
"add_qnode(name=acetaminophen, key=n0)",
"add_qnode(categories=biolink:Disease, key=n1)",
"add_qedge(subject=n0, object=n1, key=e0)",
"expand(edge_key=e0, kp=DTD, DTD_threshold=0, DTD_slow_mode=True)",
"return(message=true, store=false)"
Perhaps in other queries, people are interested in expanding acetaminophen to other categories via DTD expand. (Note that currently we don't check the category provided by the user in slow mode. In other words, users can provide any kind of category via DTD expand.) So actually, we need a function that can extract all nodes corresponding to the categories provided by the user. Do you think that is feasible?
I think we can pre-store the ID list corresponding to different categories. However, I'm not sure if we need to consider the hierarchical relation. For example, if the user sets add_qnode(categories=biolink:ChemicalEntity, key=n1), we need to use all nodes corresponding to biolink:ChemicalEntity and also include all its children. I think this is what Expand is currently doing in RTX-KG2, right?
I think that's right that you would want to do hierarchical reasoning for these category ID lists. if you query the KG2c neo4j by label (e.g., match (n:`biolink:ChemicalEntity`) return n.id), that reasoning should be done for you (since nodes are labeled with their direct categories as well as the ancestors of those categories).
(Plover can't help here since it doesn't currently allow queries where no qnode is "pinned")
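The pre-stored category lists discussed above could be built once per KG2 version along these lines. This is a sketch under assumptions: nodes are represented as dicts with a hypothetical "labels" field holding the direct category plus its ancestors (mirroring how KG2c neo4j nodes are labeled), and the toy nodes and their labels are illustrative only.

```python
from collections import defaultdict


def build_category_index(nodes: list) -> dict:
    """Map each category label to the sorted IDs of all nodes carrying that label."""
    index = defaultdict(set)
    for node in nodes:
        # Ancestor categories are already among the labels, so a lookup by
        # an ancestor category includes nodes of its descendant categories.
        for label in node["labels"]:
            index[label].add(node["id"])
    return {label: sorted(ids) for label, ids in index.items()}


# Toy nodes for illustration; real label sets come from the KG2c build.
toy_nodes = [
    {"id": "CHEMBL.COMPOUND:CHEMBL112",
     "labels": ["biolink:SmallMolecule", "biolink:ChemicalEntity", "biolink:NamedThing"]},
    {"id": "MONDO:0005010",
     "labels": ["biolink:Disease", "biolink:DiseaseOrPhenotypicFeature", "biolink:NamedThing"]},
    {"id": "UMLS:C0231225",
     "labels": ["biolink:PhenotypicFeature", "biolink:DiseaseOrPhenotypicFeature", "biolink:NamedThing"]},
]
category_index = build_category_index(toy_nodes)
print(category_index["biolink:DiseaseOrPhenotypicFeature"])
```

The resulting mapping could then be written to whichever DTD database holds it, so slow mode never needs a live neo4j query.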
hey @finnagin - have you updated the test triples (for the NCATS repo) for KG2.7.4 yet?
The pull request for updating the test triples is now in the NCATSTranslator/testing repo.
Closing as the smart api registry looks to be updated and everything else not marked as able to be skipped has been checked.
1. Build and load KG2c:
... kg2integration branch)
... kg2integration branch)
... kg2c_lite_2.X.Y.json.gz file to the translator-lfs-artifacts repo
2. Rebuild downstream databases:
Copies of all of these should be put in /data/orangeboard/databases/KG2.X.Y on arax.ncats.io.
... config_local.json, since we want it to be used over configv2.json during testing
... /home/ubuntu/kg2-build/kg2c.dump)
NOTE: As databases are rebuilt, the new copy of config_local.json will need to be updated to point to their new paths. However, if the rollout of KG2 has already occurred, then you should update the master configv2.json directly.
3. Update the ARAX codebase:
Associated code changes should go in the kg2integration branch.
... BiolinkHelper uses the right version)
... config_local.json - must locally set force_local = True in ARAX_expander.py to avoid using the old KG2 API)
4. Do the rollout:
... master into kg2integration
... kg2integration into master
... config_local.json the new master config file on araxconfig.rtx.ai (rename it to configv2.json)
... master out to the various arax.ncats.io endpoints and delete their configv2.jsons
5. Final items/clean up:
... config_local.json on arax.ncats.io to config_local.json_FROZEN_DO-NOT-EDIT-FURTHER (any additional edits to the config file should be made directly to the master configv2.json on araxconfig.rtx.ai going forward)
... kg_config.json in the main branch of the Plover repo to point to the new kg2c_lite_2.X.Y.json.gz file (push this change)
... config_local.json that points to it and locally set force_local = True in Expand
... configv2.json on araxconfig.rtx.ai to point to this Plover endpoint (used by beta endpoints)
... kg2 endpoint's configv2.json to force it to download the new copy and then verify it's working correctly by running a query