RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Roll out KG2.7.4 (Biolink 2.2.6) #1728

Closed: amykglen closed this issue 2 years ago

amykglen commented 2 years ago
1. Build and load KG2c:
2. Rebuild downstream databases:

Copies of all of these should be put in /data/orangeboard/databases/KG2.X.Y on arax.ncats.io.

NOTE: As databases are rebuilt, the new copy of config_local.json will need to be updated to point to their new paths. However, if the rollout of KG2 has already occurred, then you should update the master configv2.json directly.

3. Update the ARAX codebase:

Associated code changes should go in the kg2integration branch.

4. Do the rollout:
5. Final items/clean up:
amykglen commented 2 years ago

alright, the synonymizer+KG2c build is ongoing on buildkg2c.rtx.ai (in a screen session, from the kg2integration branch).

to kick off the build, all I did was 1) update (locally) RTX/code/kg2c/kg2c_config.json to look like this:

{
  "kg2pre_version": "2.7.4",
  "kg2pre_neo4j_endpoint": "kg2endpoint3.rtx.ai",
  "biolink_version": "2.2.6",
  "upload_to_arax.ncats.io": true,
  "upload_directory": "/data/orangeboard/databases/KG2.7.4",
  "synonymizer": {
    "build": true,
    "name": "node_synonymizer_v1.0_KG2.7.4.sqlite"
  },
  "kg2c": {
    "build": true,
    "use_nlp_to_choose_descriptions": true,
    "upload_to_s3": true,
    "start_from_kg2c_json": false,
    "use_local_kg2pre_tsvs": false
  }
}

and 2) run:

python3 RTX/code/kg2c/build_kg2c.py

(note: I made sure to create an (empty) /data/orangeboard/databases/KG2.7.4 directory on arax.ncats.io before starting the build)

amykglen commented 2 years ago

if all goes well the build should be done this evening (at which point I'll take care of loading it into Plover)

saramsey commented 2 years ago

Thank you!

saramsey commented 2 years ago

Just an FYI that in KG2pre, the edge property formerly called relation is now called original_predicate, per a change from Biolink 2.1 to Biolink 2.2. Not sure if this will break anything in the KG2c build process. Details in RTX-KG2 issue 165.

amykglen commented 2 years ago

the synonymizer build completed successfully and things seem fine so far with that, but the KG2c build errored out while using BiolinkHelper due to some strange mixin predicates in 2.2.6. fixed that issue and resumed the build.

amykglen commented 2 years ago

alright, the new KG2c is ready in Neo4j: http://kg2-7-4c.rtx.ai:7474/browser/

everything looks fine so far on spot checking. upload to PloverDB is in progress.

amykglen commented 2 years ago

KG2c has been loaded into Plover and all necessary downstream databases have been rebuilt. will test everything together tomorrow morning.

amykglen commented 2 years ago

actually, ran the ARAX test suite tonight and all fast tests passed on the first try! I'm impressed.

I'll do some deeper testing (e.g., Expand's slow tests) tomorrow morning.

saramsey commented 2 years ago

This is great! Thank you @amykglen !!

chunyuma commented 2 years ago

Hi @finnagin, do we need slim databases for Travis at this point? Currently, due to the limited time, I have only built the refreshed database; the full databases might need more time. Just want to see if you also need a slim version of the refreshed database. Thanks!

finnagin commented 2 years ago

@chunyuma We do still need those, but since they're only used for testing and not the actual system, I don't think the slim databases need to make the deadline. Though @amykglen, we will also need to come up with a way to generate slim kg2c and node synonymizer versions if we want Travis to run.

amykglen commented 2 years ago

ah, yeah, I dropped the ball on the slim database thing. I added an agenda item for this week's AHM to touch base on that! (not a blocker for this KG2 rollout)

amykglen commented 2 years ago

everything still looks good on further testing - one slow DTD expand test is failing (test_dtd_expand_2), though perhaps that would be fixed once the full DTD rebuild is done? maybe @chunyuma could take a look, but I don't think it's critical for the rollout...

chunyuma commented 2 years ago

Hi @amykglen, sorry for the late response. For test_dtd_expand_2, it seems like the error comes from KG2c. Based on this test case's query, the code generates a Neo4j query, but that query doesn't return anything from the KG2c Neo4j. Could you please help take a look?

Here is the neo4j query for this test:

MATCH (n0:`biolink:SmallMolecule` {id:'CHEMBL.COMPOUND:CHEMBL112'})-[e0:`['biolink:related_to']`]-(n1) WHERE (n1:`biolink:Disease` OR n1:`biolink:DiseaseOrPhenotypicFeature` OR n1:`biolink:PhenotypicFeature`) WITH collect(distinct n0) as nodes_n0, collect(distinct n1) as nodes_n1, collect(distinct e0{.*, id:ID(e0), n0:n0.id, n1:n1.id}) as edges_e0 RETURN nodes_n0, nodes_n1, edges_e0

It returns nothing (screenshot of the empty Neo4j browser result omitted).

chunyuma commented 2 years ago

@amykglen, I think I figured out the problem. It seems like the function _get_cypher_for_query_edge is deprecated now. This might be an old function that Expand used to create the Neo4j query. Perhaps there are now other functions somewhere in Expand that handle this. I'm modifying this function to solve the error temporarily. Could you please let me know where I can find the new function that replaces it, so that we can keep everything consistent? Thanks!

saramsey commented 2 years ago

Should we add "slim database" to the KG2c template checklist? (maybe it's already on there, I didn't check).

amykglen commented 2 years ago

Should we add "slim database" to the KG2c template checklist? (maybe it's already on there, I didn't check).

yep, we have an item for slim databases already

amykglen commented 2 years ago

@amykglen, I think I figured out the problem. It seems like the function _get_cypher_for_query_edge is deprecated now. This might be an old function that Expand used to create the Neo4j query. Perhaps there are now other functions somewhere in Expand that handle this. I'm modifying this function to solve the error temporarily. Could you please let me know where I can find the new function that replaces it, so that we can keep everything consistent? Thanks!

the rest of Expand doesn't use neo4j at all anymore, so there is no current _get_cypher_for_query_edge function. what do you use neo4j for in DTD? would it maybe be possible to query Plover instead?

chunyuma commented 2 years ago

the rest of Expand doesn't use neo4j at all anymore, so there is no current _get_cypher_for_query_edge function. what do you use neo4j for in DTD? would it maybe be possible to query Plover instead?

@amykglen, the DTD querier has two modes, "fast mode" and "slow mode". "Fast mode" queries the DTD database directly, while "slow mode" calls the DTD model and computes the drug repurposing probability on the fly. So when we use "slow mode", we need the _get_cypher_for_query_edge function to query the possible subject or object nodes based on the query_graph. Is it possible to query Plover for this purpose?

Take test_dtd_expand_2 as example:

"add_qnode(name=acetaminophen, key=n0)",
"add_qnode(categories=biolink:Disease, key=n1)",
"add_qedge(subject=n0, object=n1, key=e0)",
"expand(edge_key=e0, kp=DTD, DTD_threshold=0, DTD_slow_mode=True)",
"return(message=true, store=false)"

the "slow mode" needs to know what "n1" nodes should be paired with acetaminophen to compute their probabilities by using the model.

amykglen commented 2 years ago

so you mean you need to run the one-hop query on KG2 to get diseases connected to acetaminophen?

you can do that with Plover like so:

import requests
from RTXConfiguration import RTXConfiguration  # from the RTX codebase; assumes RTX/code is on the Python path

trapi_qg = {
    "edges": {
        "e00": {
            "subject": "n00",
            "object": "n01",
        }
    },
    "nodes": {
        "n00": {
            "ids": ["CHEMBL.COMPOUND:CHEMBL112"]
        },
        "n01": {
            "categories": ["biolink:Disease"]
        }
    }
}
rtxc = RTXConfiguration()
response = requests.post(f"{rtxc.plover_url}/query", json=trapi_qg, headers={'accept': 'application/json'})

by default it will return answers in this format (only including node/edge IDs):

{
   "edges":{
      "e00":[
         19308544,
         26624039,
         11296815,
         12484663,
         15564856,
         9568317,
         12222530,
         23814212,
         12222534,
         11395143,
         11214936,
         16932955,
          ...
      ]
   },
   "nodes":{
      "n00":[
         "CHEMBL.COMPOUND:CHEMBL112"
      ],
      "n01":[
         "MESH:D014886",
         "MONDO:0009323",
         "MONDO:0020722",
         "MONDO:0001384",
         "MONDO:0003406",
         "UMLS:C0429001",
         "MESH:D010539",
         "MONDO:0007254",
         "MESH:D020078",
         "CHEMBL.COMPOUND:CHEMBL326958",
         "UMLS:C0375314",
         "MONDO:0100053",
         "MONDO:0005812",
         "MONDO:0005010",
         "MONDO:0001246",
         "MONDO:0001046",
         "MESH:D048949",
         "MONDO:0002334",
         "MONDO:0004553",
         "MONDO:0007186",
         "MESH:D014950",
         "MONDO:0100192",
         "UMLS:C0442797",
         "UMLS:C0231225",
         "MONDO:0005101",
         "MONDO:0010667",
         "MONDO:0001156",
           ...
      ]
   }
}

but if you want more info included in the results you can add "include_metadata": True to your query graph
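
e.g., the same query as above with that flag added (just a sketch; I haven't run this exact snippet):

import requests
from RTXConfiguration import RTXConfiguration  # as in the snippet above

# Same one-hop query as above, but asking Plover to include node/edge metadata in the response
trapi_qg = {
    "edges": {
        "e00": {
            "subject": "n00",
            "object": "n01",
        }
    },
    "nodes": {
        "n00": {
            "ids": ["CHEMBL.COMPOUND:CHEMBL112"]
        },
        "n01": {
            "categories": ["biolink:Disease"]
        }
    },
    "include_metadata": True  # ask for more than just node/edge IDs
}
rtxc = RTXConfiguration()
response = requests.post(f"{rtxc.plover_url}/query", json=trapi_qg, headers={'accept': 'application/json'})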

chunyuma commented 2 years ago

Thanks @amykglen. If I only want all nodes with categories 'biolink:Disease', 'biolink:DiseaseOrPhenotypicFeature', or 'biolink:PhenotypicFeature', I think Plover can also do this by modifying the trapi_qg like:

import requests
from RTXConfiguration import RTXConfiguration  # as in the snippet above

trapi_qg = {
    "nodes": {
        "n00": {
            "categories": ['biolink:Disease', 'biolink:DiseaseOrPhenotypicFeature', 'biolink:PhenotypicFeature']
        }
    }
}
rtxc = RTXConfiguration()
response = requests.post(f"{rtxc.plover_url}/query", json=trapi_qg, headers={'accept': 'application/json'})

Is that right?

amykglen commented 2 years ago

yep!

amykglen commented 2 years ago

or wait, so you're trying to get all disease-like nodes in KG2? (not just connected to acetaminophen?) not sure whether that would work...

chunyuma commented 2 years ago

@amykglen, yes, I'm thinking that DTD expand should be independent of RTX-KG2c, right? That means DTD expand generates edges from the DTD model with probability above a certain threshold, and those edges may or may not already exist in RTX-KG2c. So back to the acetaminophen case, DTD expand should consider all disease-like nodes in KG2, use the DTD model to calculate their probabilities, and then expand with those edges, right?

amykglen commented 2 years ago

ah, ok, I didn't realize you're looking up all disease-like nodes. yeah, that won't work with Plover.

so you really only need to get the list of all disease-like node IDs once, right? (for each KG2 version.) not on every query?

could you do that during the DTD build? (and then just store the list of IDs in one of your DTD databases, or a separate database if you prefer, which could be added to the database manager)

chunyuma commented 2 years ago

Actually, it's not just all disease-like node IDs. The reason it is a list of all disease-like node IDs here is that in this query, we try to expand n0:'acetaminophen' to n1:'disease-like' nodes:

"add_qnode(name=acetaminophen, key=n0)",
"add_qnode(categories=biolink:Disease, key=n1)",
"add_qedge(subject=n0, object=n1, key=e0)",
"expand(edge_key=e0, kp=DTD, DTD_threshold=0, DTD_slow_mode=True)",
"return(message=true, store=false)"

Perhaps in other queries, people will be interested in expanding from acetaminophen to other categories via DTD expand. (Note that currently we don't check the category provided by the user in slow mode. In other words, people are allowed to provide any kind of category via DTD expand.) So actually, we need a function that can extract all nodes corresponding to the categories provided by the user. Do you think that is feasible?

I think we can pre-store the ID lists corresponding to different categories. However, I'm not sure if we need to consider the hierarchical relations. For example, if the user sets add_qnode(categories=biolink:ChemicalEntity, key=n1), we need to use all nodes corresponding to biolink:ChemicalEntity and also include all its children. I think this is what Expand is currently doing for RTX-KG2, right?

amykglen commented 2 years ago

I think that's right that you would want to do hierarchical reasoning for these category ID lists. if you query the KG2c neo4j by label (e.g., match (n:`biolink:ChemicalEntity`) return n.id), that reasoning should be done for you (since nodes are labeled with their direct categories as well as the ancestors of those categories).

(Plover can't help here since it doesn't currently allow queries where no qnode is "pinned")
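
a rough sketch of how those per-category ID lists could be built during the DTD build (the bolt URL, credentials, and sqlite file/table names here are placeholders, not the real ones):

import sqlite3
from neo4j import GraphDatabase

# Query the KG2c Neo4j by label; the hierarchical reasoning comes for free since nodes
# are labeled with their direct categories plus the ancestors of those categories
driver = GraphDatabase.driver("bolt://kg2-7-4c.rtx.ai:7687", auth=("neo4j", "<password>"))
with driver.session() as session:
    result = session.run("MATCH (n:`biolink:ChemicalEntity`) RETURN n.id AS id")
    node_ids = [record["id"] for record in result]
driver.close()

# Store the ID list so DTD slow mode can look it up per category (hypothetical file/table names)
con = sqlite3.connect("category_node_ids.sqlite")
con.execute("CREATE TABLE IF NOT EXISTS category_nodes (category TEXT, node_id TEXT)")
con.executemany("INSERT INTO category_nodes VALUES (?, ?)",
                [("biolink:ChemicalEntity", node_id) for node_id in node_ids])
con.commit()
con.close()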

amykglen commented 2 years ago

hey @finnagin - have you updated the test triples (for the NCATS repo) for KG2.7.4 yet?

finnagin commented 2 years ago

The pull request for updating the test triples is now in the NCATSTranslator/testing repo.

finnagin commented 2 years ago

Closing, as the SmartAPI registry looks to be updated and everything else not marked as skippable has been checked.