RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Roll out KG2.7.4 (Biolink 2.2.6) #1728

Closed: amykglen closed this issue 2 years ago

amykglen commented 2 years ago
1. Build and load KG2c:
2. Rebuild downstream databases:

Copies of all of these should be put in /data/orangeboard/databases/KG2.X.Y on arax.ncats.io.

NOTE: As databases are rebuilt, the new copy of config_local.json will need to be updated to point to their new paths. However, if the rollout of KG2 has already occurred, then you should update the master configv2.json directly.

3. Update the ARAX codebase:

Associated code changes should go in the kg2integration branch.

4. Do the rollout:
5. Final items/clean up:
amykglen commented 2 years ago

alright, the synonymizer+KG2c build is ongoing on buildkg2c.rtx.ai (in a screen session, from the kg2integration branch).

to kick off the build, all I did was 1) update (locally) RTX/code/kg2c/kg2c_config.json to look like this:

{
  "kg2pre_version": "2.7.4",
  "kg2pre_neo4j_endpoint": "kg2endpoint3.rtx.ai",
  "biolink_version": "2.2.6",
  "upload_to_arax.ncats.io": true,
  "upload_directory": "/data/orangeboard/databases/KG2.7.4",
  "synonymizer": {
    "build": true,
    "name": "node_synonymizer_v1.0_KG2.7.4.sqlite"
  },
  "kg2c": {
    "build": true,
    "use_nlp_to_choose_descriptions": true,
    "upload_to_s3": true,
    "start_from_kg2c_json": false,
    "use_local_kg2pre_tsvs": false
  }
}

and 2) run:

python3 RTX/code/kg2c/build_kg2c.py

(note: I made sure to create an (empty) /data/orangeboard/databases/KG2.7.4 directory on arax.ncats.io before starting the build)

amykglen commented 2 years ago

if all goes well the build should be done this evening (at which point I'll take care of loading it into Plover)

saramsey commented 2 years ago

Thank you!

saramsey commented 2 years ago

Just an FYI that in KG2pre, the edge property formerly called relation is now called original_predicate, per a change from Biolink 2.1 to Biolink 2.2. Not sure if this will break anything in the KG2c build process. Details in RTX-KG2 issue 165.

amykglen commented 2 years ago

the synonymizer build completed successfully and things seem fine so far with that, but the KG2c build errored out while using BiolinkHelper due to some strange mixin predicates in 2.2.6. fixed that issue and resumed the build.

amykglen commented 2 years ago

alright, the new KG2c is ready in Neo4j: http://kg2-7-4c.rtx.ai:7474/browser/

everything looks fine so far on spot checking. upload to PloverDB is in progress.

amykglen commented 2 years ago

KG2c has been loaded into Plover and all necessary downstream databases have been rebuilt. will test everything together tomorrow morning.

amykglen commented 2 years ago

actually, ran the ARAX test suite tonight and all fast tests passed on the first try! I'm impressed.

I'll do some deeper testing (e.g., Expand's slow tests) tomorrow morning.

saramsey commented 2 years ago

This is great! Thank you @amykglen !!

chunyuma commented 2 years ago

Hi @finnagin, do we need slim databases for Travis at this point? Currently, due to the limited time, I have only built the refreshed database; the full databases might need more time. Just want to see if you also need a slim version of the refreshed database. Thanks!

finnagin commented 2 years ago

@chunyuma We do still need those, but since they're only used for testing and not the actual system, I don't think the slim databases need to make the deadline. Though @amykglen, we will also need to come up with a way to generate slim kg2c and node synonymizer versions if we want Travis to run.

amykglen commented 2 years ago

ah, yeah, I dropped the ball on the slim database thing. I added an agenda item for this week's AHM to touch base on that! (not a blocker for this KG2 rollout)

amykglen commented 2 years ago

everything still looks good on further testing - one slow DTD expand test is failing (test_dtd_expand_2), though perhaps that would be fixed once the full DTD rebuild is done? maybe @chunyuma could take a look, but I don't think it's critical for the rollout...

chunyuma commented 2 years ago

Hi @amykglen, sorry for the late response. For test_dtd_expand_2, it seems like the error comes from KG2c. Based on this test case's query, the code generates a Neo4j query, but that query doesn't return anything from the KG2c Neo4j. Could you please help take a look?

Here is the neo4j query for this test:

MATCH (n0:`biolink:SmallMolecule` {id:'CHEMBL.COMPOUND:CHEMBL112'})-[e0:`['biolink:related_to']`]-(n1) WHERE (n1:`biolink:Disease` OR n1:`biolink:DiseaseOrPhenotypicFeature` OR n1:`biolink:PhenotypicFeature`) WITH collect(distinct n0) as nodes_n0, collect(distinct n1) as nodes_n1, collect(distinct e0{.*, id:ID(e0), n0:n0.id, n1:n1.id}) as edges_e0 RETURN nodes_n0, nodes_n1, edges_e0

It returns nothing (screenshot of the empty Neo4j browser result omitted).

chunyuma commented 2 years ago

@amykglen, I think I figured out the problem. It seems like the function _get_cypher_for_query_edge is deprecated now. This might be an old function that Expand used to create the Neo4j query. Perhaps there are now other functions somewhere in Expand that handle this. I'm modifying this function to solve the error temporarily. Could you please let me know where I can find the new function that replaces it, so that we can keep everything consistent? Thanks!

saramsey commented 2 years ago

Should we add "slim database" to the KG2c template checklist? (maybe it's already on there, I didn't check).

amykglen commented 2 years ago

Should we add "slim database" to the KG2c template checklist? (maybe it's already on there, I didn't check).

yep, we have an item for slim databases already

amykglen commented 2 years ago

@amykglen, I think I figured out the problem. It seems like the function _get_cypher_for_query_edge is deprecated now. This might be an old function that Expand used to create the Neo4j query. Perhaps there are now other functions somewhere in Expand that handle this. I'm modifying this function to solve the error temporarily. Could you please let me know where I can find the new function that replaces it, so that we can keep everything consistent? Thanks!

the rest of Expand doesn't use neo4j at all anymore, so there is no current _get_cypher_for_query_edge function. what do you use neo4j for in DTD? would it maybe be possible to query Plover instead?

chunyuma commented 2 years ago

the rest of Expand doesn't use neo4j at all anymore, so there is no current _get_cypher_for_query_edge function. what do you use neo4j for in DTD? would it maybe be possible to query Plover instead?

@amykglen, the DTD querier has two modes, "fast mode" and "slow mode". "Fast mode" queries the DTD database directly, while "slow mode" calls the DTD model and computes the drug repurposing probability on the fly. So when we use "slow mode", we need the _get_cypher_for_query_edge function to query the possible subject or object nodes based on the query_graph. Is it possible to query Plover for this purpose?

Take test_dtd_expand_2 as example:

"add_qnode(name=acetaminophen, key=n0)",
"add_qnode(categories=biolink:Disease, key=n1)",
"add_qedge(subject=n0, object=n1, key=e0)",
"expand(edge_key=e0, kp=DTD, DTD_threshold=0, DTD_slow_mode=True)",
"return(message=true, store=false)"

the "slow mode" needs to know what "n1" nodes should be paired with acetaminophen to compute their probabilities by using the model.

amykglen commented 2 years ago

so you mean you need to run the one-hop query on KG2 to get diseases connected to acetaminophen?

you can do that with Plover like so:

import requests
from RTXConfiguration import RTXConfiguration  # from the RTX codebase; assumes RTX/code is on the Python path

trapi_qg = {
    "edges": {
        "e00": {
            "subject": "n00",
            "object": "n01",
        }
    },
    "nodes": {
        "n00": {
            "ids": ["CHEMBL.COMPOUND:CHEMBL112"]
        },
        "n01": {
            "categories": ["biolink:Disease"]
        }
    }
}
rtxc = RTXConfiguration()
response = requests.post(f"{rtxc.plover_url}/query", json=trapi_qg, headers={'accept': 'application/json'})

by default it will return answers in this format (only including node/edge IDs):

{
   "edges":{
      "e00":[
         19308544,
         26624039,
         11296815,
         12484663,
         15564856,
         9568317,
         12222530,
         23814212,
         12222534,
         11395143,
         11214936,
         16932955,
          ...
      ]
   },
   "nodes":{
      "n00":[
         "CHEMBL.COMPOUND:CHEMBL112"
      ],
      "n01":[
         "MESH:D014886",
         "MONDO:0009323",
         "MONDO:0020722",
         "MONDO:0001384",
         "MONDO:0003406",
         "UMLS:C0429001",
         "MESH:D010539",
         "MONDO:0007254",
         "MESH:D020078",
         "CHEMBL.COMPOUND:CHEMBL326958",
         "UMLS:C0375314",
         "MONDO:0100053",
         "MONDO:0005812",
         "MONDO:0005010",
         "MONDO:0001246",
         "MONDO:0001046",
         "MESH:D048949",
         "MONDO:0002334",
         "MONDO:0004553",
         "MONDO:0007186",
         "MESH:D014950",
         "MONDO:0100192",
         "UMLS:C0442797",
         "UMLS:C0231225",
         "MONDO:0005101",
         "MONDO:0010667",
         "MONDO:0001156",
           ...
      ]
   }
}

but if you want more info included in the results you can add "include_metadata": True to your query graph
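
e.g., the same query as above with that flag added (just a sketch; I haven't run this exact snippet):

import requests
from RTXConfiguration import RTXConfiguration  # as in the snippet above

# Same one-hop query as above, but asking Plover to include node/edge metadata in the response
trapi_qg = {
    "edges": {
        "e00": {
            "subject": "n00",
            "object": "n01",
        }
    },
    "nodes": {
        "n00": {
            "ids": ["CHEMBL.COMPOUND:CHEMBL112"]
        },
        "n01": {
            "categories": ["biolink:Disease"]
        }
    },
    "include_metadata": True  # ask for more than just node/edge IDs
}
rtxc = RTXConfiguration()
response = requests.post(f"{rtxc.plover_url}/query", json=trapi_qg, headers={'accept': 'application/json'})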

chunyuma commented 2 years ago

Thanks @amykglen. If I only want all nodes with categories 'biolink:Disease', 'biolink:DiseaseOrPhenotypicFeature', or 'biolink:PhenotypicFeature', I think Plover can also do this by modifying the trapi_qg like:

import requests
from RTXConfiguration import RTXConfiguration  # as in the snippet above

trapi_qg = {
    "nodes": {
        "n00": {
            "categories": ['biolink:Disease', 'biolink:DiseaseOrPhenotypicFeature', 'biolink:PhenotypicFeature']
        }
    }
}
rtxc = RTXConfiguration()
response = requests.post(f"{rtxc.plover_url}/query", json=trapi_qg, headers={'accept': 'application/json'})

Is that right?

amykglen commented 2 years ago

yep!

amykglen commented 2 years ago

or wait, so you're trying to get all disease-like nodes in KG2? (not just connected to acetaminophen?) not sure whether that would work...

chunyuma commented 2 years ago

@amykglen, yes, I'm thinking that DTD expand should be independent of RTX-KG2c, right? That means DTD expand generates edges from the DTD model with probability above a certain threshold, and those edges may or may not already exist in RTX-KG2c. So back to the acetaminophen case, DTD expand should consider all disease-like nodes in KG2, use the DTD model to calculate their probabilities, and then expand with those edges, right?

amykglen commented 2 years ago

ah, ok, I didn't realize you're looking up all disease-like nodes. yeah, that won't work with Plover.

so you really only need to get the list of all disease-like node IDs once, right? (for each KG2 version.) not on every query?

could you do that during the DTD build? (and then just store the list of IDs in one of your DTD databases, or a separate database if you prefer, which could be added to the database manager)

chunyuma commented 2 years ago

Actually, it's not just all disease-like node IDs. The reason it is a list of all disease-like node IDs here is that in this query, we try to expand n0:'acetaminophen' to n1:'disease-like' nodes:

"add_qnode(name=acetaminophen, key=n0)",
"add_qnode(categories=biolink:Disease, key=n1)",
"add_qedge(subject=n0, object=n1, key=e0)",
"expand(edge_key=e0, kp=DTD, DTD_threshold=0, DTD_slow_mode=True)",
"return(message=true, store=false)"

Perhaps in other queries, people will be interested in expanding from acetaminophen to other categories via DTD expand. (Note that currently we don't check the category provided by the user in slow mode. In other words, people are allowed to provide any kind of category via DTD expand.) So actually, we need a function that can extract all nodes corresponding to the categories provided by the user. Do you think that is feasible?

I think we can pre-store the ID lists corresponding to different categories. However, I'm not sure if we need to consider the hierarchical relations. For example, if the user sets add_qnode(categories=biolink:ChemicalEntity, key=n1), we need to use all nodes corresponding to biolink:ChemicalEntity and also include all its children. I think this is what Expand is currently doing for RTX-KG2, right?

amykglen commented 2 years ago

I think that's right that you would want to do hierarchical reasoning for these category ID lists. if you query the KG2c neo4j by label (e.g., match (n:`biolink:ChemicalEntity`) return n.id), that reasoning should be done for you (since nodes are labeled with their direct categories as well as the ancestors of those categories).

(Plover can't help here since it doesn't currently allow queries where no qnode is "pinned")
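
a rough sketch of how those per-category ID lists could be built during the DTD build (the bolt URL, credentials, and sqlite file/table names here are placeholders, not the real ones):

import sqlite3
from neo4j import GraphDatabase

# Query the KG2c Neo4j by label; the hierarchical reasoning comes for free since nodes
# are labeled with their direct categories plus the ancestors of those categories
driver = GraphDatabase.driver("bolt://kg2-7-4c.rtx.ai:7687", auth=("neo4j", "<password>"))
with driver.session() as session:
    result = session.run("MATCH (n:`biolink:ChemicalEntity`) RETURN n.id AS id")
    node_ids = [record["id"] for record in result]
driver.close()

# Store the ID list so DTD slow mode can look it up per category (hypothetical file/table names)
con = sqlite3.connect("category_node_ids.sqlite")
con.execute("CREATE TABLE IF NOT EXISTS category_nodes (category TEXT, node_id TEXT)")
con.executemany("INSERT INTO category_nodes VALUES (?, ?)",
                [("biolink:ChemicalEntity", node_id) for node_id in node_ids])
con.commit()
con.close()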

amykglen commented 2 years ago

hey @finnagin - have you updated the test triples (for the NCATS repo) for KG2.7.4 yet?

finnagin commented 2 years ago

The pull request for updating the test triples is now in the NCATSTranslator/testing repo.

finnagin commented 2 years ago

Closing, as the SmartAPI registry looks to be updated and everything else not marked as skippable has been checked.