RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Adjust ARAX to work with KG2.3.4 #990

Closed amykglen closed 4 years ago

amykglen commented 4 years ago

documentation about what things will need updating is at: https://github.com/RTXteam/RTX/wiki/Deployment-info#things-that-need-updating-when-rolling-out-a-new-kg2-version

rebuild/edit these and put them in /data/orangeboard/databases/KG2.3.4 on the server:

other to do's:

finally:

amykglen commented 4 years ago

@chunyuma - I'm running into errors with a couple FET tests when testing integration with KG2.3.4 in the kg2-arax-integration branch:

FAILED test_ARAX_workflows.py::test_FET_example_2 - AssertionError: assert 'ERROR' == 'OK'
FAILED test_ARAX_workflows.py::test_FET_example_4 - AssertionError: assert 'ERROR' == 'OK'

here's the error for test_FET_example_2:

  - 2020-09-15 16:38:25.932687 INFO: After Expand, Message.KnowledgeGraph has 1044 nodes and 1758 edges (FET1: 1, FET2: 5, FET3: 132, e00: 2, e01: 5, e02: 133, e03: 1485, n00: 1, n01: 1, n02: 5, n03: 126, n04: 912)
  - 2020-09-15 16:38:25.932818 INFO: Processing action 'overlay' with parameters {'action': 'fisher_exact_test', 'source_qnode_id': 'n03', 'target_qnode_id': 'n04', 'virtual_relation_label': 'FET4'}
  - 2020-09-15 16:38:25.932837 DEBUG: Applying Overlay to Message with parameters {'action': 'fisher_exact_test', 'source_qnode_id': 'n03', 'target_qnode_id': 'n04', 'virtual_relation_label': 'FET4'}
  - 2020-09-15 16:38:25.933574 INFO: Performing Fisher's Exact Test to add p-value to edge attribute of virtual edge
  - 2020-09-15 16:38:25.999913 ERROR: Traceback (most recent call last):
  File "/Users/amyglen/Projects/RTX/code/ARAX/test/../ARAXQuery/Overlay/fisher_exact_test.py", line 161, in fisher_exact_test
    nodes_info[edge.source_id]['edge_index'].append(count)
KeyError: 'UniProtKB:Q96J02'

  - 2020-09-15 16:38:25.999929 ERROR: Something went wrong with retrieving edges in message KG

and the one for test_FET_example_4:

  - 2020-09-15 16:57:23.685977 DEBUG: ARAX/KG2C was used to calculate total adjacent nodes in Fisher's Exact Test
  - 2020-09-15 17:04:38.697293 WARNING: Although ARAX/KG2 was found to have the maximum number of edges connected to both n01 and n02, ARAX/KG1 and cypher query were used to find the total number of nodes with the same type of source node with qnode id n01 as KG2 might have many duplicates
  - 2020-09-15 17:04:38.697321 DEBUG: Total 12089 nodes with node type phenotypic_feature was found in ARAX/KG1
  - 2020-09-15 17:04:38.697325 DEBUG: Computing Fisher's Exact Test P-value
  - 2020-09-15 17:04:39.489769 ERROR: Traceback (most recent call last):
  File "/Users/amyglen/Projects/RTX/code/ARAX/test/../ARAXQuery/Overlay/fisher_exact_test.py", line 774, in _calculate_FET_pvalue_parallel
    pvalue = stats.fisher_exact(contingency_table)[1]
  File "/Users/amyglen/.pyenv/versions/3.7.8/envs/arax/lib/python3.7/site-packages/scipy/stats/stats.py", line 3630, in fisher_exact
    raise ValueError("All values in `table` must be nonnegative.")
ValueError: All values in `table` must be nonnegative.

  - 2020-09-15 17:04:39.489942 ERROR: Something went wrong for target node MONDO:0000001 to calculate FET p-value
  - 2020-09-15 17:04:39.489951 ERROR: Traceback (most recent call last):
  File "/Users/amyglen/Projects/RTX/code/ARAX/test/../ARAXQuery/Overlay/fisher_exact_test.py", line 774, in _calculate_FET_pvalue_parallel
    pvalue = stats.fisher_exact(contingency_table)[1]
  File "/Users/amyglen/.pyenv/versions/3.7.8/envs/arax/lib/python3.7/site-packages/scipy/stats/stats.py", line 3630, in fisher_exact
    raise ValueError("All values in `table` must be nonnegative.")
ValueError: All values in `table` must be nonnegative.

  - 2020-09-15 17:04:39.489957 ERROR: Something went wrong for target node UMLS:C1519221 to calculate FET p-value
  - 2020-09-15 17:04:39.489962 ERROR: Traceback (most recent call last):
  File "/Users/amyglen/Projects/RTX/code/ARAX/test/../ARAXQuery/Overlay/fisher_exact_test.py", line 774, in _calculate_FET_pvalue_parallel
    pvalue = stats.fisher_exact(contingency_table)[1]
  File "/Users/amyglen/.pyenv/versions/3.7.8/envs/arax/lib/python3.7/site-packages/scipy/stats/stats.py", line 3630, in fisher_exact
    raise ValueError("All values in `table` must be nonnegative.")
ValueError: All values in `table` must be nonnegative.

  - 2020-09-15 17:04:39.489967 ERROR: Something went wrong for target node MONDO:0004992 to calculate FET p-value
  - 2020-09-15 17:04:39.489973 ERROR: Traceback (most recent call last):
  File "/Users/amyglen/Projects/RTX/code/ARAX/test/../ARAXQuery/Overlay/fisher_exact_test.py", line 774, in _calculate_FET_pvalue_parallel
    pvalue = stats.fisher_exact(contingency_table)[1]
  File "/Users/amyglen/.pyenv/versions/3.7.8/envs/arax/lib/python3.7/site-packages/scipy/stats/stats.py", line 3630, in fisher_exact
    raise ValueError("All values in `table` must be nonnegative.")
ValueError: All values in `table` must be nonnegative.

  - 2020-09-15 17:04:39.489978 ERROR: Something went wrong for target node MONDO:0005070 to calculate FET p-value

do you have any idea what these might be about? to reproduce them, you would first need to:

  1. check out the kg2-arax-integration branch
  2. replace your config file with one that points to the new KG2/KG2C: scp ubuntu@arax.rtx.ai:/data/orangeboard/databases/KG2.3.4/config.json RTX/code/
  3. replace your NodeSynonymizer with one made from KG2-3-4: scp ubuntu@arax.rtx.ai:/data/orangeboard/databases/KG2.3.4/node_synonymizer.sqlite RTX/code/ARAX/NodeSynonymizer/

chunyuma commented 4 years ago

OK, those two errors are fixed!
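
The thread does not show what the fix was. As a rough illustration only of how these two failures can arise, here is a minimal sketch. The helper names `index_edges_by_source` and `contingency_table` are hypothetical and not taken from `fisher_exact_test.py`. The first guards the kind of lookup that raised the `KeyError` when an edge's source node is missing from the node index. The second shows how a 2x2 table ends up with a negative cell when the population total is smaller than the other counts imply (for example, if the counts come from different graph versions). That cause is an assumption, not a confirmed diagnosis.

```python
def index_edges_by_source(node_ids, edges):
    """Map each known source node id to the positions of its edges.

    `edges` is a list of (source_id, target_id) pairs. Edges whose source
    is not in `node_ids` are returned separately instead of raising KeyError.
    """
    edge_index = {node_id: [] for node_id in node_ids}
    orphans = []
    for position, (source_id, _target_id) in enumerate(edges):
        if source_id in edge_index:
            edge_index[source_id].append(position)
        else:
            orphans.append(position)
    return edge_index, orphans


def contingency_table(overlap, source_count, target_count, population):
    """Build a 2x2 table for Fisher's exact test from four counts.

    Raises ValueError up front if the counts are inconsistent, which is the
    situation scipy.stats.fisher_exact rejects with
    "All values in `table` must be nonnegative."
    """
    table = [
        [overlap, source_count - overlap],
        [target_count - overlap,
         population - source_count - target_count + overlap],
    ]
    if any(cell < 0 for row in table for cell in row):
        raise ValueError(f"inconsistent counts give a negative cell: {table}")
    return table


# An edge whose source is not indexed is reported, not fatal.
edge_index, orphans = index_edges_by_source(
    ["n1"], [("n1", "n2"), ("UniProtKB:Q96J02", "n2")]
)
```

Checking the table before calling `stats.fisher_exact` turns a per-node traceback into one explicit message about which counts disagree.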

chunyuma commented 4 years ago

The COHD database was rebuilt based on KG2.3.4 and named COHDdatabase_v2.0.db. I put it under /data/orangeboard/databases/KG2.3.4 on the arax.rtx.ai server.

Here is a summary for the new COHD database.

Number of nodes of each preferred type used to build the new COHD database:

| Preferred Type | Number of Nodes | Number of Nodes with OMOP ids | Percent (%) |
|---|---|---|---|
| chemical_substance | 2198938 | 63215 | 2.87 |
| protein | 7002 | 940 | 13.42 |
| organism_taxon | 1145 | 650 | 56.77 |
| anatomical_entity | 141 | 119 | 84.4 |
| phenotypic_feature | 67343 | 32738 | 48.61 |
| named_thing | 2593 | 2343 | 90.36 |
| disease | 127271 | 107207 | 84.24 |
| drug | 110973 | 98867 | 89.09 |
| molecular_entity | 10406 | 8719 | 83.79 |
| metabolite | 7 | 4 | 57.14 |
| biological_entity | 55 | 6 | 10.91 |
| ontology_class | 33 | 21 | 63.64 |
| genomic_entity | 38346 | 430 | 1.12 |
| individual_organism | 385 | 360 | 93.51 |
| gross_anatomical_structure | 3697 | 3245 | 87.77 |
| cellular_component | 14 | 13 | 92.86 |
| procedure | 335 | 326 | 97.31 |
| information_content_entity | 1201 | 1011 | 84.18 |
| attribute | 47 | 19 | 40.43 |
| publication | 141 | 94 | 66.67 |
| device | 655 | 652 | 99.54 |
| disease_or_phenotypic_feature | 53171 | 7459 | 14.03 |
| activity_and_behavior | 902 | 879 | 97.45 |
| occurrent | 61546 | 61088 | 99.26 |
| phenomenon | 14780 | 12264 | 82.98 |
| physiological_process | 210 | 165 | 78.57 |
| molecular_activity | 2 | 2 | 100.0 |
| gene | 120 | 3 | 2.5 |
| biological_process | 32 | 24 | 75.0 |
| pathway | 2 | 1 | 50.0 |
| abstract_entity | 5 | 5 | 100.0 |
| gene_grouping | 1 | 1 | 100.0 |
| clinical_intervention | 1 | 1 | 100.0 |
| quantity_value | 3 | 3 | 100.0 |
| material_sample | 2 | 1 | 50.0 |
| provider | 1 | 1 | 100.0 |
| gene_family | 1 | 1 | 100.0 |
| relationship_type | 3 | 0 | 0.0 |
| gene_product | 1 | 0 | 0.0 |
| cell | 1 | 0 | 0.0 |

Note: Some nodes typed as chemical_substance, phenotypic_feature, drug, or disease in KG2.3.4 map to other preferred types in KG2.3.4C.
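
The percent column above appears to be the number of nodes with OMOP ids divided by the number of nodes, times 100, rounded to two decimals. A small sketch that recomputes it for a few rows copied from the summary (the function name `omop_coverage_percent` is made up for illustration):

```python
def omop_coverage_percent(nodes_with_omop, total_nodes):
    """Share of nodes that have an OMOP id, as a percentage with 2 decimals."""
    return round(100 * nodes_with_omop / total_nodes, 2)


# (total nodes, nodes with OMOP ids) for three rows of the summary
rows = {
    "protein": (7002, 940),
    "organism_taxon": (1145, 650),
    "gene": (120, 3),
}

for preferred_type, (total, with_omop) in rows.items():
    print(preferred_type, omop_coverage_percent(with_omop, total))
```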

chunyuma commented 4 years ago

Hi @amykglen, all databases were updated for KG2.3.4 and put in /data/orangeboard/databases/KG2.3.4 on the server. Please help test everything together in your dev environment. Thank you!

amykglen commented 4 years ago

awesome! yes, I'll work on testing everything together. thanks!

amykglen commented 4 years ago

alright, so I tested ARAX+KG2.3.4 in a local setup, using the KG2.3.4-specific config.json, node_synonymizer.sqlite, NGD database, COHD database, and DTD database (stored in /data/orangeboard/databases/KG2.3.4/ on the server), and all is running smoothly! the entire pytest suite passes, including slow tests. (this testing was done in the kg2-arax-integration branch of course, which contains the code changes necessitated by this new KG2 version.)

so I think we are ready to make KG2.3.4 our "production" KG2 at any point now, by:

  1. merging the kg2-arax-integration branch into master
  2. following the rollout recipe @edeutsch put together, which looks like:
    
    # Database cache area
    # Outside container is /data/orangeboard/databases/KG2.3.4
    # Inside container is /mnt/data/orangeboard/databases/KG2.3.4

    # Setup
    INST=devED
    SRC=/mnt/data/orangeboard/databases/KG2.3.4
    DST=/mnt/data/orangeboard/$INST/RTX

    # Roll out code
    cd $DST/code
    git pull

    # Copy config.json
    cd $DST/code
    cp -p $SRC/config.json .

    # Copy NodeSynonymizer
    cd $DST/code/ARAX/NodeSynonymizer
    cp -p $SRC/node_synonymizer.sqlite .

    # Copy NGD
    cd $DST/code/ARAX/ARAXQuery/Overlay/ngd
    cp -p $SRC/curie_to_pmids.sqlite .

    # Copy COHD
    cd $DST/code/ARAX/KnowledgeSources/COHD_local/data
    cp -p $SRC/COHDdatabase_v*.db .

    # Copy Drug-Treats-Disease
    cd $DST/code/ARAX/ARAXQuery/Overlay/predictor/retrain_data
    cp -p $SRC/GRAPH.sqlite .
    cp -p $SRC/LogModel.pkl .

    # Run tests
    cd $DST/code/ARAX/test
    pytest -v --durations=10

    # Restart the service
    INST=devED
    service RTXOpenAPI$INST restart
    sleep 1
    tail -f /tmp/RTXOpenAPI$INST.elog
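
The copy steps in the recipe could also be driven from one table of (file, destination) pairs, so a missing database is caught before the tests run. This is a sketch under assumptions, not part of the actual rollout tooling: `COPY_PLAN` and `copy_databases` are hypothetical names, the file names and directories are the ones listed in the recipe, and `shutil.copy2` only roughly matches `cp -p` (it keeps timestamps and permission bits, not ownership).

```python
import shutil
from pathlib import Path

# (file pattern under SRC, destination directory relative to DST)
COPY_PLAN = [
    ("config.json", "code"),
    ("node_synonymizer.sqlite", "code/ARAX/NodeSynonymizer"),
    ("curie_to_pmids.sqlite", "code/ARAX/ARAXQuery/Overlay/ngd"),
    ("COHDdatabase_v*.db", "code/ARAX/KnowledgeSources/COHD_local/data"),
    ("GRAPH.sqlite", "code/ARAX/ARAXQuery/Overlay/predictor/retrain_data"),
    ("LogModel.pkl", "code/ARAX/ARAXQuery/Overlay/predictor/retrain_data"),
]


def copy_databases(src, dst):
    """Copy every planned file from `src` into its directory under `dst`.

    Returns the list of copied file paths. Raises FileNotFoundError if a
    pattern matches nothing or a destination directory does not exist.
    """
    copied = []
    for pattern, rel_dir in COPY_PLAN:
        matches = sorted(Path(src).glob(pattern))
        if not matches:
            raise FileNotFoundError(f"no file matching {pattern} in {src}")
        target_dir = Path(dst) / rel_dir
        if not target_dir.is_dir():
            raise FileNotFoundError(f"missing destination {target_dir}")
        for match in matches:
            # copy2 preserves modification time and mode, similar to cp -p
            copied.append(Path(shutil.copy2(match, target_dir)))
    return copied
```

With the recipe's variables, a call would look like `copy_databases("/mnt/data/orangeboard/databases/KG2.3.4", "/mnt/data/orangeboard/devED/RTX")`. The `git pull`, `pytest`, and service restart steps stay as shell commands.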