RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Local copy and implementation of COHD #875

Closed dkoslicki closed 4 years ago

dkoslicki commented 4 years ago

As we've noticed, calling out to COHD via its API can be very slow. I reached out to Casey Ta, and he graciously sent me a dump of their entire database! It can be accessed here. He also offered a MySQL dump if that's more helpful.

To speed up our overlay queries, it would be great if this could be integrated into our system (similar to our fast_ngd, à la #654).

A few details:

  1. COHD uses OMOP concept IDs for its identifiers.
  2. This requires me to hit the endpoint xrefToOMOP, which basically just uses OxO to map CURIEs to OMOP concept IDs. Example in our code here.

So to accomplish this, we would want to spin up a database that:

Medium high-priority as this would greatly reduce the time it takes for me (and others) to come up with good example DSL for the query_graph_interpreter_templates.yaml.

Looking for volunteers to take this task on, so I will assign several people. De-assign yourself if you don't have the bandwidth to do this.

dkoslicki commented 4 years ago

Note, there's also an RDF triple store dump, so there's an off chance that we could get OMOP concepts into KG2 directly and skip the OxO mapping part. I would need @saramsey to comment on whether this is feasible (since we would need to connect the diseases, drugs, etc. to the OMOP nodes).

From Casey

Vincent has also done a bunch of work converting the COHD data into RDF modeled in Biolink. If you're interested, you can query that at https://trek.semanticscience.org/ or hit the download link to get the RDF dumps.

chunyuma commented 4 years ago

@dkoslicki, I think I have time to do it. I have already gone through the scripts under the Overlay folder, so I know how overlay_clinical_info.py works. One question is where we should store this database.

chunyuma commented 4 years ago

"script-ify" everything so that when KG2 and/or COHD is updated, the backend database can be updated as well (and so devs (like me) can have local copies of the database as needed as well).

I'm a little confused about this. Do we want to create a script that automatically pulls down the data from KG2 and/or COHD when they are updated and then integrates it into the local database? When COHD is updated, will we get an updated dump file of the entire database?

dkoslicki commented 4 years ago

@chunyuma Excellent! Thanks for volunteering!!

Re: “script-ify” I basically mean: create scripts that will automate every step of the process. That way, it can be deployed anywhere as needed. Similar to how python3 KGNodeIndex.py -b auto-creates the KGNode index, and can be run on local dev machines, or prod servers as needed.

As for where the database is to be stored, I would say: get the scripts/classes/etc. up and running on your own instance, and then after everything is checked that it works, we can work with Eric to get it created on the production server as well. I’d suggest doing this in a different branch so we can seamlessly integrate into master after you’ve got everything put together and confirmed it works.

Minor details:

  1. You may want to make the code auto-download the data dump, or give it a CLI option to point to a different dump (as this won’t be stored in GitHub). That way we can “future-proof” for new releases of COHD.
  2. The OxO mapping of disease, drug, etc. nodes to OMOP concept IDs will probably be non-trivial and take quite a while, so making this a script as well makes sense.
  3. As for where all these scripts will live, a natural place seems to be code/ARAX/KnowledgeSources, under a folder you create with a name something like “COHD_local”.
  4. Your scripts may/will download data to that “COHD_local” folder, but the data shouldn’t be checked into GitHub (it will be too large). The drop-in replacement (e.g. COHD_local.get_paired_concept_freq(omop1, omop2, 3)) will know to look there for the data, similar to how KGNodeIndex creates a MySQL database that exists but isn’t checked into GitHub.
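
As a rough illustration of item 1, a build script could accept an optional path to the dump. This is only a sketch: the function names, the `--dump` flag, and the default location under a "COHD_local" folder are assumptions made for illustration, not existing code in the repo.

```python
import argparse
from pathlib import Path

# Hypothetical default location; the real folder layout is an assumption.
DEFAULT_DUMP = Path("COHD_local") / "cohd-v2.tar.gz"


def parse_args(argv=None):
    """Parse command-line options for a (hypothetical) database build script."""
    parser = argparse.ArgumentParser(
        description="Build a local COHD database from a data dump."
    )
    parser.add_argument(
        "--dump",
        type=Path,
        default=DEFAULT_DUMP,
        help="Path to a COHD dump archive (defaults to the copy in COHD_local).",
    )
    return parser.parse_args(argv)


def resolve_dump(path):
    """Return the dump path, or raise a clear error if the file is missing."""
    if not path.is_file():
        raise FileNotFoundError(
            f"COHD dump not found at {path}; download it or pass --dump."
        )
    return path
```

Pointing `--dump` at a newer archive would then be enough to rebuild against a new COHD release without touching the code.
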
chunyuma commented 4 years ago

@dkoslicki, it seems the data dump can't be downloaded by code. I suspect this is due to pCloud's encryption, which only allows users to download the data from the browser, even though the data can be shared with different people. I tried curl, wget and the requests module in Python, and none of them worked. We can write a script to auto-download the data, but only if the host allows the data to be downloaded by code.

dkoslicki commented 4 years ago

@chunyuma The data can't be downloaded with wget or the like without some workarounds. If you look at the page source of the data dump, the link is provided on line 1013, but it would need to be parsed properly: "downloadlink": "https:\/\/p-def7.pcloud.com\/cBZE9ntPmZdY1gS2ZZZv0rJ37Z2ZZGoFZkZELjcdpZupZapZDFZgFZAFZcJZ0pZspZxFZQpZu5ZVkZ57ZvFZ3ibtkZC5SrMxjQv8pIjBPE02dHTkR05OyX\/cohd-v2.tar.gz",

Let's not worry about this for now, as later we can work with Casey Ta to see if he can place it somewhere that we can download more easily.
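
For reference, parsing such an escaped link out of a page source might look like the sketch below. It assumes the page embeds a JSON-style "downloadlink" key as in the snippet quoted above; the function name and the sample fragment (an example.org URL) are made up for illustration, and the approach says nothing about whether the resulting link can actually be fetched by a script.

```python
import json
import re

# Matches the JSON string literal that follows a "downloadlink" key.
_LINK_RE = re.compile(r'"downloadlink"\s*:\s*("(?:[^"\\]|\\.)*")')


def extract_download_link(page_source):
    """Return the unescaped download URL, or None if the key is absent."""
    match = _LINK_RE.search(page_source)
    if match is None:
        return None
    # json.loads turns the escaped "\/" sequences back into plain "/".
    return json.loads(match.group(1))


# Made-up page fragment for illustration; not the real pCloud page.
sample = r'{"downloadlink": "https:\/\/example.org\/files\/cohd-v2.tar.gz", "size": 1}'
print(extract_download_link(sample))
```
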

saramsey commented 4 years ago

@chunyuma The data can't be downloaded with wget or the like without some workarounds. If you look at the page source of the data dump, the link is provided on line 1013, but it would need to be parsed properly: "downloadlink": "https:\/\/p-def7.pcloud.com\/cBZE9ntPmZdY1gS2ZZZv0rJ37Z2ZZGoFZkZELjcdpZupZapZDFZgFZAFZcJZ0pZspZxFZQpZu5ZVkZ57ZvFZ3ibtkZC5SrMxjQv8pIjBPE02dHTkR05OyX\/cohd-v2.tar.gz",

Let's not worry about this for now, as later we can work with Casey Ta to see if he can place it somewhere that we can download more easily.

This link no longer seems to work:

wget https://p-def7.pcloud.com/cBZE9ntPmZdY1gS2ZZZv0rJ37Z2ZZGoFZkZELjcdpZupZapZDFZgFZAFZcJZ0pZspZxFZQpZu5ZVkZ57ZvFZ3ibtkZC5SrMxjQv8pIjBPE02dHTkR05OyX/cohd-v2.tar.gz
--2020-07-08 16:06:53--  https://p-def7.pcloud.com/cBZE9ntPmZdY1gS2ZZZv0rJ37Z2ZZGoFZkZELjcdpZupZapZDFZgFZAFZcJZ0pZspZxFZQpZu5ZVkZ57ZvFZ3ibtkZC5SrMxjQv8pIjBPE02dHTkR05OyX/cohd-v2.tar.gz
Resolving p-def7.pcloud.com (p-def7.pcloud.com)... 74.120.9.15
Connecting to p-def7.pcloud.com (p-def7.pcloud.com)|74.120.9.15|:443... connected.
HTTP request sent, awaiting response... 410 Gone
2020-07-08 16:06:53 ERROR 410: Gone.
dkoslicki commented 4 years ago

Odd. I guess after we get a local system up and running, we can work with Casey to make the data dump persistent and accessible.

chunyuma commented 4 years ago

@saramsey, it works with this link

https://p-def7.pcloud.com/cBZE9ntPmZdY1gS2ZZZldlF37Z2ZZF00ZkZELjcdpZupZapZDFZgFZAFZcJZ0pZspZxFZQpZu5ZVkZ57ZvFZ3ibtkZHDNaREulz4yiRznCQ1HJb05lPECy/cohd-v2.tar.gz

But it can only be downloaded via a browser, not with wget.

chunyuma commented 4 years ago

Update: The local COHD database has been built and is stored on arax.rtx.ai, and the new script COHDIndex.py can replace all functions of the original script QueryCOHD.py used in overlay_clinical_info.py. I carefully compared the results from COHDIndex and QueryCOHD: they return the same results, and COHDIndex is significantly faster than QueryCOHD in overlay(action=overlay_clinical_info,...).

I also modified overlay_clinical_info.py in the issue875 branch to make it compatible with COHDIndex.py. COHDIndex passed all tests in test_ARAX_overlay.py. Here is the result:

(RTX_env) ubuntu@ip-172-31-10-28:~/work/RTX/code/ARAX/test$ pytest -v test_ARAX_overlay.py
===================================================================== test session starts ======================================================================
platform linux -- Python 3.7.7, pytest-5.4.3, py-1.8.1, pluggy-0.13.1 -- /home/ubuntu/miniconda3/envs/RTX_env/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/work/RTX/code/ARAX/test
collected 18 items

test_ARAX_overlay.py::test_jaccard PASSED                                                                                                                [  5%]
test_ARAX_overlay.py::test_add_node_pmids PASSED                                                                                                         [ 11%]
test_ARAX_overlay.py::test_compute_ngd_virtual PASSED                                                                                                    [ 16%]
test_ARAX_overlay.py::test_compute_ngd_attribute PASSED                                                                                                  [ 22%]
test_ARAX_overlay.py::test_FET_ex1 PASSED                                                                                                                [ 27%]
test_ARAX_overlay.py::test_FET_ex2 PASSED                                                                                                                [ 33%]
test_ARAX_overlay.py::test_paired_concept_frequency_virtual PASSED                                                                                       [ 38%]
test_ARAX_overlay.py::test_paired_concept_frequency_attribute PASSED                                                                                     [ 44%]
test_ARAX_overlay.py::test_observed_expected_ratio_virtual PASSED                                                                                        [ 50%]
test_ARAX_overlay.py::test_observed_expected_ratio_attribute PASSED                                                                                      [ 55%]
test_ARAX_overlay.py::test_chi_square_virtual PASSED                                                                                                     [ 61%]
test_ARAX_overlay.py::test_chi_square_attribute PASSED                                                                                                   [ 66%]
test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual PASSED                                                                                    [ 72%]
test_ARAX_overlay.py::test_predict_drug_treats_disease_attribute PASSED                                                                                  [ 77%]
test_ARAX_overlay.py::test_issue_832 PASSED                                                                                                              [ 83%]
test_ARAX_overlay.py::test_issue_832_non_drug PASSED                                                                                                     [ 88%]
test_ARAX_overlay.py::test_issue_840 PASSED                                                                                                              [ 94%]
test_ARAX_overlay.py::test_issue_840_non_drug PASSED                                                                                                     [100%]

Currently, the COHDIndex only works for KG1.

The next steps that need to be done:

Once these steps are done, it can be merged into the master branch.

saramsey commented 4 years ago

This link goes to the download page at the Athena portal for downloading mappings between OMOP and other biomedical vocabularies like ICD10, MeSH, SNOMED, LOINC, RXNORM, etc.

https://athena.ohdsi.org/vocabulary/list

dkoslicki commented 4 years ago

So @chunyuma, the plan is: use node_synonymizer.py -l <curie> -k KG2 to look up all synonyms that belong to the biomedical vocabularies that map to OMOP concept IDs in the link Steve gave above. Use this to populate the CURIE to OMOP mappings in your sqlite database.
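
A minimal sketch of that population step, using Python's built-in sqlite3 module. The table name, the function names, and the shapes of the two input dictionaries are assumptions made for illustration; they are not the actual NodeSynonymizer output format or the Athena file layout.

```python
import sqlite3


def build_mapping_table(conn, curie_synonyms, vocab_to_omop):
    """Populate a CURIE -> OMOP concept ID table.

    curie_synonyms: dict mapping a KG CURIE to a list of its synonym CURIEs
        (assumed shape of a synonym lookup result).
    vocab_to_omop: dict mapping a vocabulary CURIE to a list of OMOP concept
        IDs (assumed to be parsed from the downloaded vocabulary files).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS curie_to_omop ("
        "curie TEXT NOT NULL, omop_id INTEGER NOT NULL, "
        "PRIMARY KEY (curie, omop_id))"
    )
    rows = set()
    for curie, synonyms in curie_synonyms.items():
        # Check the CURIE itself as well as each of its synonyms.
        for candidate in {curie, *synonyms}:
            for omop_id in vocab_to_omop.get(candidate, []):
                rows.add((curie, omop_id))
    conn.executemany(
        "INSERT OR IGNORE INTO curie_to_omop VALUES (?, ?)", sorted(rows)
    )
    conn.commit()


def get_omop_ids(conn, curie):
    """Return the sorted OMOP concept IDs stored for a CURIE (possibly empty)."""
    cur = conn.execute(
        "SELECT omop_id FROM curie_to_omop WHERE curie = ? ORDER BY omop_id",
        (curie,),
    )
    return [row[0] for row in cur.fetchall()]
```

Because a CURIE collects the OMOP IDs of all of its synonyms, one CURIE can end up with several OMOP IDs, which is why the table is keyed on the (curie, omop_id) pair.
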

chunyuma commented 4 years ago

Do we need the CDT (Current Dental Terminology (ADA)) and MedDRA (Medical Dictionary for Regulatory Activities (MSSO)) data? From the NodeSynonymizer results, we have 156,982 CURIEs that have a MedDRA synonym and 6 CURIEs that have a CDT synonym. However, downloading these data from the Athena portal requires a license.

dkoslicki commented 4 years ago

@chunyuma do these CDT and MedDRA CURIEs correspond to drugs, chemical_substances, and/or phenotypic_features? If so, then the 156K MedDRA CURIEs seem the most important to get (missing 6 CURIEs from CDT doesn't seem like too big of a deal). Perhaps @saramsey already has a license for CDT since, if it's in the NodeSynonymizer, it must be in KG2 somewhere...

chunyuma commented 4 years ago

Of these 156,982 MedDRA CURIEs, 2,273 are drug, 1,395 are chemical_substance, 119,382 are disease and 33,932 are phenotypic_feature.

Of the 6 CDT CURIEs, 2 are drug, 3 are chemical_substance and 1 is phenotypic_feature.

I also sent an email to OHDSI (Observational Health Data Sciences and Informatics), and they told me that for MedDRA, we as an academic institution might get a free license from the MSSO.

dkoslicki commented 4 years ago

@chunyuma great that we might get a free license! Check with @saramsey whether we already have a license, and if not, let us (me and/or Steve) know how we can help with getting the license for MedDRA.

chunyuma commented 4 years ago

@dkoslicki, it seems we may already have the license for MedDRA, because I saw that @saramsey previously posted a related issue, #891, about MedDRA. We can check with @saramsey at the AHM on Wednesday. If we don't have one, I might need help from you or Steve to get this license.

I'm currently running the script to map the CURIEs to OMOP IDs based on the data I can get from the Athena portal, which might take a few days. Once we have the MedDRA data, we can simply add it to the existing data.

chunyuma commented 4 years ago

Update: Since Steve told me that we don't have a license to download the MedDRA data from the Athena portal, I'm trying to contact the MedDRA MSSO Help Desk to see if we can get a free, special license.

Based on the other data that I can download from the Athena portal, I mapped all KG1 and KG2 disease, phenotype, chemical_substance, and drug nodes to a (possibly multi-element) list of OMOP concept IDs. Here is the summary:

KG1: 33,888 disease, phenotype and chemical_substance nodes in total. Of these, 18,981 nodes were found to have at least one OMOP ID via the COHD API, while 15,253 nodes were found to have one via the data downloaded from the Athena portal (excluding MedDRA data).

KG2: 3,229,158 disease, phenotype, chemical_substance and drug nodes in total. Of these, 443,933 nodes were found to have at least one OMOP ID via the data downloaded from the Athena portal (excluding MedDRA data).

dkoslicki commented 4 years ago

Thanks for the update, @chunyuma! Let us know if you run into any problems with the MedDRA license. Now that you have mapped many of the nodes to OMOP IDs, if they don't want to give you a license, we could at least hit COHD for the remaining unmapped CURIEs, which are far fewer thanks to your work: 156K calls is definitely better than millions.

chunyuma commented 4 years ago

@dkoslicki, for KG2, only 443,933 nodes were mapped to at least one OMOP ID. That leaves roughly 2.8 million unmapped CURIEs (3,229,158 - 443,933 = 2,785,225 of the disease, phenotype, chemical_substance and drug nodes in KG2), so I don't think it is practical to call the COHD API for them. I also suspect that many of these unmapped CURIEs simply have no OMOP ID, even via the COHD API.

Also, if we include the synonyms of these remaining CURIEs when calling the COHD API, the number of calls might be much larger than 2.8 million.

chunyuma commented 4 years ago

@dkoslicki, actually, not all of those 156K MedDRA-associated CURIEs lack OMOP IDs.

Say a CURIE has multiple synonyms from the NodeSynonymizer, including Mesh:xxxx, ATC:xxx, ICD10CM:xxx and MedDRA:xxx, and Mesh:xxxx, ATC:xxx and ICD10CM:xxx each have a corresponding OMOP ID. Then this CURIE already has three OMOP IDs. Once we can download the MedDRA data, we may simply be adding more OMOP IDs to the existing OMOP list of such CURIEs.

chunyuma commented 4 years ago

Update: After including the MedDRA data, the number of nodes with at least one OMOP ID is 15,842 in KG1 and 480,123 in KG2.

@dkoslicki, if we don't need to hit the COHD API for the remaining unmapped CURIEs, then I can start building the database.

chunyuma commented 4 years ago

Update: After hitting the OxO API, the number of nodes with at least one OMOP ID increases to 21,148 in KG1 and to 502,605 in KG2.

dkoslicki commented 4 years ago

Awesome, thanks @chunyuma! For each of the relevant node types (eg. chemical_substance, disease, etc.), what is that in terms of percent? I.e. disease: X% of nodes with at least one OMOP id
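
One way to produce such a breakdown, sketched under the assumption that the mapping results are available as (node type, has OMOP ID) pairs; the function name and input shape are hypothetical.

```python
from collections import Counter


def coverage_by_type(nodes):
    """Summarize OMOP coverage per node type.

    nodes: iterable of (node_type, has_omop_id) pairs.
    Returns {node_type: (mapped, total, percent)}, percent rounded to 2 places.
    """
    totals = Counter()
    mapped = Counter()
    for node_type, has_omop in nodes:
        totals[node_type] += 1
        if has_omop:
            mapped[node_type] += 1
    return {
        t: (mapped[t], totals[t], round(100.0 * mapped[t] / totals[t], 2))
        for t in totals
    }
```
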

chunyuma commented 4 years ago

@dkoslicki, here are some statistics for KG1 and KG2.

KG1

| type | number | percent |
|---|---|---|
| chemical_substance | 2,226 | 6.57% |
| disease | 19,573 | 57.76% |
| phenotypic_feature | 12,089 | 35.67% |
| total | 33,888 | 100% |

Within chemical_substance

| prefix | has OMOP ids | percent | total |
|---|---|---|---|
| CHEMBL.COMPOUND | 2,104 | 94.52% | 2,226 |
| total | 2,104 | 94.52% | 2,226 |

Within disease

| prefix | has OMOP ids | percent | total |
|---|---|---|---|
| DOID | 8,132 | 73.31% | 11,092 |
| MONDO | 1 | 100% | 1 |
| OMIM | 6,597 | 77.79% | 8,480 |
| total | 14,730 | 75.26% | 19,573 |

Within phenotypic_feature

| prefix | has OMOP ids | percent | total |
|---|---|---|---|
| AQTLTrait | 0 | 0% | 1 |
| HP | 4,314 | 35.69% | 12,088 |
| total | 4,314 | 35.69% | 12,089 |

KG2

| type | number | percent |
|---|---|---|
| chemical_substance | 2,164,168 | 67.02% |
| disease | 252,301 | 7.81% |
| drug | 673,599 | 20.86% |
| phenotypic_feature | 139,090 | 4.31% |
| total | 3,229,158 | 100% |

Within chemical_substance

| prefix | has OMOP ids | percent | total |
|---|---|---|---|
| CHEBI | 8,575 | 7.61% | 112,632 |
| GS | 7 | 1.31% | 536 |
| FOODON | 1 | 7.69% | 13 |
| CHV | 708 | 38.33% | 1,847 |
| NCI_DICOM | 0 | 0.0% | 1 |
| NCI_CDISC-GLOSS | 0 | 0.0% | 4 |
| HCDT | 0 | 0.0% | 1 |
| HL7 | 1 | 1.96% | 51 |
| MESH | 1,680 | 1.72% | 97,674 |
| CHEMBL.COMPOUND | 12,835 | 0.71% | 1,795,373 |
| MTHSPL | 2,081 | 66.51% | 3,129 |
| NCI_DCP | 11 | 27.5% | 40 |
| USP | 118 | 89.39% | 132 |
| TTD | 106 | 21.33% | 497 |
| SNOMEDCT_VET | 0 | 0.0% | 5 |
| NCI_FDA | 533 | 49.03% | 1,087 |
| NANDA-I | 0 | 0.0% | 3 |
| NCI_DTP | 11 | 39.29% | 28 |
| BTO | 0 | 0.0% | 1 |
| NCI_NICHD | 0 | 0.0% | 2 |
| NCI_CRCH | 49 | 24.75% | 198 |
| EFO | 26 | 34.67% | 75 |
| NCI_NCI-GLOSS | 56 | 34.36% | 163 |
| LNC | 988 | 44.71% | 2,210 |
| GTPI | 58 | 2.4% | 2,412 |
| VANDF | 213 | 33.86% | 629 |
| DRUGBANK | 147 | 39.95% | 368 |
| NCI_CDISC | 60 | 31.91% | 188 |
| NCI_NCPDP | 168 | 46.93% | 358 |
| CHEMBL.TARGET | 20 | 80.0% | 25 |
| CUI | 10,359 | 7.52% | 137,764 |
| NDDF | 492 | 20.26% | 2,429 |
| HCPCS | 162 | 100.0% | 162 |
| NCIT | 0 | 0.0% | 3 |
| RXNORM | 3,454 | 99.97% | 3,455 |
| OBO | 0 | 0.0% | 618 |
| PDQ | 10 | 18.18% | 55 |
| total | 42,929 | 1.98% | 2,164,168 |

Within disease

| prefix | has OMOP ids | percent | total |
|---|---|---|---|
| OMIM | 0 | 0.0% | 3 |
| DSM5 | 156 | 21.67% | 720 |
| CHV | 6,810 | 73.13% | 9,312 |
| NCI_CDISC-GLOSS | 1 | 100.0% | 1 |
| HL7 | 1 | 100.0% | 1 |
| MESH | 4,163 | 69.15% | 6,020 |
| MEDDRA | 22,274 | 100.0% | 22,274 |
| DOID | 8,656 | 73.21% | 11,823 |
| MTHICPC2EAE | 7 | 53.85% | 13 |
| SNOMEDCT_VET | 710 | 32.98% | 2,153 |
| NCI_CTCAE_3 | 1 | 100.0% | 1 |
| NCI_FDA | 151 | 73.3% | 206 |
| NANDA-I | 193 | 73.11% | 264 |
| NCI_NICHD | 2,002 | 71.81% | 2,788 |
| SO | 61 | 2.94% | 2,073 |
| MONDO | 16,415 | 75.96% | 21,609 |
| EFO | 2,905 | 84.47% | 3,439 |
| NCI_NCI-GLOSS | 729 | 57.45% | 1,269 |
| LNC | 755 | 48.06% | 1,571 |
| NCI_CDISC | 85 | 20.94% | 406 |
| NCI_CTEP-SDC | 97 | 39.92% | 243 |
| CCS | 48 | 28.24% | 170 |
| NCI_RENI | 0 | 0.0% | 164 |
| HCPT | 0 | 0.0% | 1 |
| NCI_CTCAE | 378 | 99.74% | 379 |
| CUI | 81,703 | 60.5% | 135,050 |
| ICD10AE | 338 | 55.87% | 605 |
| PDQ | 16 | 23.88% | 67 |
| Orphanet | 7,712 | 80.94% | 9,528 |
| NCIT | 7,887 | 45.96% | 17,161 |
| NCI_GAIA | 4 | 100.0% | 4 |
| NCI_CTRP | 504 | 17.6% | 2,863 |
| NCI_KEGG | 1 | 100.0% | 1 |
| MTHICPC2ICD10AE | 1 | 0.84% | 119 |
| total | 164,764 | 65.3% | 252,301 |

Within drug

| prefix | has OMOP ids | percent | total |
|---|---|---|---|
| CHEBI | 151 | 53.93% | 280 |
| GS | 196 | 0.5% | 38,843 |
| CHV | 4,027 | 49.44% | 8,146 |
| NCI_CDISC-GLOSS | 1 | 16.67% | 6 |
| NCI_DICOM | 0 | 0.0% | 1 |
| HL7 | 0 | 0.0% | 2 |
| MESH | 2,949 | 5.51% | 53,493 |
| MTHSPL | 19,727 | 16.58% | 118,987 |
| NCI_DCP | 379 | 54.3% | 698 |
| USP | 1,824 | 57.94% | 3,148 |
| SNOMEDCT_VET | 35 | 48.61% | 72 |
| NCI_BRIDG | 0 | 0.0% | 1 |
| NCI_FDA | 4,863 | 47.2% | 10,302 |
| NANDA-I | 0 | 0.0% | 1 |
| NCI_DTP | 176 | 32.18% | 547 |
| NCI_NICHD | 7 | 43.75% | 16 |
| NCI_CRCH | 70 | 70.0% | 100 |
| EFO | 1 | 50.0% | 2 |
| NCI_NCI-GLOSS | 806 | 50.63% | 1,592 |
| LNC | 1,651 | 59.15% | 2,791 |
| VANDF | 3,186 | 19.89% | 16,022 |
| DRUGBANK | 2,646 | 57.42% | 4,608 |
| NCI_CDISC | 4 | 57.14% | 7 |
| HCPT | 0 | 0.0% | 5 |
| CUI | 118,081 | 41.66% | 283,449 |
| NDDF | 4,936 | 23.16% | 21,314 |
| HCPCS | 873 | 100.0% | 873 |
| NCIT | 1 | 14.29% | 7 |
| RXNORM | 102,965 | 96.27% | 106,949 |
| NCI_CTRP | 7 | 58.33% | 12 |
| PDQ | 560 | 42.26% | 1,325 |
| total | 270,122 | 40.1% | 673,599 |

Within phenotypic_feature

| prefix | has OMOP ids | percent | total |
|---|---|---|---|
| OMIM | 11,261 | 10.79% | 104,358 |
| DSM5 | 2 | 100.0% | 2 |
| CHV | 940 | 67.97% | 1,383 |
| AQTLTrait | 0 | 0.0% | 1 |
| NCI_CDISC-GLOSS | 1 | 50.0% | 2 |
| HL7 | 0 | 0.0% | 1 |
| HP | 4,699 | 32.78% | 14,336 |
| MESH | 1 | 100.0% | 1 |
| MEDDRA | 2,435 | 100.0% | 2,435 |
| MTHICPC2EAE | 2 | 28.57% | 7 |
| SNOMEDCT_VET | 83 | 61.94% | 134 |
| NCI_FDA | 49 | 66.22% | 74 |
| NANDA-I | 66 | 51.56% | 128 |
| NCI_NICHD | 134 | 89.33% | 150 |
| UPHENO | 1 | 100.0% | 1 |
| MONDO | 3 | 100.0% | 3 |
| MP | 2 | 100.0% | 2 |
| SYMP | 568 | 60.17% | 944 |
| EFO | 104 | 69.33% | 150 |
| NCI_NCI-GLOSS | 42 | 89.36% | 47 |
| LNC | 135 | 53.36% | 253 |
| NBO | 124 | 41.89% | 296 |
| NCI_CDISC | 7 | 46.67% | 15 |
| CCS | 6 | 100.0% | 6 |
| CHEMBL.TARGET | 2 | 100.0% | 2 |
| NCI_CTCAE | 107 | 100.0% | 107 |
| CUI | 3,933 | 31.77% | 12,379 |
| ICD10AE | 11 | 84.62% | 13 |
| NCIT | 70 | 3.77% | 1,858 |
| NCI_GAIA | 1 | 100.0% | 1 |
| NCI_CTRP | 1 | 100.0% | 1 |
| total | 24,790 | 17.82% | 139,090 |

chunyuma commented 4 years ago

@dkoslicki, the local COHD database was updated to include the mapping of all KG1 and KG2 disease, phenotype, chemical_substance, and drug curies. The script COHDIndex.py and overlay_clinical_info.py were updated as well.

The updated scripts passed all non-skipped (including the ones marked with slow) tests in test_ARAX_overlay.py and test_ARAX_workflows.py.

Here are some results for testing the COHDIndex in test_ARAX_overlay.

ubuntu@ip-172-31-10-28:~/work/RTX/code/ARAX/test$ pytest -v test_ARAX_overlay.py --runslow
============================================================================ test session starts ============================================================================
platform linux -- Python 3.7.7, pytest-5.4.3, py-1.8.1, pluggy-0.13.1 -- /home/ubuntu/miniconda3/envs/RTX_env/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/work/RTX/code/ARAX/test
collected 19 items                                                                                                                                                          

test_ARAX_overlay.py::test_jaccard PASSED                                                                                                                             [  5%]
test_ARAX_overlay.py::test_add_node_pmids PASSED                                                                                                                      [ 10%]
test_ARAX_overlay.py::test_compute_ngd_virtual PASSED                                                                                                                 [ 15%]
test_ARAX_overlay.py::test_compute_ngd_attribute PASSED                                                                                                               [ 21%]
test_ARAX_overlay.py::test_FET_ex1 PASSED                                                                                                                             [ 26%]
test_ARAX_overlay.py::test_FET_ex2 PASSED                                                                                                                             [ 31%]
test_ARAX_overlay.py::test_paired_concept_frequency_virtual PASSED                                                                                                    [ 36%]
test_ARAX_overlay.py::test_paired_concept_frequency_attribute PASSED                                                                                                  [ 42%]
test_ARAX_overlay.py::test_observed_expected_ratio_virtual PASSED                                                                                                     [ 47%]
test_ARAX_overlay.py::test_observed_expected_ratio_attribute PASSED                                                                                                   [ 52%]
test_ARAX_overlay.py::test_chi_square_virtual PASSED                                                                                                                  [ 57%]
test_ARAX_overlay.py::test_chi_square_attribute PASSED                                                                                                                [ 63%]
test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual PASSED                                                                                                 [ 68%]
test_ARAX_overlay.py::test_predict_drug_treats_disease_attribute PASSED                                                                                               [ 73%]
test_ARAX_overlay.py::test_issue_832 PASSED                                                                                                                           [ 78%]
test_ARAX_overlay.py::test_issue_832_non_drug PASSED                                                                                                                  [ 84%]
test_ARAX_overlay.py::test_issue_840 PASSED                                                                                                                           [ 89%]
test_ARAX_overlay.py::test_issue_840_non_drug PASSED                                                                                                                  [ 94%]
test_ARAX_overlay.py::test_issue_892 PASSED                                                                                                                           [100%]

============================================================================= warnings summary ==============================================================================
test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual
  /home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
    warnings.warn(msg, category=FutureWarning)

test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual
  /home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
    return f(*args, **kwds)

test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual
test_ARAX_overlay.py::test_predict_drug_treats_disease_attribute
test_ARAX_overlay.py::test_issue_832
test_ARAX_overlay.py::test_issue_832_non_drug
test_ARAX_overlay.py::test_issue_892
  /home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.22.2.post1 when using version 0.22. This might lead to breaking code or invalid results. Use at your own risk.
    UserWarning)

test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual
test_ARAX_overlay.py::test_predict_drug_treats_disease_attribute
test_ARAX_overlay.py::test_issue_832
test_ARAX_overlay.py::test_issue_832_non_drug
test_ARAX_overlay.py::test_issue_892
  /home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.22.2.post1 when using version 0.22. This might lead to breaking code or invalid results. Use at your own risk.
    UserWarning)

test_ARAX_overlay.py::test_issue_892
  /home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/site-packages/pymysql/cursors.py:170: Warning: (1300, "Invalid utf8mb4 character string: '80037D'")
    result = self._query(query)

-- Docs: https://docs.pytest.org/en/latest/warnings.html
=============================================================== 19 passed, 13 warnings in 1539.35s (0:25:39) ================================================================

Please check it in the issue875 branch. If there are no other problems, I think it's ready to be merged into master and this issue can be closed.

dkoslicki commented 4 years ago

That’s awesome @chunyuma! Do you mind doing just a bit of spot checking as well:

After that, a merge to master is fine with me!

chunyuma commented 4 years ago

@dkoslicki, here are some results from the investigation.

As I said on Slack, it is hard to compare the local version and the API version accurately, for two reasons:

  1. The local version has more mappings than the API version because we also consider synonyms. This can cause the local version to call get_paired_concept_freq, chi_square, observed_expected_ratio, etc. more times.
  2. The API response time is unstable, probably due to caching (the cache remembers the result for a certain time interval when I call the API for the same CURIE multiple times).

So here is the method I used for the comparison. I took six sets of disease-drug pairs, two for each of get_paired_concept_freq, chi_square and observed_expected_ratio. For each set, I compared the average running time per disease-drug pair and the number of non-default results between the local version and the API version.
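
A comparison of this kind could be scripted roughly as follows. This is a generic sketch, not the script actually used: the `benchmark` helper is hypothetical, `func` stands in for either the local or the API lookup, and treating `None` as the "default" result is an assumption.

```python
import time


def benchmark(func, pairs, default=None):
    """Call func(a, b) for each pair in a list of pairs.

    Returns (average seconds per pair, number of non-default results).
    """
    non_default = 0
    start = time.perf_counter()
    for a, b in pairs:
        if func(a, b) != default:
            non_default += 1
    elapsed = time.perf_counter() - start
    average = elapsed / len(pairs) if pairs else 0.0
    return average, non_default
```

Running the same list of pairs through both versions gives the two columns of each table; as noted above, API timings may still vary between runs because of caching.
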

Test for get_paired_concept_freq

MONDO:0002049 - chemical_substance (Total number of pairs: 165)

| | local version | API version |
|---|---|---|
| Average time | 0.051s | 0.95s |
| # non-default results | 104 | 21 |

MONDO:0019783 - chemical_substance (Total number of pairs: 69)

| | local version | API version |
|---|---|---|
| Average time | 0.076s | 0.4s |
| # non-default results | 35 | 0 |

Test for get_obs_exp_ratio

MONDO:0000190 - chemical_substance (Total number of pairs: 17)

| | local version | API version |
|---|---|---|
| Average time | 0.215s | 1.032s |
| # non-default results | 17 | 1 |

MONDO:0005324 - chemical_substance (Total number of pairs: 27)

| | local version | API version |
|---|---|---|
| Average time | 0.1s | 0.2s |
| # non-default results | 16 | 2 |

Test for get_chi_square

MONDO:0005139 - chemical_substance (Total number of pairs: 110)

| | local version | API version |
|---|---|---|
| Average time | 0.1715s | 0.39s |
| # non-default results | 72 | 11 |

MONDO:0006652 - chemical_substance (Total number of pairs: 190)

| | local version | API version |
|---|---|---|
| Average time | 0.11s | 0.3s |
| # non-default results | 119 | 24 |

chunyuma commented 4 years ago

The work for this issue has been merged into the master branch.