Closed dkoslicki closed 4 years ago
Note, there's also an RDF triple store dump, so there's an off chance that we could get OMOP concepts into KG2 directly and be able to skip the OXO mapping part, but I would need @saramsey to comment about if this is feasible (since we would need to connect the diseases, drugs, etc. to the OMOP nodes).
From Casey
Vincent has also done a bunch of work converting the COHD data into RDF modeled in Biolink. If you're interested, you can query that at https://trek.semanticscience.org/ or hit the download link to get the RDF dumps.
@dkoslicki, I think I have time to do it. I have already gone through the scripts under the Overlay folder, so I know how overlay_clinical_info.py works. One question is where we should store this database.
"script-ify" everything so that when KG2 and/or COHD is updated, the backend database can be updated as well (and so devs (like me) can have local copies of the database as needed as well).
I'm a little bit confused about this. Do we want to create a script that automatically pulls down the data from KG2 and/or COHD when they are updated and then integrates it into the local database? When COHD is updated, will we have an updated dump file of the entire database?
@chunyuma Excellent! Thanks for volunteering!!
Re: “script-ify” I basically mean: create scripts that will automate every step of the process. That way, it can be deployed anywhere as needed. Similar to how python3 KGNodeIndex.py -b auto-creates the KGNode index, and can be run on local dev machines, or prod servers as needed.
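As a rough illustration of the "script-ify" idea, here is a minimal sketch of a build script with a -b flag in the spirit of KGNodeIndex.py -b. The table name, the three-column tab-separated layout, and the default file names are all assumptions made for this example, not the actual COHD dump format or the real build script.

```python
import argparse
import csv
import sqlite3


def build_database(tsv_path, db_path):
    """Rebuild a local SQLite table from a tab-separated dump.

    The three-column layout (concept_id_1, concept_id_2, concept_count)
    is a hypothetical stand-in for the real dump format.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS paired_counts")
    conn.execute(
        "CREATE TABLE paired_counts ("
        "concept_id_1 INTEGER, concept_id_2 INTEGER, concept_count INTEGER)"
    )
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row, if any
        conn.executemany(
            "INSERT INTO paired_counts VALUES (?, ?, ?)",
            (row[:3] for row in reader),
        )
    conn.commit()
    conn.close()


def main(argv=None):
    parser = argparse.ArgumentParser(description="Build a local COHD-style database")
    parser.add_argument("-b", "--build", action="store_true", help="(re)build the database")
    parser.add_argument("--dump", default="paired_concept_counts.tsv", help="hypothetical dump file")
    parser.add_argument("--db", default="cohd_local.sqlite", help="hypothetical output database")
    args = parser.parse_args(argv)
    if args.build:
        build_database(args.dump, args.db)
```

Calling main(["-b"]) from an entry point would rebuild the database from scratch, so the same script could be run on a dev machine or a production server whenever the upstream dump changes.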
As for where the database is to be stored, I would say: get the scripts/classes/etc. up and running on your own instance, and then after everything is checked that it works, we can work with Eric to get it created on the production server as well. I’d suggest doing this in a different branch so we can seamlessly integrate into master after you’ve got everything put together and confirmed it works.
Minor details:
COHD_local.get_paired_concept_freq(omop1, omop2, 3)
will know to look there for the data) similar to how KGNodeIndex creates a MySQL database that exists, but isn’t checked into github.
@dkoslicki, it seems like the data dump can't be downloaded by code. I suspect this is due to pCloud's encryption, which only allows users to download the data from a browser, even though the data can be shared with other people. I tried curl, wget, and the requests module in Python. None of them could download the data. We can write a script to auto-download the data, but the host must allow programmatic downloads.
@chunyuma Given the data can't be downloaded with wget or the like (without some workarounds: if you look at the source of the data dump, the link is provided on line 1013, but would need to be parsed properly):
"downloadlink": "https:\/\/p-def7.pcloud.com\/cBZE9ntPmZdY1gS2ZZZv0rJ37Z2ZZGoFZkZELjcdpZupZapZDFZgFZAFZcJZ0pZspZxFZQpZu5ZVkZ57ZvFZ3ibtkZC5SrMxjQv8pIjBPE02dHTkR05OyX\/cohd-v2.tar.gz",
Let's not worry about this for now, as later we can work with Casey Ta to see if he can place it somewhere that we can download more easily.
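For reference, the "would need to be parsed properly" step could look roughly like the sketch below. It assumes the page source contains a JSON-style "downloadlink" field as quoted above; the function name is made up for this example, and the extracted link may well be short-lived or refuse non-browser clients.

```python
import json
import re


def extract_download_link(page_source):
    """Pull the JSON-escaped "downloadlink" value out of the page source."""
    # Capture the quoted JSON string that follows the "downloadlink" key.
    match = re.search(r'"downloadlink"\s*:\s*("(?:[^"\\]|\\.)*")', page_source)
    if match is None:
        raise ValueError("no downloadlink field found in page source")
    # json.loads turns the escaped "\/" sequences back into plain "/".
    return json.loads(match.group(1))
```

Decoding with json.loads rather than a manual string replace handles any other JSON escapes that might appear in the value.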
This link no longer seems to work:
wget https://p-def7.pcloud.com/cBZE9ntPmZdY1gS2ZZZv0rJ37Z2ZZGoFZkZELjcdpZupZapZDFZgFZAFZcJZ0pZspZxFZQpZu5ZVkZ57ZvFZ3ibtkZC5SrMxjQv8pIjBPE02dHTkR05OyX/cohd-v2.tar.gz
--2020-07-08 16:06:53-- https://p-def7.pcloud.com/cBZE9ntPmZdY1gS2ZZZv0rJ37Z2ZZGoFZkZELjcdpZupZapZDFZgFZAFZcJZ0pZspZxFZQpZu5ZVkZ57ZvFZ3ibtkZC5SrMxjQv8pIjBPE02dHTkR05OyX/cohd-v2.tar.gz
Resolving p-def7.pcloud.com (p-def7.pcloud.com)... 74.120.9.15
Connecting to p-def7.pcloud.com (p-def7.pcloud.com)|74.120.9.15|:443... connected.
HTTP request sent, awaiting response... 410 Gone
2020-07-08 16:06:53 ERROR 410: Gone.
Odd. I guess after we get a local system up and running, we work with Casey to make the data dump persistent & accessible
@saramsey, it works with this link
https://p-def7.pcloud.com/cBZE9ntPmZdY1gS2ZZZldlF37Z2ZZF00ZkZELjcdpZupZapZDFZgFZAFZcJZ0pZspZxFZQpZu5ZVkZ57ZvFZ3ibtkZHDNaREulz4yiRznCQ1HJb05lPECy/cohd-v2.tar.gz
But it can only be downloaded via a browser, not with wget.
Update: The local COHD database was established and stored on arax.rtx.ai, and the new script COHDIndex.py can replace all functions of the original script QueryCOHD.py used in overlay_clinical_info.py. I carefully compared the results from COHDIndex and QueryCOHD. They return the same results, and COHDIndex is significantly faster than QueryCOHD in overlay(action=overlay_clinical_info,...).
I also modified overlay_clinical_info.py in the issue875 branch to make it compatible with COHDIndex.py. COHDIndex passed all tests in test_ARAX_overlay.py. Here is the result:
(RTX_env) ubuntu@ip-172-31-10-28:~/work/RTX/code/ARAX/test$ pytest -v test_ARAX_overlay.py
===================================================================== test session starts ======================================================================
platform linux -- Python 3.7.7, pytest-5.4.3, py-1.8.1, pluggy-0.13.1 -- /home/ubuntu/miniconda3/envs/RTX_env/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/work/RTX/code/ARAX/test
collected 18 items
test_ARAX_overlay.py::test_jaccard PASSED [ 5%]
test_ARAX_overlay.py::test_add_node_pmids PASSED [ 11%]
test_ARAX_overlay.py::test_compute_ngd_virtual PASSED [ 16%]
test_ARAX_overlay.py::test_compute_ngd_attribute PASSED [ 22%]
test_ARAX_overlay.py::test_FET_ex1 PASSED [ 27%]
test_ARAX_overlay.py::test_FET_ex2 PASSED [ 33%]
test_ARAX_overlay.py::test_paired_concept_frequency_virtual PASSED [ 38%]
test_ARAX_overlay.py::test_paired_concept_frequency_attribute PASSED [ 44%]
test_ARAX_overlay.py::test_observed_expected_ratio_virtual PASSED [ 50%]
test_ARAX_overlay.py::test_observed_expected_ratio_attribute PASSED [ 55%]
test_ARAX_overlay.py::test_chi_square_virtual PASSED [ 61%]
test_ARAX_overlay.py::test_chi_square_attribute PASSED [ 66%]
test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual PASSED [ 72%]
test_ARAX_overlay.py::test_predict_drug_treats_disease_attribute PASSED [ 77%]
test_ARAX_overlay.py::test_issue_832 PASSED [ 83%]
test_ARAX_overlay.py::test_issue_832_non_drug PASSED [ 88%]
test_ARAX_overlay.py::test_issue_840 PASSED [ 94%]
test_ARAX_overlay.py::test_issue_840_non_drug PASSED [100%]
Currently, COHDIndex only works for KG1.
The next steps that need to be done:
[x] implement all functions of QueryCOHD.py in COHDIndex
[x] finish the mapping of KG2 disease, phenotype, chemical_substance, and drug to (possibly a list) of OMOP concept ids and store this in the database
Once these steps are done, it can be merged into the master branch.
This link goes to the download page at the Athena portal for downloading mappings between OMOP and other biomedical vocabularies like ICD10, MeSH, SNOMED, LOINC, RXNORM, etc.
So @chunyuma, the plan is: use node_synonymizer.py -l <curie> -k KG2 to look up all synonyms that map to the biomedical vocabularies that map to OMOP concept IDs in the link Steve gave above. Use this to populate the CURIE to OMOP mappings in your sqlite database.
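The plan above can be sketched as follows. This is only an illustration under stated assumptions: the synonym lists are presumed to have been collected beforehand (for example from node_synonymizer.py output), the vocabulary lookup is presumed to have been loaded from the Athena download, and the function names, table name, and sample identifiers are all invented for the example.

```python
import sqlite3
from collections import defaultdict


def map_curies_to_omop(synonyms_by_curie, omop_by_synonym):
    """Collect, for each curie, the OMOP ids reachable through any of its synonyms."""
    mapping = defaultdict(set)
    for curie, synonyms in synonyms_by_curie.items():
        for synonym in synonyms:
            if synonym in omop_by_synonym:
                mapping[curie].update(omop_by_synonym[synonym])
    # Curies with no matching synonym are simply absent from the result.
    return {curie: sorted(ids) for curie, ids in mapping.items()}


def store_mapping(conn, mapping):
    """Write the curie -> OMOP id pairs into a (hypothetical) sqlite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS curie_to_omop ("
        "curie TEXT, omop_id INTEGER, PRIMARY KEY (curie, omop_id))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO curie_to_omop VALUES (?, ?)",
        [(curie, omop_id) for curie, ids in mapping.items() for omop_id in ids],
    )
    conn.commit()


# Made-up identifiers, for illustration only.
example_synonyms = {
    "EX:disease1": ["VOCAB_A:001", "VOCAB_B:9"],
    "EX:drug1": ["VOCAB_C:77"],
}
example_vocab = {"VOCAB_A:001": [111], "VOCAB_B:9": [111, 222]}

conn = sqlite3.connect(":memory:")
store_mapping(conn, map_curies_to_omop(example_synonyms, example_vocab))
```

Storing one row per (curie, OMOP id) pair keeps the "possibly a list" case simple: a curie with several OMOP ids just has several rows.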
Do we need the CDT (Current Dental Terminology (ADA)) data and the MedDRA (Medical Dictionary for Regulatory Activities (MSSO)) data? From the NodeSynonymizer results, we have 156982 curies which have a MedDRA synonym and 6 curies which have a CDT synonym. However, these data from the Athena portal need a license to download.
@chunyuma do these CDT and MedDRA curies correspond to drugs, chemical_substances, and/or phenotypic_features? If so, then the 156K MedDRA curies seem the most important to get (missing 6 curies from CDT doesn't seem like too big of a deal). Perhaps @saramsey already has a license for CDT since, if it's in the NodeSynonymizer, it must be in KG2 somewhere...
Of these 156K MedDRA curies, 2273 are drug, 1395 are chemical_substance, 119382 are disease, and 33932 are phenotypic_feature.
Of those 6 CDT curies, 2 are drug, 3 are chemical_substance, and 1 is phenotypic_feature.
I actually sent an email to OHDSI (Observational Health Data Sciences and Informatics), and they told me that for MedDRA, we as an academic institution might get a free license from the MSSO.
@chunyuma great that we might get a free license! Check with @saramsey if we already do or do not have a license, and if not, let us (me and/or Steve) know how we can help with getting the license for MedDRA
@dkoslicki, it seems like we should already have the license for MedDRA, because I saw that @saramsey once posted a related issue, #891, about MedDRA. We can check with @saramsey in the AHM meeting on Wednesday. If we don't have it, I might need help from you or Steve to get this license.
I'm currently running the script to map the curies to OMOP ids based on the data I can get from the Athena portal. This might take a few days. Once we have the MedDRA data, we can simply add it to the existing data.
Update: Since Steve told me that we don't have a license to download the MedDRA data from the Athena portal, I'm trying to contact the MedDRA MSSO Help Desk to see if we can get a free, special license.
Based on the other data that I can download from the Athena portal, I mapped all KG1 and KG2 disease, phenotype, chemical_substance, and drug nodes to (possibly a list) of OMOP concept ids. Here is the summary:
KG1: a total of 33,888 disease, phenotype, and chemical_substance nodes, of which 18,981 were found to have at least one OMOP id via the COHD API, while 15,253 were found via the data downloaded from the Athena portal (excluding MedDRA data).
KG2: a total of 3,229,158 disease, phenotype, chemical_substance, and drug nodes, of which 443,933 were found to have at least one OMOP id via the data downloaded from the Athena portal (excluding MedDRA data).
Thanks for the update @chunyuma! Let us know if you run into any problems with the MedDRA license. Now that you have mapped many of the nodes to OMOP IDs, if they don't want to give you a license, at least we could hit COHD for the remaining unmapped curies (which are now far fewer due to your work). 156K calls is definitely better than millions.
@dkoslicki, for KG2, only 443,933 nodes were mapped to at least one OMOP id. This means that around 2.7 million curies are still unmapped (we have a total of 3,229,158 disease, phenotype, chemical_substance, and drug nodes in KG2). So I don't think it is practical to call the COHD API for these remaining 2.7 million. I also suspect that many of these unmapped curies have no OMOP id at all, even if we call the COHD API.
Also, if we include the synonyms of these remaining curies when we call the COHD API, the number of calls might be much larger than 2.7 million.
@dkoslicki, actually, not all of those 156K MedDRA-associated curies lack OMOP ids.
Say a curie has multiple synonyms from NodeSynonymizer, including Mesh:xxxx, ATC:xxx, ICD10CM:xxx, MedDRA:xxx, etc., and Mesh:xxxx, ATC:xxx, and ICD10CM:xxx each have a corresponding OMOP id. Then this curie would already have three OMOP ids. Even if we can download the MedDRA data, we might just be adding more OMOP ids to the curie's existing OMOP list.
Update: After including the MedDRA data, the number of nodes with at least one OMOP id is 15,842 in KG1 and 480,123 in KG2.
@dkoslicki, if we don't need to hit COHD API for the remaining unmapped curies, then I can start to build the database.
Update: After hitting the OxO API, the number of nodes with at least one OMOP id increases to 21,148 in KG1 and to 502,605 in KG2.
Awesome, thanks @chunyuma! For each of the relevant node types (e.g. chemical_substance, disease, etc.), what is that in terms of percent? I.e.
disease:
X% of nodes with at least one OMOP id
@dkoslicki, here are some statistics for KG1 and KG2.
KG1 nodes by type:
type | number | percent |
chemical_substance | 2,226 | 6.57% |
disease | 19,573 | 57.76% |
phenotypic_feature | 12,089 | 35.67% |
total | 33,888 | 100% |
KG1 chemical_substance nodes by curie prefix:
prefix | has OMOP ids | percent | total |
CHEMBL.COMPOUND | 2,104 | 94.52% | 2,226 |
total | 2,104 | 94.52% | 2,226 |
KG1 disease nodes by curie prefix:
prefix | has OMOP ids | percent | total |
DOID | 8,132 | 73.31% | 11,092 |
MONDO | 1 | 100% | 1 |
OMIM | 6,597 | 77.79% | 8,480 |
total | 14,730 | 75.26% | 19,573 |
KG1 phenotypic_feature nodes by curie prefix:
prefix | has OMOP ids | percent | total |
AQTLTrait | 0 | 0% | 1 |
HP | 4,314 | 35.69% | 12,088 |
total | 4,314 | 35.69% | 12,089 |
KG2 nodes by type:
type | number | percent |
chemical_substance | 2,164,168 | 67.02% |
disease | 252,301 | 7.81% |
drug | 673,599 | 20.86% |
phenotypic_feature | 139,090 | 4.31% |
total | 3,229,158 | 100% |
KG2 chemical_substance nodes by curie prefix:
prefix | has OMOP ids | percent | total |
CHEBI | 8,575 | 7.61% | 112,632 |
GS | 7 | 1.31% | 536 |
FOODON | 1 | 7.69% | 13 |
CHV | 708 | 38.33% | 1,847 |
NCI_DICOM | 0 | 0.0% | 1 |
NCI_CDISC-GLOSS | 0 | 0.0% | 4 |
HCDT | 0 | 0.0% | 1 |
HL7 | 1 | 1.96% | 51 |
MESH | 1,680 | 1.72% | 97,674 |
CHEMBL.COMPOUND | 12,835 | 0.71% | 1,795,373 |
MTHSPL | 2,081 | 66.51% | 3,129 |
NCI_DCP | 11 | 27.5% | 40 |
USP | 118 | 89.39% | 132 |
TTD | 106 | 21.33% | 497 |
SNOMEDCT_VET | 0 | 0.0% | 5 |
NCI_FDA | 533 | 49.03% | 1,087 |
NANDA-I | 0 | 0.0% | 3 |
NCI_DTP | 11 | 39.29% | 28 |
BTO | 0 | 0.0% | 1 |
NCI_NICHD | 0 | 0.0% | 2 |
NCI_CRCH | 49 | 24.75% | 198 |
EFO | 26 | 34.67% | 75 |
NCI_NCI-GLOSS | 56 | 34.36% | 163 |
LNC | 988 | 44.71% | 2,210 |
GTPI | 58 | 2.4% | 2,412 |
VANDF | 213 | 33.86% | 629 |
DRUGBANK | 147 | 39.95% | 368 |
NCI_CDISC | 60 | 31.91% | 188 |
NCI_NCPDP | 168 | 46.93% | 358 |
CHEMBL.TARGET | 20 | 80.0% | 25 |
CUI | 10,359 | 7.52% | 137,764 |
NDDF | 492 | 20.26% | 2,429 |
HCPCS | 162 | 100.0% | 162 |
NCIT | 0 | 0.0% | 3 |
RXNORM | 3,454 | 99.97% | 3,455 |
OBO | 0 | 0.0% | 618 |
PDQ | 10 | 18.18% | 55 |
total | 42,929 | 1.98% | 2,164,168 |
KG2 disease nodes by curie prefix:
prefix | has OMOP ids | percent | total |
OMIM | 0 | 0.0% | 3 |
DSM5 | 156 | 21.67% | 720 |
CHV | 6,810 | 73.13% | 9,312 |
NCI_CDISC-GLOSS | 1 | 100.0% | 1 |
HL7 | 1 | 100.0% | 1 |
MESH | 4,163 | 69.15% | 6,020 |
MEDDRA | 22,274 | 100.0% | 22,274 |
DOID | 8,656 | 73.21% | 11,823 |
MTHICPC2EAE | 7 | 53.85% | 13 |
SNOMEDCT_VET | 710 | 32.98% | 2,153 |
NCI_CTCAE_3 | 1 | 100.0% | 1 |
NCI_FDA | 151 | 73.3% | 206 |
NANDA-I | 193 | 73.11% | 264 |
NCI_NICHD | 2,002 | 71.81% | 2,788 |
SO | 61 | 2.94% | 2,073 |
MONDO | 16,415 | 75.96% | 21,609 |
EFO | 2,905 | 84.47% | 3,439 |
NCI_NCI-GLOSS | 729 | 57.45% | 1,269 |
LNC | 755 | 48.06% | 1,571 |
NCI_CDISC | 85 | 20.94% | 406 |
NCI_CTEP-SDC | 97 | 39.92% | 243 |
CCS | 48 | 28.24% | 170 |
NCI_RENI | 0 | 0.0% | 164 |
HCPT | 0 | 0.0% | 1 |
NCI_CTCAE | 378 | 99.74% | 379 |
CUI | 81,703 | 60.5% | 135,050 |
ICD10AE | 338 | 55.87% | 605 |
PDQ | 16 | 23.88% | 67 |
Orphanet | 7,712 | 80.94% | 9,528 |
NCIT | 7,887 | 45.96% | 17,161 |
NCI_GAIA | 4 | 100.0% | 4 |
NCI_CTRP | 504 | 17.6% | 2,863 |
NCI_KEGG | 1 | 100.0% | 1 |
MTHICPC2ICD10AE | 1 | 0.84% | 119 |
total | 164,764 | 65.3% | 252,301 |
KG2 drug nodes by curie prefix:
prefix | has OMOP ids | percent | total |
CHEBI | 151 | 53.93% | 280 |
GS | 196 | 0.5% | 38,843 |
CHV | 4,027 | 49.44% | 8,146 |
NCI_CDISC-GLOSS | 1 | 16.67% | 6 |
NCI_DICOM | 0 | 0.0% | 1 |
HL7 | 0 | 0.0% | 2 |
MESH | 2,949 | 5.51% | 53,493 |
MTHSPL | 19,727 | 16.58% | 118,987 |
NCI_DCP | 379 | 54.3% | 698 |
USP | 1,824 | 57.94% | 3,148 |
SNOMEDCT_VET | 35 | 48.61% | 72 |
NCI_BRIDG | 0 | 0.0% | 1 |
NCI_FDA | 4,863 | 47.2% | 10,302 |
NANDA-I | 0 | 0.0% | 1 |
NCI_DTP | 176 | 32.18% | 547 |
NCI_NICHD | 7 | 43.75% | 16 |
NCI_CRCH | 70 | 70.0% | 100 |
EFO | 1 | 50.0% | 2 |
NCI_NCI-GLOSS | 806 | 50.63% | 1,592 |
LNC | 1,651 | 59.15% | 2,791 |
VANDF | 3,186 | 19.89% | 16,022 |
DRUGBANK | 2,646 | 57.42% | 4,608 |
NCI_CDISC | 4 | 57.14% | 7 |
HCPT | 0 | 0.0% | 5 |
CUI | 118,081 | 41.66% | 283,449 |
NDDF | 4,936 | 23.16% | 21,314 |
HCPCS | 873 | 100.0% | 873 |
NCIT | 1 | 14.29% | 7 |
RXNORM | 102,965 | 96.27% | 106,949 |
NCI_CTRP | 7 | 58.33% | 12 |
PDQ | 560 | 42.26% | 1,325 |
total | 270,122 | 40.1% | 673,599 |
KG2 phenotypic_feature nodes by curie prefix:
prefix | has OMOP ids | percent | total |
OMIM | 11,261 | 10.79% | 104,358 |
DSM5 | 2 | 100.0% | 2 |
CHV | 940 | 67.97% | 1,383 |
AQTLTrait | 0 | 0.0% | 1 |
NCI_CDISC-GLOSS | 1 | 50.0% | 2 |
HL7 | 0 | 0.0% | 1 |
HP | 4,699 | 32.78% | 14,336 |
MESH | 1 | 100.0% | 1 |
MEDDRA | 2,435 | 100.0% | 2,435 |
MTHICPC2EAE | 2 | 28.57% | 7 |
SNOMEDCT_VET | 83 | 61.94% | 134 |
NCI_FDA | 49 | 66.22% | 74 |
NANDA-I | 66 | 51.56% | 128 |
NCI_NICHD | 134 | 89.33% | 150 |
UPHENO | 1 | 100.0% | 1 |
MONDO | 3 | 100.0% | 3 |
MP | 2 | 100.0% | 2 |
SYMP | 568 | 60.17% | 944 |
EFO | 104 | 69.33% | 150 |
NCI_NCI-GLOSS | 42 | 89.36% | 47 |
LNC | 135 | 53.36% | 253 |
NBO | 124 | 41.89% | 296 |
NCI_CDISC | 7 | 46.67% | 15 |
CCS | 6 | 100.0% | 6 |
CHEMBL.TARGET | 2 | 100.0% | 2 |
NCI_CTCAE | 107 | 100.0% | 107 |
CUI | 3,933 | 31.77% | 12,379 |
ICD10AE | 11 | 84.62% | 13 |
NCIT | 70 | 3.77% | 1,858 |
NCI_GAIA | 1 | 100.0% | 1 |
NCI_CTRP | 1 | 100.0% | 1 |
total | 24,790 | 17.82% | 139,090 |
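As a quick cross-check of the "X% of nodes with at least one OMOP id" figures, the KG2 per-type totals reported above can be recomputed with a few lines of Python. The counts are copied from the KG2 total rows and the node-type table; the helper name is made up for this sketch.

```python
def coverage_percent(mapped, total):
    """Percent of nodes with at least one OMOP id, rounded to two decimals."""
    if total == 0:
        return 0.0
    return round(100.0 * mapped / total, 2)


# (nodes with OMOP ids, total nodes) per KG2 node type, from the totals above.
kg2_counts = {
    "chemical_substance": (42929, 2164168),
    "disease": (164764, 252301),
    "drug": (270122, 673599),
    "phenotypic_feature": (24790, 139090),
}

for node_type, (mapped, total) in kg2_counts.items():
    print(f"{node_type}: {coverage_percent(mapped, total)}% of nodes with at least one OMOP id")

overall_mapped = sum(mapped for mapped, _ in kg2_counts.values())
overall_total = sum(total for _, total in kg2_counts.values())
print(f"all types: {overall_mapped} of {overall_total} nodes")
```

The per-type mapped counts sum to 502,605, which agrees with the KG2 figure reported after the OxO step.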
@dkoslicki, the local COHD database was updated to include the mapping of all KG1 and KG2 disease, phenotype, chemical_substance, and drug curies. The scripts COHDIndex.py and overlay_clinical_info.py were updated as well.
The updated scripts passed all non-skipped tests (including the ones marked slow) in test_ARAX_overlay.py and test_ARAX_workflows.py.
Here are some results from testing COHDIndex in test_ARAX_overlay.
ubuntu@ip-172-31-10-28:~/work/RTX/code/ARAX/test$ pytest -v test_ARAX_overlay.py --runslow
============================================================================ test session starts ============================================================================
platform linux -- Python 3.7.7, pytest-5.4.3, py-1.8.1, pluggy-0.13.1 -- /home/ubuntu/miniconda3/envs/RTX_env/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/work/RTX/code/ARAX/test
collected 19 items
test_ARAX_overlay.py::test_jaccard PASSED [ 5%]
test_ARAX_overlay.py::test_add_node_pmids PASSED [ 10%]
test_ARAX_overlay.py::test_compute_ngd_virtual PASSED [ 15%]
test_ARAX_overlay.py::test_compute_ngd_attribute PASSED [ 21%]
test_ARAX_overlay.py::test_FET_ex1 PASSED [ 26%]
test_ARAX_overlay.py::test_FET_ex2 PASSED [ 31%]
test_ARAX_overlay.py::test_paired_concept_frequency_virtual PASSED [ 36%]
test_ARAX_overlay.py::test_paired_concept_frequency_attribute PASSED [ 42%]
test_ARAX_overlay.py::test_observed_expected_ratio_virtual PASSED [ 47%]
test_ARAX_overlay.py::test_observed_expected_ratio_attribute PASSED [ 52%]
test_ARAX_overlay.py::test_chi_square_virtual PASSED [ 57%]
test_ARAX_overlay.py::test_chi_square_attribute PASSED [ 63%]
test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual PASSED [ 68%]
test_ARAX_overlay.py::test_predict_drug_treats_disease_attribute PASSED [ 73%]
test_ARAX_overlay.py::test_issue_832 PASSED [ 78%]
test_ARAX_overlay.py::test_issue_832_non_drug PASSED [ 84%]
test_ARAX_overlay.py::test_issue_840 PASSED [ 89%]
test_ARAX_overlay.py::test_issue_840_non_drug PASSED [ 94%]
test_ARAX_overlay.py::test_issue_892 PASSED [100%]
============================================================================= warnings summary ==============================================================================
test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual
/home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
warnings.warn(msg, category=FutureWarning)
test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual
/home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
return f(*args, **kwds)
test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual
test_ARAX_overlay.py::test_predict_drug_treats_disease_attribute
test_ARAX_overlay.py::test_issue_832
test_ARAX_overlay.py::test_issue_832_non_drug
test_ARAX_overlay.py::test_issue_892
/home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.22.2.post1 when using version 0.22. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual
test_ARAX_overlay.py::test_predict_drug_treats_disease_attribute
test_ARAX_overlay.py::test_issue_832
test_ARAX_overlay.py::test_issue_832_non_drug
test_ARAX_overlay.py::test_issue_892
/home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.22.2.post1 when using version 0.22. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
test_ARAX_overlay.py::test_issue_892
/home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/site-packages/pymysql/cursors.py:170: Warning: (1300, "Invalid utf8mb4 character string: '80037D'")
result = self._query(query)
-- Docs: https://docs.pytest.org/en/latest/warnings.html
=============================================================== 19 passed, 13 warnings in 1539.35s (0:25:39) ================================================================
Please check it in the issue875 branch. If there are no other problems, I think it's ready to be merged into master, and this issue can be closed.
That’s awesome @chunyuma! Do you mind doing just a bit of spot checking as well:
chi_square, observed_expected_ratio, etc.), since the automated tests check that things don’t throw errors, but they rarely check actual values.
After that, a merge to master is fine with me!
@dkoslicki, here are some results of my investigation.
As I said in Slack, it is actually hard to accurately compare the local version and the API version, for two reasons:
get_paired_concept_freq, chi_square, observed_expected_ratio, etc. more times.
So here is the method that I used for the comparison. I used six sets of disease-drug pairs and, for each of get_paired_concept_freq, chi_square, and observed_expected_ratio, compared the average running time per disease-drug pair within each set, as well as the number of non-default results, between the local version and the API version.
get_paired_concept_freq
MONDO:0002049 - chemical_substance (Total number of pairs: 165)
metric | local version | API version |
Average Time | 0.051s | 0.95s |
#non-default results | 104 | 21 |
MONDO:0019783 - chemical_substance (Total number of pairs: 69)
metric | local version | API version |
Average Time | 0.076s | 0.4s |
#non-default results | 35 | 0 |
get_obs_exp_ratio
MONDO:0000190 - chemical_substance (Total number of pairs: 17)
metric | local version | API version |
Average Time | 0.215s | 1.032s |
#non-default results | 17 | 1 |
MONDO:0005324 - chemical_substance (Total number of pairs: 27)
metric | local version | API version |
Average Time | 0.1s | 0.2s |
#non-default results | 16 | 2 |
get_chi_square
MONDO:0005139 - chemical_substance (Total number of pairs: 110)
metric | local version | API version |
Average Time | 0.1715s | 0.39s |
#non-default results | 72 | 11 |
MONDO:0006652 - chemical_substance (Total number of pairs: 190)
metric | local version | API version |
Average Time | 0.11s | 0.3s |
#non-default results | 119 | 24 |
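The comparison method described above (average time per pair and number of non-default results) could be sketched roughly like this. It is not the script that produced the numbers; query_fn is a hypothetical stand-in for either the local or the API version of a function such as get_paired_concept_freq, and "non-default" is assumed here to mean any return value that differs from the given default.

```python
import time


def benchmark(query_fn, pairs, default=None):
    """Return (average seconds per pair, number of non-default results)."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    non_default = 0
    start = time.perf_counter()
    for omop1, omop2 in pairs:
        # Count every result that differs from the default value.
        if query_fn(omop1, omop2) != default:
            non_default += 1
    elapsed = time.perf_counter() - start
    return elapsed / len(pairs), non_default
```

Running benchmark once with the local function and once with the API function over the same list of pairs gives the two columns of each table.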
This issue has already been merged into the master branch.
As we've noticed, calling out to COHD via API can be very, very slow. I reached out to Casey Ta, and he graciously sent me a dump of their entire database! It can be accessed here. He also offered a mySQL dump if that's more helpful.
In order to speed up our overlay queries, it would be great if this could be integrated into our system (similar to our fast_ngd ala #654).
A few details:
So to accomplish this, we would want to spin up a database that:
disease, phenotype, chemical_substance, and drug to (possibly a list) of OMOP concept ids and store this in a database.
paired_concept_counts_associations.txt and its various fields (e.g. chi_square_p, ln_ratio, etc.)
COHD_local.get_paired_concept_freq(omop1, omop2, 3) instead of COHD.get_paired_concept_freq(omop1, omop2, 3) which currently hits the COHD API.
Medium high-priority as this would greatly reduce the time it takes for me (and others) to come up with good example DSL for the query_graph_interpreter_templates.yaml.
Looking for volunteers to take this task on, so I will assign many people; please de-assign yourself if you don't have the bandwidth to do this.
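To make the requested local lookup concrete, here is a minimal sketch of what a local get_paired_concept_freq could look like on top of SQLite. Everything about the schema is an assumption for illustration: the table and column names are invented, the third argument is treated as a dataset id, and pairs are assumed to be stored with the smaller concept id first. The sample row is made up.

```python
import sqlite3


def create_demo_db():
    """Create an in-memory database with a hypothetical paired-counts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE paired_concept_counts ("
        "dataset_id INTEGER, concept_id_1 INTEGER, concept_id_2 INTEGER, "
        "concept_count INTEGER, concept_frequency REAL)"
    )
    conn.execute(
        "CREATE INDEX idx_pair ON paired_concept_counts "
        "(dataset_id, concept_id_1, concept_id_2)"
    )
    return conn


def get_paired_concept_freq(conn, omop1, omop2, dataset_id):
    """Look up a concept pair locally instead of calling a remote API."""
    # Assumption: each pair is stored once, with the smaller id first.
    low, high = sorted((omop1, omop2))
    row = conn.execute(
        "SELECT concept_count, concept_frequency FROM paired_concept_counts "
        "WHERE dataset_id = ? AND concept_id_1 = ? AND concept_id_2 = ?",
        (dataset_id, low, high),
    ).fetchone()
    if row is None:
        return None
    return {"concept_count": row[0], "concept_frequency": row[1]}


demo = create_demo_db()
demo.execute("INSERT INTO paired_concept_counts VALUES (3, 100, 200, 50, 0.001)")
print(get_paired_concept_freq(demo, 200, 100, 3))
```

An index on (dataset_id, concept_id_1, concept_id_2) is what makes each lookup a direct hit rather than a table scan, which is the main reason a local copy can beat a network round trip per pair.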