Local copy and implementation of COHD

dkoslicki commented 4 years ago

As we've noticed, calling out to COHD via its API can be very, very slow. I reached out to Casey Ta, and he graciously sent me a dump of their entire database! It can be accessed here. He also offered a MySQL dump if that's more helpful.

To speed up our overlay queries, it would be great if this could be integrated into our system (similar to our fast_ngd, à la #654).

A few details:

COHD uses OMOP concept IDs as its identifiers.
This requires me to hit the xrefToOMOP endpoint, which basically just uses OxO to map CURIEs to OMOP concept ids. Example in our code here.

So to accomplish this, we would want to:

[x] Hit OxO ahead of time to map all KG1 and KG2 disease, phenotype, chemical_substance, and drug nodes to (possibly a list of) OMOP concept ids, and store the mappings in a database.
[x] Populate the database with paired_concept_counts_associations.txt and its various fields (e.g. chi_square_p, ln_ratio, etc.).
[x] Create a class that allows for a transparent call to this database, so that we can drop in COHD_local.get_paired_concept_freq(omop1, omop2, 3) in place of COHD.get_paired_concept_freq(omop1, omop2, 3), which currently hits the COHD API.
[x] "Script-ify" everything so that when KG2 and/or COHD is updated, the backend database can be updated as well (and so that devs like me can have local copies of the database as needed).
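
A minimal sketch of what the drop-in class from the third checkbox could look like, assuming a SQLite backend. The class name `COHDLocal`, the table name `paired_concept_counts`, its columns, and the reading of the third argument as a COHD dataset id are all illustrative assumptions, not taken from the actual dump.

```python
import sqlite3


class COHDLocal:
    """Hypothetical drop-in replacement that answers from a local SQLite file."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)

    def get_paired_concept_freq(self, concept_id_1, concept_id_2, dataset_id=1):
        """Return the stored co-occurrence record for a concept pair, or {}."""
        # Assumption: each pair may be stored only once, so check both orders.
        row = self.conn.execute(
            "SELECT concept_count, concept_frequency FROM paired_concept_counts "
            "WHERE dataset_id = ? AND "
            "((concept_id_1 = ? AND concept_id_2 = ?) OR "
            " (concept_id_1 = ? AND concept_id_2 = ?))",
            (dataset_id, concept_id_1, concept_id_2, concept_id_2, concept_id_1),
        ).fetchone()
        if row is None:
            return {}
        return {"concept_count": row[0], "concept_frequency": row[1]}


# Tiny in-memory demonstration with made-up numbers.
cohd_local = COHDLocal()
cohd_local.conn.execute(
    "CREATE TABLE paired_concept_counts (dataset_id INTEGER, "
    "concept_id_1 INTEGER, concept_id_2 INTEGER, "
    "concept_count INTEGER, concept_frequency REAL)"
)
cohd_local.conn.execute(
    "INSERT INTO paired_concept_counts VALUES (3, 111, 222, 50, 0.001)"
)
result = cohd_local.get_paired_concept_freq(222, 111, 3)
```

Keeping the method name and argument order identical to the API wrapper is what would make the swap transparent to callers.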

Medium high-priority as this would greatly reduce the time it takes for me (and others) to come up with good example DSL for the query_graph_interpreter_templates.yaml.

Looking for volunteers to take this task on, so I will assign many, and de-assign yourself if you don't have the bandwidth to do this.

dkoslicki commented 4 years ago

Note, there's also an RDF triple store dump, so there's an off chance that we could get OMOP concepts into KG2 directly and be able to skip the OXO mapping part, but I would need @saramsey to comment about if this is feasible (since we would need to connect the diseases, drugs, etc. to the OMOP nodes).

From Casey

Vincent has also done a bunch of work converting the COHD data into RDF modeled in Biolink. If you're interested, you can query that at https://trek.semanticscience.org/ or hit the download link to get the RDF dumps.

chunyuma commented 4 years ago

@dkoslicki, I think I have time to do it. I have already gone through the scripts under the Overlay folder, so I know how overlay_clinical_info.py works. One question: where should we store this database?

chunyuma commented 4 years ago

"script-ify" everything so that when KG2 and/or COHD is updated, the backend database can be updated as well (and so devs (like me) can have local copies of the database as needed as well).

I'm a little bit confused about this. Do we want to create a script that automatically pulls down the data from KG2 and/or COHD when they are updated and then integrates it into the local database? When COHD is updated, will we get an updated dump file of the entire database?

dkoslicki commented 4 years ago

@chunyuma Excellent! Thanks for volunteering!!

Re: “script-ify” I basically mean: create scripts that will automate every step of the process. That way, it can be deployed anywhere as needed. Similar to how python3 KGNodeIndex.py -b auto-creates the KGNode index, and can be run on local dev machines, or prod servers as needed.

As for where the database is to be stored, I would say: get the scripts/classes/etc. up and running on your own instance, and then after everything is checked that it works, we can work with Eric to get it created on the production server as well. I’d suggest doing this in a different branch so we can seamlessly integrate into master after you’ve got everything put together and confirmed it works.

Minor details:

You may want to make the code auto-download the data dump, or give it a CLI option to point to a different dump (as this won’t be stored in GitHub). That way we can “future-proof” for new releases of COHD.
The OxO mapping of disease, drug, etc. nodes to OMOP concept ids will probably be non-trivial and take quite a while, so making this a script as well makes sense.
As for where all these scripts will live, a natural place seems to be code/ARAX/KnowledgeSources, under a folder you create with a name something like “COHD_local”.
Your scripts may/will download data to that “COHD_local” folder, but the data shouldn’t be checked into GitHub (it will be too large). The drop-in replacement (e.g. COHD_local.get_paired_concept_freq(omop1, omop2, 3)) will know to look there for the data, similar to how KGNodeIndex creates a MySQL database that exists but isn’t checked into GitHub.
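
A rough sketch of the CLI option suggested in the first bullet, using argparse. The flag names `--dump` and `--data-dir` and the helper `resolve_dump_path` are hypothetical; only the archive name cohd-v2.tar.gz and the “COHD_local” folder name come from this thread.

```python
import argparse
from pathlib import Path

# Hypothetical default location; this folder would be kept out of version control.
DEFAULT_DATA_DIR = Path("COHD_local")


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Build a local COHD database (sketch).")
    parser.add_argument(
        "--dump",
        type=Path,
        default=None,
        help="path to an already-downloaded COHD dump (overrides the default)",
    )
    parser.add_argument(
        "--data-dir",
        type=Path,
        default=DEFAULT_DATA_DIR,
        help="folder where downloaded data and the database are kept",
    )
    return parser.parse_args(argv)


def resolve_dump_path(args):
    """Prefer an explicit --dump; otherwise expect the archive in the data dir."""
    if args.dump is not None:
        return args.dump
    return args.data_dir / "cohd-v2.tar.gz"


default_path = resolve_dump_path(parse_args([]))
custom_path = resolve_dump_path(parse_args(["--dump", "/tmp/newer_dump.tar.gz"]))
```

With this shape, a new COHD release only requires pointing `--dump` at the new archive rather than editing the script.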

chunyuma commented 4 years ago

@dkoslicki, it seems the data dump can't be downloaded by code. I suspect this is due to pCloud's encryption, which only lets users download the data from a browser even though the data can be shared with other people. I tried curl, wget, and the Python requests module; none of them could download the data. We can write a script to auto-download the data, but the host must allow programmatic downloads.

dkoslicki commented 4 years ago

@chunyuma The data can't be downloaded with wget or the like without a workaround: if you look at the source of the data dump page, the link is provided on line 1013 (but would need to be parsed properly): "downloadlink": "https:\/\/p-def7.pcloud.com\/cBZE9ntPmZdY1gS2ZZZv0rJ37Z2ZZGoFZkZELjcdpZupZapZDFZgFZAFZcJZ0pZspZxFZQpZu5ZVkZ57ZvFZ3ibtkZC5SrMxjQv8pIjBPE02dHTkR05OyX\/cohd-v2.tar.gz",

Let's not worry about this for now, as later we can work with Casey Ta to see if he can place it somewhere that we can download more easily.
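
For reference, "parsed properly" could look roughly like the following, assuming the page source contains a JSON-style `"downloadlink": "..."` entry with escaped slashes as quoted above. The function name and the sample page text are made up, and such links may be short-lived, so this is a sketch rather than a reliable download path.

```python
import json
import re

# Matches "downloadlink": "<JSON string>" and captures the quoted string,
# including any backslash escapes such as \/ inside it.
DOWNLOAD_LINK_PATTERN = re.compile(r'"downloadlink"\s*:\s*("(?:[^"\\]|\\.)*")')


def extract_download_link(page_source):
    """Return the unescaped download URL found in the page source, or None."""
    match = DOWNLOAD_LINK_PATTERN.search(page_source)
    if match is None:
        return None
    # json.loads turns the escaped JSON string (\/) into a plain URL (/).
    return json.loads(match.group(1))


# Made-up page fragment using a placeholder host.
sample_source = 'var data = { "downloadlink": "https:\\/\\/example.org\\/abc\\/cohd-v2.tar.gz", "size": 1 };'
link = extract_download_link(sample_source)
```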

saramsey commented 4 years ago

> @chunyuma Given the data can't be downloaded with wget or the like (without some workarounds: if you look at the source of the data dump, on line 1013 the link is provided (but would need to be parsed properly): "downloadlink": "https:\/\/p-def7.pcloud.com\/cBZE9ntPmZdY1gS2ZZZv0rJ37Z2ZZGoFZkZELjcdpZupZapZDFZgFZAFZcJZ0pZspZxFZQpZu5ZVkZ57ZvFZ3ibtkZC5SrMxjQv8pIjBPE02dHTkR05OyX\/cohd-v2.tar.gz",
>
> Let's not worry about this for now, as later we can work with Casey Ta to see if he can place it somewhere that we can download more easily.

This link no longer seems to work:

wget https://p-def7.pcloud.com/cBZE9ntPmZdY1gS2ZZZv0rJ37Z2ZZGoFZkZELjcdpZupZapZDFZgFZAFZcJZ0pZspZxFZQpZu5ZVkZ57ZvFZ3ibtkZC5SrMxjQv8pIjBPE02dHTkR05OyX/cohd-v2.tar.gz
--2020-07-08 16:06:53--  https://p-def7.pcloud.com/cBZE9ntPmZdY1gS2ZZZv0rJ37Z2ZZGoFZkZELjcdpZupZapZDFZgFZAFZcJZ0pZspZxFZQpZu5ZVkZ57ZvFZ3ibtkZC5SrMxjQv8pIjBPE02dHTkR05OyX/cohd-v2.tar.gz
Resolving p-def7.pcloud.com (p-def7.pcloud.com)... 74.120.9.15
Connecting to p-def7.pcloud.com (p-def7.pcloud.com)|74.120.9.15|:443... connected.
HTTP request sent, awaiting response... 410 Gone
2020-07-08 16:06:53 ERROR 410: Gone.

dkoslicki commented 4 years ago

Odd. I guess after we get a local system up and running, we can work with Casey to make the data dump persistent and accessible.

chunyuma commented 4 years ago

@saramsey, it works with this link

https://p-def7.pcloud.com/cBZE9ntPmZdY1gS2ZZZldlF37Z2ZZF00ZkZELjcdpZupZapZDFZgFZAFZcJZ0pZspZxFZQpZu5ZVkZ57ZvFZ3ibtkZHDNaREulz4yiRznCQ1HJb05lPECy/cohd-v2.tar.gz

But it can only be downloaded via a browser, not with wget.

chunyuma commented 4 years ago

Update: The local COHD database has been set up and is stored on arax.rtx.ai, and the new script COHDIndex.py can replace all functions of the original script QueryCOHD.py that are used in overlay_clinical_info.py. I carefully compared the results from COHDIndex and QueryCOHD: they return the same results, and COHDIndex is significantly faster than QueryCOHD in overlay(action=overlay_clinical_info,...).

I also modified overlay_clinical_info.py in the issue875 branch to make it compatible with COHDIndex.py. COHDIndex passed all tests in test_ARAX_overlay.py. Here is the result:

(RTX_env) ubuntu@ip-172-31-10-28:~/work/RTX/code/ARAX/test$ pytest -v test_ARAX_overlay.py
===================================================================== test session starts ======================================================================
platform linux -- Python 3.7.7, pytest-5.4.3, py-1.8.1, pluggy-0.13.1 -- /home/ubuntu/miniconda3/envs/RTX_env/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/work/RTX/code/ARAX/test
collected 18 items

test_ARAX_overlay.py::test_jaccard PASSED                                                                                                                [  5%]
test_ARAX_overlay.py::test_add_node_pmids PASSED                                                                                                         [ 11%]
test_ARAX_overlay.py::test_compute_ngd_virtual PASSED                                                                                                    [ 16%]
test_ARAX_overlay.py::test_compute_ngd_attribute PASSED                                                                                                  [ 22%]
test_ARAX_overlay.py::test_FET_ex1 PASSED                                                                                                                [ 27%]
test_ARAX_overlay.py::test_FET_ex2 PASSED                                                                                                                [ 33%]
test_ARAX_overlay.py::test_paired_concept_frequency_virtual PASSED                                                                                       [ 38%]
test_ARAX_overlay.py::test_paired_concept_frequency_attribute PASSED                                                                                     [ 44%]
test_ARAX_overlay.py::test_observed_expected_ratio_virtual PASSED                                                                                        [ 50%]
test_ARAX_overlay.py::test_observed_expected_ratio_attribute PASSED                                                                                      [ 55%]
test_ARAX_overlay.py::test_chi_square_virtual PASSED                                                                                                     [ 61%]
test_ARAX_overlay.py::test_chi_square_attribute PASSED                                                                                                   [ 66%]
test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual PASSED                                                                                    [ 72%]
test_ARAX_overlay.py::test_predict_drug_treats_disease_attribute PASSED                                                                                  [ 77%]
test_ARAX_overlay.py::test_issue_832 PASSED                                                                                                              [ 83%]
test_ARAX_overlay.py::test_issue_832_non_drug PASSED                                                                                                     [ 88%]
test_ARAX_overlay.py::test_issue_840 PASSED                                                                                                                   [ 94%]
test_ARAX_overlay.py::test_issue_840_non_drug PASSED                                                                                                          [100%]

Currently, the COHDIndex only works for KG1.

Remaining steps:

[x] implement all functions of QueryCOHD.py in COHDIndex
[x] finish the mapping of KG2 disease, phenotype, chemical_substance, and drug nodes to (possibly a list of) OMOP concept ids and store this in the database

Once these steps are done, it can be merged into the master branch.

saramsey commented 4 years ago

This link goes to the download page at the Athena portal for downloading mappings between OMOP and other biomedical vocabularies like ICD10, MeSH, SNOMED, LOINC, RXNORM, etc.

https://athena.ohdsi.org/vocabulary/list

dkoslicki commented 4 years ago

So, @chunyuma, the plan is: use node_synonymizer.py -l <curie> -k KG2 to look up all synonyms that belong to the biomedical vocabularies that map to OMOP concept IDs in the link Steve gave above. Use this to populate the CURIE-to-OMOP mappings in your sqlite database.
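
A sketch of that plan, under stated assumptions: the Athena download is taken to include a tab-delimited concept table with `concept_id`, `vocabulary_id`, and `concept_code` columns, and the `PREFIX_TO_VOCABULARY` dictionary (CURIE prefix to Athena vocabulary name) is an illustrative guess. The synonym list is passed in directly rather than calling the NodeSynonymizer, whose Python interface is not shown in this thread.

```python
import csv
import io
from collections import defaultdict

# Assumed mapping from CURIE prefixes to Athena vocabulary_id values.
PREFIX_TO_VOCABULARY = {"MESH": "MeSH", "RXNORM": "RxNorm", "ICD10CM": "ICD10CM"}


def load_concept_lookup(concept_file):
    """Index an Athena-style concept table by (vocabulary_id, concept_code)."""
    lookup = defaultdict(set)
    for row in csv.DictReader(concept_file, delimiter="\t"):
        key = (row["vocabulary_id"], row["concept_code"])
        lookup[key].add(int(row["concept_id"]))
    return lookup


def omop_ids_for_synonyms(synonym_curies, lookup):
    """Union the OMOP concept ids of every synonym with a known vocabulary."""
    omop_ids = set()
    for curie in synonym_curies:
        prefix, _, code = curie.partition(":")
        vocabulary = PREFIX_TO_VOCABULARY.get(prefix.upper())
        if vocabulary is not None:
            omop_ids |= lookup.get((vocabulary, code), set())
    return sorted(omop_ids)


# Made-up rows standing in for the real concept table.
concept_tsv = io.StringIO(
    "concept_id\tvocabulary_id\tconcept_code\n"
    "1001\tMeSH\tD000001\n"
    "1002\tICD10CM\tA00\n"
)
lookup = load_concept_lookup(concept_tsv)
mapped = omop_ids_for_synonyms(["MESH:D000001", "ICD10CM:A00", "DOID:1"], lookup)
```

Because ids from all synonyms are unioned, a single curie can end up with a list of OMOP ids, which matches the "(possibly a list)" wording in the task description.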

chunyuma commented 4 years ago

Do we need the CDT (Current Dental Terminology (ADA)) and MedDRA (Medical Dictionary for Regulatory Activities (MSSO)) data? From the NodeSynonymizer results, 156,982 curies have a MedDRA synonym and 6 curies have a CDT synonym. However, downloading these data from the Athena portal requires a license.

dkoslicki commented 4 years ago

@chunyuma do these CDT and MedDRA curies correspond to drugs, chemical_substances, and/or phenotypic_features? If so, then the 156K MedDRA curies seem the most important to get (missing 6 curies from CDT doesn't seem like too big of a deal). Perhaps @saramsey already has a license for CDT since, if it's in the NodeSynonymizer, it must be in KG2 somewhere...

chunyuma commented 4 years ago

Of these 156K MedDRA curies, 2,273 are drug, 1,395 are chemical_substance, 119,382 are disease, and 33,932 are phenotypic_feature.

Of the 6 CDT curies, 2 are drug, 3 are chemical_substance, and 1 is phenotypic_feature.

I sent an email to OHDSI (Observational Health Data Sciences and Informatics), and they told me that for MedDRA, we as an academic institution might get a free license from the MSSO.

dkoslicki commented 4 years ago

@chunyuma great that we might get a free license! Check with @saramsey whether we already have a license, and if not, let us (me and/or Steve) know how we can help with getting the license for MedDRA.

chunyuma commented 4 years ago

@dkoslicki, it seems we may already have the license for MedDRA, because I saw that @saramsey previously posted a related issue, #891, about MedDRA. We can check with @saramsey in the AHM meeting on Wednesday. If we don't have it, I might need help from you or Steve to get this license.

I'm currently running the script that maps the curies to OMOP ids based on the data I can get from the Athena portal. This might take a few days. Once we have the MedDRA data, we can simply add it to the existing data.

chunyuma commented 4 years ago

Update: Since Steve told me that we don't have a license to download the MedDRA data from the Athena portal, I'm trying to contact the MedDRA MSSO Help Desk to see if we can get a free, special license.

Based on the other data that I can download from the Athena portal, I mapped all KG1 and KG2 disease, phenotype, chemical_substance, and drug nodes to (possibly a list of) OMOP concept ids. Here is the summary:

KG1: of 33,888 total disease, phenotype, and chemical_substance nodes, 18,981 were found to have at least one OMOP id via the COHD API, while 15,253 were found via the data downloaded from the Athena portal (excluding MedDRA data).

KG2: of 3,229,158 total disease, phenotype, chemical_substance, and drug nodes, 443,933 were found to have at least one OMOP id via the data downloaded from the Athena portal (excluding MedDRA data).

dkoslicki commented 4 years ago

Thanks for the update @chunyuma! Let us know if you run into any problems with the MedDRA license. Now that you have mapped many of the nodes to OMOP IDs, if they don't want to give you a license, we could at least hit COHD for the remaining unmapped curies (far fewer now, thanks to your work): 156K calls is definitely better than millions.

chunyuma commented 4 years ago

@dkoslicki, for KG2, only 443,933 nodes were mapped to at least one OMOP id. That leaves about 2.8 million unmapped curies (3,229,158 - 443,933 = 2,785,225 of the disease, phenotype, chemical_substance, and drug nodes in KG2), so I don't think it is practical to call the COHD API for them. I also suspect that many of these remaining curies have no OMOP id at all, even via the COHD API.

Also, if we include the synonyms of these remaining curies when calling the COHD API, the number of calls could be much larger than that.

chunyuma commented 4 years ago

@dkoslicki, actually, not all of those 156K MedDRA-associated curies lack OMOP ids.

Say a curie has multiple synonyms from the NodeSynonymizer, including Mesh:xxxx, ATC:xxx, ICD10CM:xxx, and MedDRA:xxx. If Mesh:xxxx, ATC:xxx, and ICD10CM:xxx each have a corresponding OMOP id, then this curie already has three OMOP ids. Even if we can download the MedDRA data, we might just be adding more OMOP ids to the curie's existing list.

chunyuma commented 4 years ago

Update: After including the MedDRA data, the number of nodes with at least one OMOP id is 15,842 in KG1 and 480,123 in KG2.

@dkoslicki, if we don't need to hit the COHD API for the remaining unmapped curies, then I can start building the database.

chunyuma commented 4 years ago

Update: After hitting the OxO API, the number of nodes with at least one OMOP id increases to 21,148 in KG1 and to 502,605 in KG2.

dkoslicki commented 4 years ago

Awesome, thanks @chunyuma! For each of the relevant node types (eg. chemical_substance, disease, etc.), what is that in terms of percent? I.e. disease: X% of nodes with at least one OMOP id

chunyuma commented 4 years ago

@dkoslicki, here are some statistics for KG1 and KG2.

KG1


type	number	percent
chemical_substance	2,226	6.57%
disease	19,573	57.76%
phenotypic_feature	12,089	35.67%
total	33,888	100%

Within chemical_substance


prefix	has OMOP ids	percent	total
CHEMBL.COMPOUND	2,104	94.52%	2,226
total	2,104	94.52%	2,226

Within disease


prefix	has OMOP ids	percent	total
DOID	8,132	73.31%	11,092
MONDO	1	100%	1
OMIM	6,597	77.79%	8,480
total	14,730	75.26%	19,573

Within phenotypic_feature


prefix	has OMOP ids	percent	total
AQTLTrait	0	0%	1
HP	4,314	35.69%	12,088
total	4,314	35.69%	12,089

KG2


type	number	percent
chemical_substance	2,164,168	67.02%
disease	252,301	7.81%
drug	673,599	20.86%
phenotypic_feature	139,090	4.31%
total	3,229,158	100%

Within chemical_substance


prefix	has OMOP ids	percent	total
CHEBI	8,575	7.61%	112,632
GS	7	1.31%	536
FOODON	1	7.69%	13
CHV	708	38.33%	1,847
NCI_DICOM	0	0.0%	1
NCI_CDISC-GLOSS	0	0.0%	4
HCDT	0	0.0%	1
HL7	1	1.96%	51
MESH	1,680	1.72%	97,674
CHEMBL.COMPOUND	12,835	0.71%	1,795,373
MTHSPL	2,081	66.51%	3,129
NCI_DCP	11	27.5%	40
USP	118	89.39%	132
TTD	106	21.33%	497
SNOMEDCT_VET	0	0.0%	5
NCI_FDA	533	49.03%	1,087
NANDA-I	0	0.0%	3
NCI_DTP	11	39.29%	28
BTO	0	0.0%	1
NCI_NICHD	0	0.0%	2
NCI_CRCH	49	24.75%	198
EFO	26	34.67%	75
NCI_NCI-GLOSS	56	34.36%	163
LNC	988	44.71%	2,210
GTPI	58	2.4%	2,412
VANDF	213	33.86%	629
DRUGBANK	147	39.95%	368
NCI_CDISC	60	31.91%	188
NCI_NCPDP	168	46.93%	358
CHEMBL.TARGET	20	80.0%	25
CUI	10,359	7.52%	137,764
NDDF	492	20.26%	2,429
HCPCS	162	100.0%	162
NCIT	0	0.0%	3
RXNORM	3,454	99.97%	3,455
OBO	0	0.0%	618
PDQ	10	18.18%	55
total	42,929	1.98%	2,164,168

Within disease


prefix	has OMOP ids	percent	total
OMIM	0	0.0%	3
DSM5	156	21.67%	720
CHV	6,810	73.13%	9,312
NCI_CDISC-GLOSS	1	100.0%	1
HL7	1	100.0%	1
MESH	4,163	69.15%	6,020
MEDDRA	22,274	100.0%	22,274
DOID	8,656	73.21%	11,823
MTHICPC2EAE	7	53.85%	13
SNOMEDCT_VET	710	32.98%	2,153
NCI_CTCAE_3	1	100.0%	1
NCI_FDA	151	73.3%	206
NANDA-I	193	73.11%	264
NCI_NICHD	2,002	71.81%	2,788
SO	61	2.94%	2,073
MONDO	16,415	75.96%	21,609
EFO	2,905	84.47%	3,439
NCI_NCI-GLOSS	729	57.45%	1,269
LNC	755	48.06%	1,571
NCI_CDISC	85	20.94%	406
NCI_CTEP-SDC	97	39.92%	243
CCS	48	28.24%	170
NCI_RENI	0	0.0%	164
HCPT	0	0.0%	1
NCI_CTCAE	378	99.74%	379
CUI	81,703	60.5%	135,050
ICD10AE	338	55.87%	605
PDQ	16	23.88%	67
Orphanet	7,712	80.94%	9,528
NCIT	7,887	45.96%	17,161
NCI_GAIA	4	100.0%	4
NCI_CTRP	504	17.6%	2,863
NCI_KEGG	1	100.0%	1
MTHICPC2ICD10AE	1	0.84%	119
total	164,764	65.3%	252,301

Within drug


prefix	has OMOP ids	percent	total
CHEBI	151	53.93%	280
GS	196	0.5%	38,843
CHV	4,027	49.44%	8,146
NCI_CDISC-GLOSS	1	16.67%	6
NCI_DICOM	0	0.0%	1
HL7	0	0.0%	2
MESH	2,949	5.51%	53,493
MTHSPL	19,727	16.58%	118,987
NCI_DCP	379	54.3%	698
USP	1,824	57.94%	3,148
SNOMEDCT_VET	35	48.61%	72
NCI_BRIDG	0	0.0%	1
NCI_FDA	4,863	47.2%	10,302
NANDA-I	0	0.0%	1
NCI_DTP	176	32.18%	547
NCI_NICHD	7	43.75%	16
NCI_CRCH	70	70.0%	100
EFO	1	50.0%	2
NCI_NCI-GLOSS	806	50.63%	1,592
LNC	1,651	59.15%	2,791
VANDF	3,186	19.89%	16,022
DRUGBANK	2,646	57.42%	4,608
NCI_CDISC	4	57.14%	7
HCPT	0	0.0%	5
CUI	118,081	41.66%	283,449
NDDF	4,936	23.16%	21,314
HCPCS	873	100.0%	873
NCIT	1	14.29%	7
RXNORM	102,965	96.27%	106,949
NCI_CTRP	7	58.33%	12
PDQ	560	42.26%	1,325
total	270,122	40.1%	673,599

Within phenotypic_feature


prefix	has OMOP ids	percent	total
OMIM	11,261	10.79%	104,358
DSM5	2	100.0%	2
CHV	940	67.97%	1,383
AQTLTrait	0	0.0%	1
NCI_CDISC-GLOSS	1	50.0%	2
HL7	0	0.0%	1
HP	4,699	32.78%	14,336
MESH	1	100.0%	1
MEDDRA	2,435	100.0%	2,435
MTHICPC2EAE	2	28.57%	7
SNOMEDCT_VET	83	61.94%	134
NCI_FDA	49	66.22%	74
NANDA-I	66	51.56%	128
NCI_NICHD	134	89.33%	150
UPHENO	1	100.0%	1
MONDO	3	100.0%	3
MP	2	100.0%	2
SYMP	568	60.17%	944
EFO	104	69.33%	150
NCI_NCI-GLOSS	42	89.36%	47
LNC	135	53.36%	253
NBO	124	41.89%	296
NCI_CDISC	7	46.67%	15
CCS	6	100.0%	6
CHEMBL.TARGET	2	100.0%	2
NCI_CTCAE	107	100.0%	107
CUI	3,933	31.77%	12,379
ICD10AE	11	84.62%	13
NCIT	70	3.77%	1,858
NCI_GAIA	1	100.0%	1
NCI_CTRP	1	100.0%	1
total	24,790	17.82%	139,090
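
For reproducibility, per-prefix tables like the ones above could be tallied along these lines. This is a sketch that assumes the mapping is available as a dictionary from CURIE to a list of OMOP ids; the function name and the sample data are made up.

```python
from collections import Counter


def coverage_by_prefix(curie_to_omop_ids):
    """Return (prefix, has_omop_ids, percent, total) rows, sorted by prefix."""
    totals = Counter()
    mapped = Counter()
    for curie, omop_ids in curie_to_omop_ids.items():
        prefix = curie.split(":", 1)[0]
        totals[prefix] += 1
        if omop_ids:
            mapped[prefix] += 1
    return [
        (prefix, mapped[prefix], round(100.0 * mapped[prefix] / totals[prefix], 2), totals[prefix])
        for prefix in sorted(totals)
    ]


# Made-up mapping: three DOID nodes (two mapped) and one HP node (mapped).
sample_mapping = {
    "DOID:1": [10],
    "DOID:2": [],
    "DOID:3": [5],
    "HP:1": [11, 12],
}
rows = coverage_by_prefix(sample_mapping)
```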

chunyuma commented 4 years ago

@dkoslicki, the local COHD database was updated to include the mapping of all KG1 and KG2 disease, phenotype, chemical_substance, and drug curies. The scripts COHDIndex.py and overlay_clinical_info.py were updated as well.

The updated scripts passed all non-skipped tests (including the ones marked as slow) in test_ARAX_overlay.py and test_ARAX_workflows.py.

Here are some results for testing the COHDIndex in test_ARAX_overlay.

ubuntu@ip-172-31-10-28:~/work/RTX/code/ARAX/test$ pytest -v test_ARAX_overlay.py --runslow
============================================================================ test session starts ============================================================================
platform linux -- Python 3.7.7, pytest-5.4.3, py-1.8.1, pluggy-0.13.1 -- /home/ubuntu/miniconda3/envs/RTX_env/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/work/RTX/code/ARAX/test
collected 19 items                                                                                                                                                          

test_ARAX_overlay.py::test_jaccard PASSED                                                                                                                             [  5%]
test_ARAX_overlay.py::test_add_node_pmids PASSED                                                                                                                      [ 10%]
test_ARAX_overlay.py::test_compute_ngd_virtual PASSED                                                                                                                 [ 15%]
test_ARAX_overlay.py::test_compute_ngd_attribute PASSED                                                                                                               [ 21%]
test_ARAX_overlay.py::test_FET_ex1 PASSED                                                                                                                             [ 26%]
test_ARAX_overlay.py::test_FET_ex2 PASSED                                                                                                                             [ 31%]
test_ARAX_overlay.py::test_paired_concept_frequency_virtual PASSED                                                                                                    [ 36%]
test_ARAX_overlay.py::test_paired_concept_frequency_attribute PASSED                                                                                                  [ 42%]
test_ARAX_overlay.py::test_observed_expected_ratio_virtual PASSED                                                                                                     [ 47%]
test_ARAX_overlay.py::test_observed_expected_ratio_attribute PASSED                                                                                                   [ 52%]
test_ARAX_overlay.py::test_chi_square_virtual PASSED                                                                                                                  [ 57%]
test_ARAX_overlay.py::test_chi_square_attribute PASSED                                                                                                                [ 63%]
test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual PASSED                                                                                                 [ 68%]
test_ARAX_overlay.py::test_predict_drug_treats_disease_attribute PASSED                                                                                               [ 73%]
test_ARAX_overlay.py::test_issue_832 PASSED                                                                                                                           [ 78%]
test_ARAX_overlay.py::test_issue_832_non_drug PASSED                                                                                                                  [ 84%]
test_ARAX_overlay.py::test_issue_840 PASSED                                                                                                                           [ 89%]
test_ARAX_overlay.py::test_issue_840_non_drug PASSED                                                                                                                  [ 94%]
test_ARAX_overlay.py::test_issue_892 PASSED                                                                                                                           [100%]

============================================================================= warnings summary ==============================================================================
test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual
  /home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
    warnings.warn(msg, category=FutureWarning)

test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual
  /home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
    return f(*args, **kwds)

test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual
test_ARAX_overlay.py::test_predict_drug_treats_disease_attribute
test_ARAX_overlay.py::test_issue_832
test_ARAX_overlay.py::test_issue_832_non_drug
test_ARAX_overlay.py::test_issue_892
  /home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.22.2.post1 when using version 0.22. This might lead to breaking code or invalid results. Use at your own risk.
    UserWarning)

test_ARAX_overlay.py::test_predict_drug_treats_disease_virtual
test_ARAX_overlay.py::test_predict_drug_treats_disease_attribute
test_ARAX_overlay.py::test_issue_832
test_ARAX_overlay.py::test_issue_832_non_drug
test_ARAX_overlay.py::test_issue_892
  /home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.22.2.post1 when using version 0.22. This might lead to breaking code or invalid results. Use at your own risk.
    UserWarning)

test_ARAX_overlay.py::test_issue_892
  /home/ubuntu/miniconda3/envs/RTX_env/lib/python3.7/site-packages/pymysql/cursors.py:170: Warning: (1300, "Invalid utf8mb4 character string: '80037D'")
    result = self._query(query)

-- Docs: https://docs.pytest.org/en/latest/warnings.html
=============================================================== 19 passed, 13 warnings in 1539.35s (0:25:39) ================================================================

Please check it in branch issue875. If there are no other problems, I think it's ready to be merged into master and this issue can be closed.

dkoslicki commented 4 years ago

That’s awesome @chunyuma! Do you mind doing just a bit of spot checking as well:

[x] Take a few drugs and diseases and see if the local version and the API version return similar (enough) numbers for each of the functionalities of COHD (e.g. chi_square, observed_expected_ratio, etc.), since the automated tests check that things don’t throw errors but rarely check actual values.
[x] Verify that the local version is indeed faster than the API version (I completely believe it is, but it would be nice to have hard data to back this up).
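
One way the value spot check could be scripted, as a sketch: it assumes both versions can be wrapped as functions that take a concept pair and return a single number, which may not match the real return types. The function name, tolerance, and sample numbers are made up.

```python
import math


def find_mismatches(pairs, local_fn, api_fn, rel_tol=1e-6):
    """Return (pair, local_value, api_value) for pairs that disagree."""
    mismatches = []
    for pair in pairs:
        local_value = local_fn(*pair)
        api_value = api_fn(*pair)
        if not math.isclose(local_value, api_value, rel_tol=rel_tol):
            mismatches.append((pair, local_value, api_value))
    return mismatches


# Stand-in lookups with made-up numbers; swap in the real local and API calls.
local_values = {(111, 222): 0.0010, (111, 333): 0.0200}
api_values = {(111, 222): 0.0010, (111, 333): 0.0250}

mismatches = find_mismatches(
    list(local_values),
    lambda a, b: local_values[(a, b)],
    lambda a, b: api_values[(a, b)],
)
```

An empty list would mean the two versions agree on every sampled pair within the chosen tolerance.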

After that, a merge to master is fine with me!

chunyuma commented 4 years ago

@dkoslicki, here are some results from my investigation.

As I said in Slack, it is hard to compare the local version and the API version accurately, for two reasons:

The local version has more mappings than the API version because we also considered the synonyms. This might cause the local version to call get_paired_concept_freq, chi_square, observed_expected_ratio, etc. more times.
The API response time is unstable, probably due to caching (the cache remembers the result for a certain time interval when I call the API for the same curie multiple times).

So here is the method I used for the comparison. I took six sets of disease-drug pairs, two for each of get_paired_concept_freq, chi_square, and observed_expected_ratio. Within each set, I compared the average running time per disease-drug pair and the number of non-default results between the local version and the API version.
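
The two measurements described here (average time per pair and number of non-default results) can be sketched as below. What counts as a "default" result is an assumption (`None` here), `fake_query` is a stand-in for the real local or API call, and the CHEMBL identifiers are invented.

```python
import time


def benchmark(pairs, query_fn, default=None):
    """Return (average seconds per pair, number of non-default results)."""
    if not pairs:
        raise ValueError("need at least one pair to benchmark")
    non_default = 0
    start = time.perf_counter()
    for pair in pairs:
        if query_fn(*pair) != default:
            non_default += 1
    elapsed = time.perf_counter() - start
    return elapsed / len(pairs), non_default


def fake_query(disease, drug):
    # Stand-in for a local or API lookup: only one pair has data.
    return 0.5 if drug == "CHEMBL.COMPOUND:X1" else None


pairs = [
    ("MONDO:0002049", "CHEMBL.COMPOUND:X1"),
    ("MONDO:0002049", "CHEMBL.COMPOUND:X2"),
]
average_seconds, non_default_count = benchmark(pairs, fake_query)
```

Running the same pairs through both versions, ideally more than once, would help separate real speed differences from the API caching effect noted above.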

Test for `get_paired_concept_freq`

`MONDO:0002049` - `chemical_substance` (Total number of pairs: 165)


metric	local version	API version
Average Time	0.051s	0.95s
#non-default results	104	21

`MONDO:0019783` - `chemical_substance` (Total number of pairs: 69)


metric	local version	API version
Average Time	0.076s	0.4s
#non-default results	35	0

Test for `get_obs_exp_ratio`

`MONDO:0000190` - `chemical_substance` (Total number of pairs: 17)


metric	local version	API version
Average Time	0.215s	1.032s
#non-default results	17	1

`MONDO:0005324` - `chemical_substance` (Total number of pairs: 27)


metric	local version	API version
Average Time	0.1s	0.2s
#non-default results	16	2

Test for `get_chi_square`

`MONDO:0005139` - `chemical_substance` (Total number of pairs: 110)


metric	local version	API version
Average Time	0.1715s	0.39s
#non-default results	72	11

`MONDO:0006652` - `chemical_substance` (Total number of pairs: 190)


metric	local version	API version
Average Time	0.11s	0.3s
#non-default results	119	24

chunyuma commented 4 years ago

The changes for this issue have already been merged into the master branch.

RTXteam / RTX