Open andrewsu opened 5 years ago
Looks like plenty of data cleaning will be needed.

- number of DID_ID: 191111
- number of unique raw drug names: 34137
- number of unique umls preferred drug terms: 21807
- number of raw predicates (not unique, not null): 140181
- number of unique umls preferred indication terms: 6111
- number of DID entries with a predicate value: 106762
- number of DID entries where the predicate is a "marker/mechanism": 42411
- number of entries with predicate values that aren't 'marker/mechanism': 62913
- number of WD entities pulled by CAS number from DIDs with predicates: 9360
- number of WD entities pulled by UMLS "drug" CUIS from DIDs with predicates: 2077
- number of WD entities pulled by UMLS "phenotype" CUIS from DIDs with predicates: 1717
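For anyone wanting to reproduce counts like these, here is a minimal sketch of how they could be tallied once the DID rows are loaded as dictionaries. The column names `raw_drug_name` and `predicate` and the helper `summarize` are hypothetical, since the actual spreadsheet headers are not given in this thread. It is not the script that produced the numbers above.

```python
from collections import Counter

# Hypothetical column names; the real DID spreadsheet headers may differ.
DRUG_COL = "raw_drug_name"
PREDICATE_COL = "predicate"


def clean(value):
    """Treat None and whitespace-only strings as missing."""
    return (value or "").strip()


def summarize(rows):
    """Return basic counts for an iterable of dict-like DID rows."""
    rows = list(rows)

    # Unique raw drug names, ignoring empty cells.
    drugs = {clean(r.get(DRUG_COL)) for r in rows} - {""}

    # Non-null predicate values (not deduplicated).
    predicates = [clean(r.get(PREDICATE_COL)) for r in rows]
    predicates = [p for p in predicates if p]

    marker = sum(1 for p in predicates if p.lower() == "marker/mechanism")

    return {
        "entries": len(rows),
        "unique_raw_drug_names": len(drugs),
        "entries_with_predicate": len(predicates),
        "marker_mechanism": marker,
        "other_predicates": len(predicates) - marker,
        "predicate_counts": Counter(predicates),
    }


# Made-up sample rows, for illustration only.
sample = [
    {"raw_drug_name": "aspirin", "predicate": "treats"},
    {"raw_drug_name": "Aspirin ", "predicate": "marker/mechanism"},
    {"raw_drug_name": "ibuprofen", "predicate": ""},
]
print(summarize(sample))
```

Note that this sketch only strips whitespace, so `aspirin` and `Aspirin` count as two distinct raw names. Whether to case-fold or otherwise normalize names is one of the cleaning decisions still open.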
Load data from Merck's Drug Indication Database (DID), which is explicitly CC0-licensed: https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-016-0110-0
It contains roughly 192k indications, aggregated from many other public data sources, in an Excel file. The data will likely need some cleaning.
See also: https://github.com/SuLab/GeneWikiCentral/issues/87