cancer variant databases for ingest

mellybelly commented 7 years ago

see this nice list: https://github.com/seandavi/awesome-cancer-variant-databases

Related to #15. @dnahotline, @mbrush, @stuppie, @pnrobinson: can you prioritize these and make a Dipper/Wikidata/BioThings plan? We would like most of these available for Monarch too, so this is a good candidate for joint ingest planning.

pnrobinson commented 7 years ago

Hi everybody -- I am starting to understand how Dipper works thanks to your help. I think it would be good to have a discussion of how we want to set up the model. There are a few issues with data ingest, such as the fact that OncoKB just has a gene symbol and a protein mutation (e.g., A423P), and we would prefer to have the genomic coordinates as well (and perhaps a preferred transcript for the mutations). This is addressable but will require some finessing. I am hoping to extend the Python hgvs package to be able to go from "p." to "c."; Reece said that going from "c." to "g." is now functioning well. Another useful thing would be to add the cancer type to the OncoKB data by pulling it from the abstracts. I am wondering if it would be useful for us to get in contact with that group and propose that they improve/extend their data model.
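
As a rough illustration of the first normalization step mentioned above, here is a minimal sketch that turns a short protein change such as A423P into an HGVS-style "p." string. The function name `short_to_hgvs_p` is hypothetical, it only handles simple single-residue substitutions, and it does not attempt the harder "p." to "c." or "c." to "g." mapping, which needs a transcript and a sequence source.

```python
import re

# Standard one-letter to three-letter amino acid codes; "*" is a stop codon.
AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}

# Matches simple substitutions only, e.g. "A423P" or "K99E".
_SUBSTITUTION = re.compile(r"^([A-Z*])(\d+)([A-Z*])$")


def short_to_hgvs_p(change):
    """Return an HGVS-style protein string, or None if not a simple substitution.

    Hypothetical helper for illustration; not part of Dipper or hgvs.
    """
    match = _SUBSTITUTION.match(change.strip())
    if match is None:
        return None
    ref, pos, alt = match.groups()
    if ref not in AA3 or alt not in AA3:
        return None  # letter is not a standard amino acid code
    return f"p.{AA3[ref]}{pos}{AA3[alt]}"


if __name__ == "__main__":
    for label in ["A423P", "K99E", "exon 19 deletion"]:
        print(label, "->", short_to_hgvs_p(label))
```

Anything the helper cannot parse comes back as `None`, so free-text alterations can be set aside for manual review rather than silently mangled.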

stuppie commented 7 years ago

For DGIdb, see the ticket here: https://github.com/monarch-initiative/dipper/issues/446. Direct communication with the DGIdb team is here: https://github.com/griffithlab/dgi-db/issues/141 and https://github.com/griffithlab/dgi-db/issues/142. As a side note: DGIdb includes CIViC, DoCM, and MyCancerGenome.

mbrush commented 7 years ago

'Cancer variant databases' covers a broad and diverse domain. The summary below attempts to tease out some of the different datatypes to consider in this space, and begins a list of data sources to consider/prioritize. Others feel free to add/modify as needed - I think you can directly edit my comment here if you’d like.

Data Types

Primary Variant 'Associations'

Pathogenicity / Diagnostic - describes a variant's causation of / correlation with disease (and therefore its diagnostic utility)
Predictive - describes a variant's impact on response to treatment for a particular disease
Prognostic - describes a variant’s impact on disease progression, severity, or patient survival
Predisposition - describes how a variant may confer susceptibility to disease
Functional Effect - describes impact of variant on protein function
Drug Interaction - describes molecular interactions between gene products and drugs
Expression - describes aspects of gene expression in different conditions

Additional Variant 'Metadata'

Variant identifiers - there is no universal system/authority here. Identifier usage is inconsistent and sparse.
  a. ClinVar and dbSNP are databases that provide variant identifiers, but many variants are not registered there, and most other sources do not reference these identifiers.
  b. Often an HGVS label is all we get, and often we only get a protein-level name (e.g. K99E) that must be translated to a genomic variant name.
  c. Efforts such as the ClinGen Allele Registry and the GA4GH/VICC VMC model are working on solutions we could potentially adopt/partner with.
Variant position - as noted, position is often bundled in with the variant's name or id (e.g. when HGVS is used). As metadata, position can be specified in different contexts (e.g. protein vs. transcript vs. chromosome), and may need to be translated between them.
Variant origin - germline vs somatic origin of the mutation
Association evidence/provenance - sources, methods, and data used to support a variant association
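
To make the data types above a bit more concrete, here is a minimal sketch of how one ingested record could be represented. All class and field names (`AssociationType`, `VariantOrigin`, `VariantAssociation`, and so on) are assumptions for illustration only, not an agreed Dipper or Monarch model, and the gene symbol in the example is a placeholder.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class AssociationType(Enum):
    """Primary association types from the list above (names are assumptions)."""
    PATHOGENICITY = "pathogenicity/diagnostic"
    PREDICTIVE = "predictive"
    PROGNOSTIC = "prognostic"
    PREDISPOSITION = "predisposition"
    FUNCTIONAL_EFFECT = "functional effect"
    DRUG_INTERACTION = "drug interaction"
    EXPRESSION = "expression"


class VariantOrigin(Enum):
    GERMLINE = "germline"
    SOMATIC = "somatic"
    UNKNOWN = "unknown"


@dataclass
class VariantAssociation:
    """One variant association as it might arrive from a source (hypothetical)."""
    source: str                       # e.g. "OncoKB"
    gene_symbol: str
    variant_label: str                # often only a protein-level name
    association_type: AssociationType
    origin: VariantOrigin = VariantOrigin.UNKNOWN
    variant_ids: List[str] = field(default_factory=list)  # ClinVar/dbSNP ids, if any
    evidence: List[str] = field(default_factory=list)     # e.g. publication ids

    def has_registered_id(self) -> bool:
        """True if the source supplied at least one registered identifier."""
        return bool(self.variant_ids)


if __name__ == "__main__":
    # "GENE1" is a placeholder symbol, not real data.
    record = VariantAssociation(
        source="OncoKB",
        gene_symbol="GENE1",
        variant_label="A423P",
        association_type=AssociationType.PREDICTIVE,
    )
    print(record.has_registered_id(), record.origin.value)
```

Keeping identifiers and evidence as optional lists reflects the point that many sources supply only a label and no registered identifier.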

Cancer Variant Data Sources

ClinVar - https://www.ncbi.nlm.nih.gov/clinvar/
CIViC - https://civic.genome.wustl.edu/home
OncoKB Precision Oncology Knowledge Base - http://oncokb.org/#/DoCM
Jackson Lab Clinical Knowledge Base - https://www.jax.org/clinical-genomics/clinical-offerings/ckb
Cancer Genome Interpreter - https://www.cancergenomeinterpreter.org/home
Cancer Driver Log (CanDL) - https://candl.osu.edu/
Precision Medicine KnowledgeBase (PMKB) - https://pmkb.weill.cornell.edu/
MyCancerGenome - https://www.mycancergenome.org/
COSMIC - http://cancer.sanger.ac.uk/cell_lines
The Cancer Genome Atlas (TCGA) - https://cancergenome.nih.gov/
International Cancer Genome Consortium (ICGC) - http://icgc.org/
Drug-Gene Interaction Database (DGIdb) - http://dgidb.genome.wustl.edu/
cBioPortal (integrates various resources, including views on many of the above) - http://www.cbioportal.org/
MolecularMatch (commercial - not sure what is open access for research users) - https://www.molecularmatch.com/

Next Steps

Start to ingest from known sources such as ClinVar, CIViC and OncoKB - using data here to inform initial data models and scripts.
Perform a high-level landscape analysis of the sources above, and document things like the data types each has to offer, the standards and identifier systems they use, terms of licensing/re-use, data currency, and access mechanisms. I can start this work in a Google spreadsheet where anyone can view or contribute.
Make a plan to coordinate data exploration, modeling, and parsing/ingest tasks with the Wikidata team.
Develop use cases and competency questions for this data, with the goal of demonstrating what value is added by integrating this knowledge and adding ontological support for query and inference.

NCATS-Tangerine / ncats-ingest