AACR (American Association for Cancer Research) Project GENIE (Genomics Evidence Neoplasia Information Exchange)
Clinical sequencing data from 19 cancer centers worldwide, containing mutation data and some copy number alteration data from primary and metastatic tumors, pre- and post-treatment
Background biological data
networks
3D protein structure
Curated effect and therapy implications
OncoKB (Precision Oncology Knowledge Base)
CIViC
My Cancer Genome
Predicted functional effect
mutationassessor.org
PolyPhen-2
Variant recurrence
COSMIC
Cancer Hotspots
Available data types
Omic data
non-synonymous mutations
fusions
DNA copy-number data (putative, discrete values per gene, e.g. "deeply deleted" or "amplified", as well as log2 or linear copy number data)
mRNA and microRNA expression data
protein-level and phosphoprotein level data (RPPA or mass spectrometry based)
DNA methylation data
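The discrete per-gene copy-number values mentioned above are usually stored as small integers. A minimal sketch of turning them into readable labels follows. The -2 to 2 coding is the GISTIC-style convention; the label strings and the `describe_cna` helper are illustrative assumptions and should be checked against each study's metadata.

```python
# Conventional discrete copy-number codes (GISTIC-style).
# The label wording is an assumption; verify it per study.
CNA_LABELS = {
    -2: "deep deletion",
    -1: "shallow deletion",
    0: "diploid",
    1: "gain",
    2: "amplification",
}


def describe_cna(value):
    """Return a readable label for a discrete copy-number value.

    Accepts ints or integer-like strings; anything else (e.g. "NA")
    is reported as "unknown".
    """
    try:
        return CNA_LABELS[int(value)]
    except (KeyError, ValueError, TypeError):
        return "unknown"
```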
De-identified Clinical data
treatments
survival
Available cancer studies
Adrenal Gland Cancer (DOID:3953)
Brain Cancer (DOID:1319)
Bone Cancer (DOID:184)
Breast Cancer (DOID:1612)
Cardiovascular Cancer (DOID:176)
Cell Type Cancer (DOID:0050687)
Cervical Cancer (DOID:4362)
Colorectal Cancer (DOID:9256)
Endocrine Organ Benign Neoplasm (DOID:0060089)
Esophageal Cancer (DOID:5041)
Gastroesophageal Cancer (DOID:0080374)
Gastrointestinal System Benign Neoplasm (DOID:0050624)
Head and Neck Cancer (DOID:11934)
Hematologic Cancer (DOID:2531)
Hepatobiliary System Cancer (DOID:0080355)
Intestinal Cancer (DOID:10155)
Kidney Cancer (DOID:263)
Liver Cancer (DOID:3571)
Lung Cancer (DOID:1324)
Musculoskeletal System Cancer (DOID:0060100)
Nervous System Benign Neoplasm (DOID:0060115)
Oral Cavity Cancer (DOID:8618)
Ovarian Cancer (DOID:2394)
Pancreatic Cancer (DOID:1793)
Peripheral Nervous System Neoplasm (DOID:1192)
Prostate Cancer (DOID:10283)
Sensory System Cancer (DOID:0060116)
Skin Cancer (DOID:4159)
Stomach Cancer (DOID:10534)
Testicular Cancer (DOID:2998)
Thoracic Cancer (DOID:5093)
Thyroid Gland Cancer (DOID:1781)
Urinary Bladder Cancer (DOID:11054)
Uterine Cancer (DOID:363)
License
The cBioPortal software is available under an open source license via GitHub (ref).
Datasets Page
A zip file for each study on cbioportal.org can be downloaded from the Datasets Page. One can also use the R client cBioPortalData to download all of these files programmatically.
Datahub
The files for each study are also available from our datahub repository, which is essentially the extracted version of the zip files on the Datasets Page. Note that this is a git LFS repo, so if you are familiar with git you might prefer this option.
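Once the datahub repository has been cloned and its LFS content pulled (otherwise the data files are only small pointer files), the per-study mutation files can be merged into one table. The sketch below assumes one folder per study and tab-separated mutation files named `data_mutations.txt` or `data_mutations_extended.txt`; both the layout and the file names are assumptions to verify against the actual clone, and `STUDY_ID` is a column added here for bookkeeping.

```python
import csv
from pathlib import Path

# Assumed file names; check what each study folder actually contains.
MUTATION_FILE_NAMES = ("data_mutations.txt", "data_mutations_extended.txt")


def iter_mutation_rows(datahub_root):
    """Yield one dict per mutation row, tagged with the study folder name."""
    for study_dir in sorted(Path(datahub_root).iterdir()):
        if not study_dir.is_dir():
            continue
        for name in MUTATION_FILE_NAMES:
            path = study_dir / name
            if not path.is_file():
                continue
            with path.open(newline="", encoding="utf-8") as handle:
                # Skip comment lines that start with '#'.
                lines = (line for line in handle if not line.startswith("#"))
                for row in csv.DictReader(lines, delimiter="\t"):
                    row["STUDY_ID"] = study_dir.name
                    yield row
            break  # use only the first matching file per study


def compile_mutations(datahub_root, out_path):
    """Write all rows to one tab-separated file; return the number of rows."""
    rows = list(iter_mutation_rows(datahub_root))
    # Union of headers across studies, in first-seen order.
    fieldnames = list(
        dict.fromkeys(key for row in rows for key in row if key is not None)
    )
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=fieldnames,
            delimiter="\t",
            restval="",
            extrasaction="ignore",
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

This version keeps every row in memory, which is fine for a handful of studies. For the whole repository a two-pass approach (collect headers first, then stream rows) would likely be needed.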
API and API Clients
Besides downloading all the study data, one can also request slices of the data through the REST API (ref). A slice could be, e.g., "give me all the mutation data for one patient" or "get me all EGFR mutations for a particular group of samples". API clients are available in a variety of languages, including bash, R and Python. See the API documentation for more information.
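As an illustration of the "all EGFR mutations for a group of samples" slice, here is a minimal sketch using only the Python standard library. The base URL, the `molecular-profiles/{id}/mutations` path, the `sampleListId` and `projection` parameters, and the `entrezGeneId` field are assumptions based on the public instance and should be confirmed in the API documentation. The profile and sample-list IDs in the usage note are hypothetical placeholders.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://www.cbioportal.org/api"  # assumed public instance
EGFR_ENTREZ_ID = 1956  # Entrez Gene ID for EGFR


def mutations_url(molecular_profile_id, sample_list_id, projection="SUMMARY"):
    """Build the URL for all mutations of one profile in one sample list."""
    profile = urllib.parse.quote(molecular_profile_id, safe="")
    query = urllib.parse.urlencode(
        {"sampleListId": sample_list_id, "projection": projection}
    )
    return f"{API_BASE}/molecular-profiles/{profile}/mutations?{query}"


def fetch_json(url, timeout=30):
    """GET a URL and decode the JSON body (performs a network request)."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def filter_by_entrez_gene(mutations, entrez_gene_id):
    """Keep only the mutation records for one gene."""
    return [m for m in mutations if m.get("entrezGeneId") == entrez_gene_id]
```

Usage would look like `filter_by_entrez_gene(fetch_json(mutations_url("example_study_mutations", "example_study_all")), EGFR_ENTREZ_ID)`, with the two IDs replaced by real ones taken from the study of interest.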
[ ] Compile all cBioPortal mutation files into one file or find a file that compiles all studies.
[ ] Create an Excel sheet with all headers from the cBioPortal mutation data, with 4-5 example values for each. Compare them with the BioMuta headers (ref. slides). We want to see as many headers as possible, as in BioMuta.
[ ] Make note of differences, e.g. additional mapping is needed from chromosome to protein position, or cBioPortal mentions the gene name but not the UniProt ID... Pay attention to mutation type: we are only interested in non-synonymous mutations (those that lead to a change in amino acid or to a stop codon). Also look out for info on how the frequency is calculated (patient or allelic). cBioPortal gives you clinical data: # of patients, # of samples, and cancer type. Based on that, we want to calculate the patient frequency, i.e. how many patients have the same mutation for a given cancer. E.g. for lung cancer and a mutation at pos 119 in EGFR: how many patients have this mutation in EGFR? We need to see whether this info is available or not.
[ ] Cancer types listed are based on Disease Ontology: skin cancer = melanoma… same disease, different name. We want to standardize cancer names using the Cancer slim of the Disease Ontology (DO) (see paper).
[ ] Does cBioPortal have more data than BigQuery + dbGaP?
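The patient-frequency calculation described in the to-do list can be sketched as below. The MAF-style column names (`Hugo_Symbol`, `Variant_Classification`, `Tumor_Sample_Barcode`, `HGVSp_Short`), the set of classifications counted as non-synonymous, and the two lookup dictionaries are assumptions for illustration, to be adjusted once the real headers have been reviewed. Patients are counted once even when several of their samples carry the same mutation.

```python
from collections import defaultdict

# Assumed MAF-style classifications treated as non-synonymous
# (amino-acid change or stop codon); adjust after reviewing the data.
NON_SYNONYMOUS = {"Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation"}


def patient_frequency(mutations, sample_to_patient, patient_to_cancer):
    """Return {(cancer, gene, protein_change): (carriers, total, fraction)}.

    carriers = distinct patients of that cancer type with the mutation
    total    = all patients of that cancer type in patient_to_cancer
    """
    patients_per_cancer = defaultdict(set)
    for patient, cancer in patient_to_cancer.items():
        patients_per_cancer[cancer].add(patient)

    carriers = defaultdict(set)
    for row in mutations:
        if row["Variant_Classification"] not in NON_SYNONYMOUS:
            continue
        patient = sample_to_patient.get(row["Tumor_Sample_Barcode"])
        if patient is None or patient not in patient_to_cancer:
            continue  # sample without usable clinical data
        key = (patient_to_cancer[patient], row["Hugo_Symbol"], row["HGVSp_Short"])
        carriers[key].add(patient)

    result = {}
    for key, patients in carriers.items():
        total = len(patients_per_cancer[key[0]])
        result[key] = (len(patients), total, len(patients) / total)
    return result
```

The denominator here is every patient of the cancer type. If studies use different gene panels, restricting it to patients actually profiled for the gene may be more appropriate.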