Closed by cgrisdale 2 years ago
MOAlmanac terms of use, including link to license: https://moalmanac.org/terms
Database content license is here: https://github.com/vanallenlab/moalmanac-db/blob/main/LICENSE
Initial exploration: it looks like releases can be downloaded as a zip of TSV files.
rows per file
aneuploidy.tsv 1
copy_number.tsv 85
germline_variant.tsv 102
knockdown.tsv 9
microsatellite_stability.tsv 5
mutational_burden.tsv 6
mutational_signature.tsv 17
neoantigen_burden.tsv 1
rearrangement.tsv 70
silencing.tsv 1
somatic_variant.tsv 523
Columns (frequency across files)
{'event': 1, 'disease': 11, 'context': 11, 'oncotree_term': 11, 'oncotree_code': 11, 'therapy_name': 11, 'therapy_strategy': 11, 'therapy_type': 11, 'therapy_sensitivity': 11, 'therapy_resistance': 11, 'favorable_prognosis': 11, 'predictive_implication': 11, 'description': 11, 'preferred_assertion': 11, 'source_type': 11, 'citation': 11, 'url': 11, 'doi': 11, 'pmid': 11, 'nct': 11, 'last_updated': 11, 'gene': 5, 'direction': 1, 'cytoband': 1, 'chromosome': 2, 'start_position': 2, 'end_position': 2, 'reference_allele': 2, 'alternate_allele': 2, 'cdna_change': 2, 'protein_change': 2, 'variant_annotation': 2, 'exon': 2, 'rsid': 2, 'pathogenic': 1, 'adverse_event_risk': 1, 'technique': 2, 'status': 1, 'classification': 2, 'minimum_mutations': 1, 'mutations_per_mb': 1, 'cosmic_signature_number': 1, 'cosmic_signature_version': 1, 'minimum_neoantigens': 1, 'gene1': 1, 'gene2': 1, 'rearrangement_type': 1, 'locus': 1}
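The two summaries above (rows per file, and how many files each column appears in) could be reproduced with something like the following. This is a minimal sketch using only the standard library, not the script that produced the output above; the function names `load_tsv_dir` and `summarize` are hypothetical, and it assumes the release zip has already been extracted into a single directory of `.tsv` files with a header row.

```python
import csv
from collections import Counter
from pathlib import Path


def load_tsv_dir(directory):
    """Read every *.tsv file in `directory` into (column names, row dicts)."""
    tables = {}
    for path in sorted(Path(directory).glob("*.tsv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            # fieldnames is read from the header row before the data rows
            tables[path.name] = (reader.fieldnames or [], list(reader))
    return tables


def summarize(tables):
    """Return (rows per file, number of files each column appears in)."""
    rows_per_file = {name: len(rows) for name, (_, rows) in tables.items()}
    column_freq = Counter()
    for columns, _ in tables.values():
        column_freq.update(columns)
    return rows_per_file, dict(column_freq)
```

Calling `summarize(load_tsv_dir("path/to/extracted/release"))` (the path is a placeholder) returns both dictionaries in one pass.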
Unique values for columns with fewer than 25 distinct values
event ['Whole genome doubling', None]
therapy_type [nan, 'Hormone therapy', 'Radiation therapy', 'Immunotherapy', 'Targeted therapy', 'Combination therapy', 'Chemotherapy']
therapy_sensitivity [nan, 1.0, 0.0]
therapy_resistance [nan, 1.0]
favorable_prognosis [0.0, nan, 1.0]
predictive_implication ['Inferential', 'Guideline', 'Preclinical', 'Clinical evidence', 'Clinical trial', 'FDA-Approved']
preferred_assertion [nan, 1.0]
source_type ['Journal', 'Guideline', 'FDA', 'Abstract']
filename ['aneuploidy.tsv', 'copy_number.tsv', 'germline_variant.tsv', 'knockdown.tsv', 'microsatellite_stability.tsv', 'mutational_burden.tsv', 'mutational_signature.tsv', 'neoantigen_burden.tsv', 'rearrangement.tsv', 'silencing.tsv', 'somatic_variant.tsv']
direction [None, 'Amplification', 'Deletion']
cytoband [None, nan, '20q13.2', '11p13', '19q12', '1p32.3', '1p', '22q11.21', '8q24', '13', '17p13', '20q11']
chromosome [None, nan, 11.0, 19.0, 22.0, 14.0, 7.0, '9', '14', '2', 'X', '11', '7', '1', '4', '13', '10', '15', '19', '12', '22', '3', '18', '17']
reference_allele [None, nan, 'T', 'G', '-', 'C', 'TTA', 'A', 'AC']
alternate_allele [None, nan, 'G', 'A', 'C', '-', 'T', 'TT']
variant_annotation [None, nan, 'Nonsense', 'Frameshift', 'Missense', 'Splice Site', 'Oncogenic Mutations', 'Insertion', 'Deletion', 'Activating mutation']
exon [nan, 2.0, 15.0, 10.0, 3.0, 17.0, 19.0, 13.0, 7.0, 5.0, 6.0, 4.0, 23.0, 25.0, 8.0, 20.0, 21.0, 16.0, 18.0, 14.0, 11.0, 9.0, 12.0]
pathogenic [nan, 1.0]
adverse_event_risk [nan, 1.0]
technique [None, 'shRNA', 'CRISPR-Cas9', 'siRNA', 'CRSPR-Cas9']
status [None, 'MSI-High']
classification [None, 'High']
minimum_mutations [nan, 178.0, 100.0]
mutations_per_mb [nan, 10.0]
cosmic_signature_number [None, 10, 2, 3, 4, 5]
cosmic_signature_version [None, 2]
minimum_neoantigens [None, nan]
gene1 [None, 'BCR', 'ALK', 'BRD4', 'CCND1', 'CCND3', 'COL1A1', 'EML4', 'ESRP1', 'EWSR1', 'FGFR2', 'FGFR3', 'IGH', 'NTRK1', 'NTRK2', 'NTRK3', 'PDGFRA', 'FIP1L1', 'PDGFRB', 'RET', 'ROS1', 'RUNX1', 'SLC45A3', 'TMPRSS2']
gene2 [None, 'ABL1', nan, 'PDGFB', 'ALK', 'RAF1', 'FLI1', 'TACC3', 'NSD2', 'PDGFRA', 'RUNX1T1', 'BRAF', 'ERG']
rearrangement_type [None, 'Fusion', nan, 'Translocation']
locus [None, nan, 't(15;19)', 't(11;14)', 't(6;14)', 't(11;14)(q13;q32)', 't(4;14)(q16;q32)', '5q31-33', '4p12']
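A listing like the one above can be generated by collecting the distinct values per column and keeping only the low-cardinality ones. The sketch below is an assumption about how to do that with plain row dictionaries (for example from `csv.DictReader`); the helper name `low_cardinality_values` is hypothetical. It treats empty strings as `None`, so it will not reproduce the `None` versus `nan` distinction seen in the output above.

```python
def low_cardinality_values(rows, max_unique=25):
    """Map each column with fewer than `max_unique` distinct values to those values.

    `rows` is an iterable of dicts; values are kept in first-seen order.
    """
    seen = {}
    for row in rows:
        for column, value in row.items():
            if value == "":
                value = None  # assumption: an empty TSV cell means "missing"
            # a dict is used as an insertion-ordered set
            seen.setdefault(column, {})[value] = None
    return {
        column: list(values)
        for column, values in seen.items()
        if len(values) < max_unique
    }
```

Columns such as `description` or `citation` drop out automatically because they exceed the threshold.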
Seems like a lot of the columns are 0/1 boolean-like values
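One way to check this systematically is to flag columns whose non-empty values are all 0/1 tokens. This is only a heuristic sketch: the helper name `boolean_like_columns` and the accepted token set are assumptions, and a numeric column that happens to contain only 0 or 1 would also be flagged.

```python
# String forms a 0/1 flag might take in the TSV files (assumed, not confirmed)
BOOLEAN_TOKENS = {"0", "1", "0.0", "1.0"}


def boolean_like_columns(rows):
    """Return the sorted names of columns whose non-empty values are all 0/1."""
    observed = {}
    for row in rows:
        for column, value in row.items():
            if value in ("", None):
                continue  # missing cells do not count for or against
            observed.setdefault(column, set()).add(str(value))
    return sorted(
        column for column, values in observed.items() if values <= BOOLEAN_TOKENS
    )
```

Columns that are empty in every row are never flagged, since there is nothing to judge them on.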
Introduces the interesting idea of splitting therapy and therapy strategy, which might be helpful for some of the therapeutic table organization problems
therapy_strategy freq
52 PARP inhibition 77
56 PI3K/AKT/mTOR inhibition 56
53 PD-1/PD-L1 inhibition 50
11 BCR-ABL inhibition 47
24 EGFR inhibition 46
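A frequency table like the one above can be computed with `collections.Counter`. This is a small sketch over row dictionaries, not the original analysis code; the helper name `value_counts` is hypothetical and rows with an empty or missing value are skipped.

```python
from collections import Counter


def value_counts(rows, column, top=5):
    """Return the `top` most frequent non-empty values of `column` with counts."""
    counts = Counter(row[column] for row in rows if row.get(column))
    return counts.most_common(top)
```

For example, `value_counts(all_rows, "therapy_strategy")` would list the five most common strategies, where `all_rows` is assumed to be the concatenated rows from every file.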
@cgrisdale can you double check my mappings make sense? I have included a summary of them in the README.md (see PR #62)
On the last test load I did against graphkbdev there was a success rate of 717/820 records (87%)
The most common errors were:
2 error: Failed to create the combination therapy (Neoadjuvant chemotherapy + surgery)
2 Error: missing Disease record where {"AND":[{"name":"Fallopian Tube"},{"sourceId":"HGSFT"},{"source":{"filters":{"name":"oncotree"},"target":"Source"}}]}
2 Error: missing Disease record where {"AND":[{"name":"Metastatic Breast Cancer"},{"sourceId":"MBC"},{"source":{"filters":{"name":"oncotree"},"target":"Source"}}]}
2 Error: missing Therapy record where {"OR":[{"sourceId":"GANT61"},{"name":"GANT61"}]}
2 Error: missing Therapy record where {"OR":[{"sourceId":"Neoadjuvant chemoradiation"},{"name":"Neoadjuvant chemoradiation"}]}
2 Error: missing Therapy record where {"OR":[{"sourceId":"Neoadjuvant chemotherapy"},{"name":"Neoadjuvant chemotherapy"}]}
2 error: missing Therapy record where {"OR":[{"sourceId":"Neoadjuvant chemotherapy"},{"name":"Neoadjuvant chemotherapy"}]}
3 Error: Spec Validation failed for undefined #.features[0].attributes[0].feature_type should be equal to constant found '["somatic_variant"]'
4 Error: statement has no relevance
4 Error: unsupported notation type: 'GT>AA'
5 Error: missing Disease record where {"AND":[{"name":"Glioma"},{"sourceId":"GNOS"},{"source":{"filters":{"name":"oncotree"},"target":"Source"}}]}
5 Error: Unexpected variant configuration: mutational_burden
6 Error: missing Disease record where {"AND":[{"name":"Renal Clear Cell Carcinoma"},{"sourceId":"RCC"},{"source":{"filters":{"name":"oncotree"},"target":"Source"}}]}
7 Error: Spec Validation failed for undefined #.features[0].attributes[0].rearrangement_type should be equal to one of the allowed values found '["null"]'
20 Error: disease not given
20 Error: missing Disease record where {"AND":[{"name":"Any solid tumor"},{"source":{"filters":{"name":"oncotree"},"target":"Source"}}]}
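Since some messages in the list above differ only in the case of the `Error:` prefix, it can help to merge them before ranking. The following is a rough sketch for summary lines in the `<count> Error: <message>` shape shown above; the names `tally_errors` and `success_rate` are hypothetical and this is not part of the loader itself.

```python
import re
from collections import Counter

# "<count> Error: <message>", with the prefix matched case-insensitively
SUMMARY_LINE = re.compile(r"^\s*(\d+)\s+error:\s*(.*)$", re.IGNORECASE)


def tally_errors(lines):
    """Sum counts per message, merging 'Error:' and 'error:' variants."""
    totals = Counter()
    for line in lines:
        match = SUMMARY_LINE.match(line)
        if match:  # lines in any other shape are ignored
            totals[match.group(2).strip()] += int(match.group(1))
    return totals


def success_rate(loaded, total):
    """Percentage of records loaded successfully."""
    return 100 * loaded / total
```

With the numbers from the test load, `round(success_rate(717, 820))` gives 87, matching the figure quoted above.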
For the most part these are missing diseases/drugs, which we can deal with the same way we do for the CIViC ones.
For this one
5 Error: Unexpected variant configuration: mutational_burden
I am not sure what to do with the mutational_burden variants for now, so I have left them out. That is because they don't specify high/low, as far as I can tell.
7 Error: Spec Validation failed for undefined #.features[0].attributes[0].rearrangement_type should be equal to one of the allowed values found '["null"]'
This one is the result of missing information. I'm not sure if we want to load just rearrangement types without any further specification?
3 Error: Spec Validation failed for undefined #.features[0].attributes[0].feature_type should be equal to constant found '["somatic_variant"]'
This one is the result of errors in their data
MOAlmanac is a cancer variant knowledgebase from the Broad Institute/Dana-Farber with 820 assertions.
The API documentation is here: https://app.swaggerhub.com/apis-docs/vanallenlab/almanac-browser/0.2#/
Content can be browsed here: https://moalmanac.org/