bcgsc / pori_graphkb_loader

The Loaders for GraphKB. Imports content from external sources via the GraphKB REST API
https://bcgsc.github.io/pori
GNU General Public License v3.0

Write loader for Molecular Oncology Almanac knowledgebase #38

Closed cgrisdale closed 2 years ago

cgrisdale commented 2 years ago

MOAlmanac is a cancer variant knowledgebase from BROAD/Dana-Farber with 820 assertions.

The API documentation is here: https://app.swaggerhub.com/apis-docs/vanallenlab/almanac-browser/0.2#/

Content can be browsed here: https://moalmanac.org/

cgrisdale commented 2 years ago

MOAlmanac terms of use, including link to license: https://moalmanac.org/terms

Database content license is here: https://github.com/vanallenlab/moalmanac-db/blob/main/LICENSE

creisle commented 2 years ago

Initial exploration: it looks like releases can be downloaded as a zip of TSV files.

Rows per file:

aneuploidy.tsv 1
copy_number.tsv 85
germline_variant.tsv 102
knockdown.tsv 9
microsatellite_stability.tsv 5
mutational_burden.tsv 6
mutational_signature.tsv 17
neoantigen_burden.tsv 1
rearrangement.tsv 70
silencing.tsv 1
somatic_variant.tsv 523
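
A minimal sketch of how per-file row counts like these could be produced, assuming the release zip has been extracted into a single directory of `.tsv` files that each start with a header row. The function name `count_rows_per_file` is hypothetical and not part of the loader.

```python
import csv
from pathlib import Path


def count_rows_per_file(release_dir):
    """Return {filename: number of data rows} for each .tsv file in release_dir.

    Assumes every file has exactly one header row, which is not counted.
    """
    counts = {}
    for path in sorted(Path(release_dir).glob("*.tsv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle, delimiter="\t")
            next(reader, None)  # skip the header row (no-op for an empty file)
            # ignore completely blank lines so a trailing newline is not counted
            counts[path.name] = sum(1 for row in reader if row)
    return counts
```

The counts listed above add up to 820, which matches the number of assertions quoted at the top of the issue, so a sum over the returned values is a quick sanity check on a new release.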

Columns (frequency across files)

{'event': 1, 'disease': 11, 'context': 11, 'oncotree_term': 11, 'oncotree_code': 11, 'therapy_name': 11, 'therapy_strategy': 11, 'therapy_type': 11, 'therapy_sensitivity': 11, 'therapy_resistance': 11, 'favorable_prognosis': 11, 'predictive_implication': 11, 'description': 11, 'preferred_assertion': 11, 'source_type': 11, 'citation': 11, 'url': 11, 'doi': 11, 'pmid': 11, 'nct': 11, 'last_updated': 11, 'gene': 5, 'direction': 1, 'cytoband': 1, 'chromosome': 2, 'start_position': 2, 'end_position': 2, 'reference_allele': 2, 'alternate_allele': 2, 'cdna_change': 2, 'protein_change': 2, 'variant_annotation': 2, 'exon': 2, 'rsid': 2, 'pathogenic': 1, 'adverse_event_risk': 1, 'technique': 2, 'status': 1, 'classification': 2, 'minimum_mutations': 1, 'mutations_per_mb': 1, 'cosmic_signature_number': 1, 'cosmic_signature_version': 1, 'minimum_neoantigens': 1, 'gene1': 1, 'gene2': 1, 'rearrangement_type': 1, 'locus': 1}
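
A small sketch of how a column-frequency summary like the one above could be computed, under the same assumption of a directory of extracted `.tsv` files. It only reads the header line of each file; `column_frequency` is a hypothetical helper name.

```python
import csv
from collections import Counter
from pathlib import Path


def column_frequency(release_dir):
    """Count in how many .tsv files each column name appears."""
    counts = Counter()
    for path in sorted(Path(release_dir).glob("*.tsv")):
        with path.open(newline="", encoding="utf-8") as handle:
            # the first row is assumed to be the header; an empty file gives []
            header = next(csv.reader(handle, delimiter="\t"), [])
        # a set guards against a column name repeated within one file
        counts.update(set(header))
    return dict(counts)
```

Columns whose count equals the number of files (11 here) are shared by every file, so they are natural candidates for a common mapping; the remaining columns are specific to one or a few variant types.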

Unique values in columns with fewer than 25 distinct values

event ['Whole genome doubling', None]
therapy_type [nan, 'Hormone therapy', 'Radiation therapy', 'Immunotherapy', 'Targeted therapy', 'Combination therapy', 'Chemotherapy']
therapy_sensitivity [nan, 1.0, 0.0]
therapy_resistance [nan, 1.0]
favorable_prognosis [0.0, nan, 1.0]
predictive_implication ['Inferential', 'Guideline', 'Preclinical', 'Clinical evidence', 'Clinical trial', 'FDA-Approved']
preferred_assertion [nan, 1.0]
source_type ['Journal', 'Guideline', 'FDA', 'Abstract']
filename ['aneuploidy.tsv', 'copy_number.tsv', 'germline_variant.tsv', 'knockdown.tsv', 'microsatellite_stability.tsv', 'mutational_burden.tsv', 'mutational_signature.tsv', 'neoantigen_burden.tsv', 'rearrangement.tsv', 'silencing.tsv', 'somatic_variant.tsv']
direction [None, 'Amplification', 'Deletion']
cytoband [None, nan, '20q13.2', '11p13', '19q12', '1p32.3', '1p', '22q11.21', '8q24', '13', '17p13', '20q11']
chromosome [None, nan, 11.0, 19.0, 22.0, 14.0, 7.0, '9', '14', '2', 'X', '11', '7', '1', '4', '13', '10', '15', '19', '12', '22', '3', '18', '17']
reference_allele [None, nan, 'T', 'G', '-', 'C', 'TTA', 'A', 'AC']
alternate_allele [None, nan, 'G', 'A', 'C', '-', 'T', 'TT']
variant_annotation [None, nan, 'Nonsense', 'Frameshift', 'Missense', 'Splice Site', 'Oncogenic Mutations', 'Insertion', 'Deletion', 'Activating mutation']
exon [nan, 2.0, 15.0, 10.0, 3.0, 17.0, 19.0, 13.0, 7.0, 5.0, 6.0, 4.0, 23.0, 25.0, 8.0, 20.0, 21.0, 16.0, 18.0, 14.0, 11.0, 9.0, 12.0]
pathogenic [nan, 1.0]
adverse_event_risk [nan, 1.0]
technique [None, 'shRNA', 'CRISPR-Cas9', 'siRNA', 'CRSPR-Cas9']
status [None, 'MSI-High']
classification [None, 'High']
minimum_mutations [nan, 178.0, 100.0]
mutations_per_mb [nan, 10.0]
cosmic_signature_number [None, 10, 2, 3, 4, 5]
cosmic_signature_version [None, 2]
minimum_neoantigens [None, nan]
gene1 [None, 'BCR', 'ALK', 'BRD4', 'CCND1', 'CCND3', 'COL1A1', 'EML4', 'ESRP1', 'EWSR1', 'FGFR2', 'FGFR3', 'IGH', 'NTRK1', 'NTRK2', 'NTRK3', 'PDGFRA', 'FIP1L1', 'PDGFRB', 'RET', 'ROS1', 'RUNX1', 'SLC45A3', 'TMPRSS2']
gene2 [None, 'ABL1', nan, 'PDGFB', 'ALK', 'RAF1', 'FLI1', 'TACC3', 'NSD2', 'PDGFRA', 'RUNX1T1', 'BRAF', 'ERG']
rearrangement_type [None, 'Fusion', nan, 'Translocation']
locus [None, nan, 't(15;19)', 't(11;14)', 't(6;14)', 't(11;14)(q13;q32)', 't(4;14)(q16;q32)', '5q31-33', '4p12']
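
A hedged sketch of one way to list the distinct values of low-cardinality columns. The helper names `load_rows` and `low_cardinality_values` are hypothetical, and the added `filename` key mirrors the `filename` column in the listing above. Every cell is read as a string; this is a deliberate difference from the listing, where `chromosome` shows both `11.0` and `'11'`, presumably because each file was parsed with its own inferred types.

```python
import csv
from pathlib import Path


def load_rows(release_dir):
    """Yield every row of every .tsv file as a dict of strings, tagged with its filename."""
    for path in sorted(Path(release_dir).glob("*.tsv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                row["filename"] = path.name
                yield row


def low_cardinality_values(rows, max_unique=25):
    """Return {column: distinct values in first-seen order} for columns
    with fewer than max_unique distinct values."""
    seen = {}
    for row in rows:
        for column, value in row.items():
            if column is None:
                continue  # DictReader stores surplus cells under the key None
            # a dict is used as an insertion-ordered set
            seen.setdefault(column, {})[value] = None
    return {
        column: list(values)
        for column, values in seen.items()
        if len(values) < max_unique
    }
```

With this approach a missing value shows up as an empty string rather than `nan`, so the output will not match the listing character for character.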

It seems that many of the columns hold 0/1 boolean-like values.
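
As an illustration of handling such columns, here is a minimal sketch that maps a 0/1-style cell to `True`, `False`, or `None` when the value is missing. The name `parse_flag` is hypothetical, and treating an empty cell or `nan` as "not stated" rather than as `False` is an assumption about what the data means.

```python
def parse_flag(value):
    """Convert a 0/1-style cell (e.g. '1', 1.0, '0.0') to bool, or None if missing."""
    if value is None:
        return None
    text = str(value).strip().lower()
    if text in ("", "nan", "none"):
        return None  # missing value: neither asserted nor denied
    number = float(text)  # raises ValueError for non-numeric text
    if number == 1:
        return True
    if number == 0:
        return False
    raise ValueError(f"expected a 0/1 flag, got {value!r}")
```

Keeping the three states apart matters for columns such as `therapy_resistance`, where only `nan` and `1.0` appear in the listing above and an absent value may not mean "not resistant".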

creisle commented 2 years ago

Introduces the interesting idea of splitting therapy and therapy strategy, which might be helpful for some of the therapeutic table organization problems

            therapy_strategy  freq
52           PARP inhibition    77
56  PI3K/AKT/mTOR inhibition    56
53     PD-1/PD-L1 inhibition    50
11        BCR-ABL inhibition    47
24           EGFR inhibition    46
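
A rough sketch of how the therapy/strategy split could be summarized: for each `therapy_strategy`, count the rows and collect the distinct `therapy_name` values that fall under it. It assumes the rows are already loaded as dicts keyed by those two column names; `summarize_strategies` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_strategies(rows):
    """Return [(strategy, row count, sorted therapy names)], most frequent first."""
    freq = defaultdict(int)
    therapies = defaultdict(set)
    for row in rows:
        strategy = (row.get("therapy_strategy") or "").strip()
        if not strategy:
            continue  # rows without a strategy are left out of the summary
        freq[strategy] += 1
        name = (row.get("therapy_name") or "").strip()
        if name:
            therapies[strategy].add(name)
    # sort by descending frequency, then alphabetically for stable ties
    ordered = sorted(freq, key=lambda strategy: (-freq[strategy], strategy))
    return [(s, freq[s], sorted(therapies[s])) for s in ordered]
```

Grouping therapy names under a strategy in this way gives a two-level view that could be compared against how therapies are currently organized.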

creisle commented 2 years ago

@cgrisdale can you double-check that my mappings make sense? I have included a summary of them in the README.md (see PR #62)

On the last test load I did against graphkbdev there was a success rate of 717/820 records (87%)

The most common errors were:

      2 error: Failed to create the combination therapy (Neoadjuvant chemotherapy + surgery)
      2 Error: missing Disease record where {"AND":[{"name":"Fallopian Tube"},{"sourceId":"HGSFT"},{"source":{"filters":{"name":"oncotree"},"target":"Source"}}]}
      2 Error: missing Disease record where {"AND":[{"name":"Metastatic Breast Cancer"},{"sourceId":"MBC"},{"source":{"filters":{"name":"oncotree"},"target":"Source"}}]}
      2 Error: missing Therapy record where {"OR":[{"sourceId":"GANT61"},{"name":"GANT61"}]}
      2 Error: missing Therapy record where {"OR":[{"sourceId":"Neoadjuvant chemoradiation"},{"name":"Neoadjuvant chemoradiation"}]}
      2 Error: missing Therapy record where {"OR":[{"sourceId":"Neoadjuvant chemotherapy"},{"name":"Neoadjuvant chemotherapy"}]}
      2 error: missing Therapy record where {"OR":[{"sourceId":"Neoadjuvant chemotherapy"},{"name":"Neoadjuvant chemotherapy"}]}
      3 Error: Spec Validation failed for undefined #.features[0].attributes[0].feature_type should be equal to constant found '["somatic_variant"]'
      4 Error: statement has no relevance
      4 Error: unsupported notation type: 'GT>AA'
      5 Error: missing Disease record where {"AND":[{"name":"Glioma"},{"sourceId":"GNOS"},{"source":{"filters":{"name":"oncotree"},"target":"Source"}}]}
      5 Error: Unexpected variant configuration: mutational_burden
      6 Error: missing Disease record where {"AND":[{"name":"Renal Clear Cell Carcinoma"},{"sourceId":"RCC"},{"source":{"filters":{"name":"oncotree"},"target":"Source"}}]}
      7 Error: Spec Validation failed for undefined #.features[0].attributes[0].rearrangement_type should be equal to one of the allowed values found '["null"]'
     20 Error: disease not given
     20 Error: missing Disease record where {"AND":[{"name":"Any solid tumor"},{"source":{"filters":{"name":"oncotree"},"target":"Source"}}]}

For the most part these are missing diseases/drugs, which we can deal with the same way we do for the CIViC ones.
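
In the tally above, the "missing Therapy record" message for "Neoadjuvant chemotherapy" appears twice, once as `Error:` and once as `error:`, so the same failure is split across two lines. A minimal sketch of a tally that normalizes the prefix before counting, assuming the loader log has one message per line and that failures start with `error:` in either casing (`tally_errors` is a hypothetical helper):

```python
from collections import Counter


def tally_errors(lines):
    """Count error messages, treating 'error:' and 'Error:' prefixes as the same.

    Returns [(message, count)] sorted by ascending count, like `sort | uniq -c | sort -n`.
    """
    counts = Counter()
    prefix = "error:"
    for line in lines:
        message = line.strip()
        if not message.lower().startswith(prefix):
            continue  # skip non-error log lines
        # rewrite the prefix in one canonical casing, keep the rest untouched
        counts["Error:" + message[len(prefix):]] += 1
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))
```

Only the prefix is normalized here; the rest of each message is compared exactly, so records that differ in the missing name or `sourceId` still count as separate errors.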

For this one

      5 Error: Unexpected variant configuration: mutational_burden

I am not sure what to do with the mutational_burden variants for now, so I have left them out. That is because they don't specify high/low, as far as I can tell.

      7 Error: Spec Validation failed for undefined #.features[0].attributes[0].rearrangement_type should be equal to one of the allowed values found '["null"]'

This one is the result of missing information. I am not sure if we want to load just rearrangement types without any further specification.

      3 Error: Spec Validation failed for undefined #.features[0].attributes[0].feature_type should be equal to constant found '["somatic_variant"]'

This one is the result of errors in their data