Closed jjgao closed 3 years ago
@rmadupuri it would also be useful to add a rule in the validator.
@ritikakundra @rmadupuri Query result update (cgds_public)
stes_tcga_pub_mutations 290
stad_tcga_mutations 234
cellline_ccle_broad_mutations 191
brca_tcga_pub2015_mutations 184
sclc_cancercell_gardner_2017_mutations 159
ucec_tcga_mutations 128
ucec_tcga_pub_mutations 128
stad_tcga_pan_can_atlas_2018_mutations 127
brca_tcga_mutations 109
kirc_tcga_mutations 105
kirc_tcga_pub_mutations 102
hnsc_tcga_mutations 95
esca_tcga_mutations 66
acc_tcga_mutations 62
ucec_tcga_pan_can_atlas_2018_mutations 62
laml_tcga_mutations 57
laml_tcga_pan_can_atlas_2018_mutations 57
laml_tcga_pub_mutations 57
msk_impact_2017_mutations 53
brca_tcga_pub_mutations 50
brca_tcga_pan_can_atlas_2018_mutations 47
prad_tcga_mutations 47
hnsc_tcga_pub_mutations 46
coadread_tcga_mutations 39
coadread_tcga_pub_mutations 39
ov_tcga_mutations 37
hnsc_tcga_pan_can_atlas_2018_mutations 34
ov_tcga_pub_mutations 33
esca_tcga_pan_can_atlas_2018_mutations 33
blca_tcga_mutations 32
kirc_tcga_pan_can_atlas_2018_mutations 30
blca_tcga_pub_mutations 29
nsclc_tcga_broad_2016_mutations 29
luad_tcga_pub_mutations 27
blca_tcga_pub_2017_mutations 26
skcm_tcga_mutations 25
stad_tcga_pub_mutations 25
luad_tcga_mutations 24
skcm_broad_brafresist_2012_mutations 23
prad_p1000_mutations 21
blca_tcga_pan_can_atlas_2018_mutations 20
coadread_tcga_pan_can_atlas_2018_mutations 18
sarc_tcga_mutations 17
luad_tcga_pan_can_atlas_2018_mutations 16
gbm_tcga_mutations 15
gbm_tcga_pub2013_mutations 15
mel_tsam_liang_2017_mutations 15
pcpg_tcga_mutations 15
kirp_tcga_mutations 12
prad_tcga_pan_can_atlas_2018_mutations 12
lgggbm_tcga_pub_mutations 12
sarc_tcga_pub_mutations 12
lihc_tcga_mutations 12
gbm_tcga_pan_can_atlas_2018_mutations 11
lung_msk_2017_mutations 11
lusc_tcga_mutations 10
lusc_tcga_pub_mutations 10
prad_tcga_pub_mutations 10
dlbc_tcga_mutations 10
summit_2018_mutations 9
ampca_bcm_2016_mutations 9
blca_bgi_mutations 9
sarc_tcga_pan_can_atlas_2018_mutations 9
cesc_tcga_mutations 9
tgct_tcga_mutations 8
hnsc_broad_mutations 7
lgg_tcga_mutations 7
mixed_pipseq_2017_mutations 7
coadread_dfci_2016_mutations 6
lgg_tcga_pan_can_atlas_2018_mutations 6
tmb_mskcc_2018_mutations 6
lihc_amc_prv_mutations 6
skcm_broad_mutations 6
kich_tcga_pub_mutations 6
es_dfarber_broad_2014_mutations 5
tgct_tcga_pan_can_atlas_2018_mutations 5
thca_tcga_mutations 5
lcll_broad_2013_mutations 5
blca_dfarber_mskcc_2014_mutations 5
thca_tcga_pub_mutations 5
pediatric_dkfz_2017_mutations 5
kich_tcga_mutations 5
brca_broad_mutations 5
prad_su2c_2015_mutations 5
ov_tcga_pan_can_atlas_2018_mutations 4
mbn_mdacc_2013_mutations 4
prad_broad_mutations 4
cscc_hgsc_bcm_2014_mutations 4
stad_pfizer_uhongkong_mutations 4
prad_fhcrc_mutations 4
ucs_tcga_mutations 4
ucs_tcga_pan_can_atlas_2018_mutations 4
prad_su2c_2019_mutations 3
wt_target_2018_pub_mutations 3
thym_tcga_mutations 3
mixed_allen_2018_mutations 3
prad_broad_2013_mutations 3
luad_broad_mutations 3
lung_msk_pdx_mutations 2
lusc_tcga_pan_can_atlas_2018_mutations 2
paad_qcmg_uq_2016_mutations 2
breast_msk_2018_mutations 2
mbl_broad_2012_mutations 2
paad_utsw_2015_mutations 2
bcc_unige_2016_mutations 2
thca_tcga_pan_can_atlas_2018_mutations 2
pcnsl_mayo_2015_mutations 2
kich_tcga_pan_can_atlas_2018_mutations 2
prad_eururol_2017_mutations 2
brca_mbcproject_wagle_2017_mutations 2
tet_nci_2014_mutations 1
mbl_sickkids_2016_mutations 1
past_dkfz_heidelberg_2013_mutations 1
hnsc_mdanderson_2013_mutations 1
blca_nmibc_2017_mutations 1
meso_tcga_mutations 1
ucec_msk_2018_mutations 1
lihc_tcga_pan_can_atlas_2018_mutations 1
mpnst_mskcc_mutations 1
desm_broad_2015_mutations 1
kirc_bgi_mutations 1
nbl_broad_2013_mutations 1
prad_mskcc_2017_mutations 1
brca_igr_2015_mutations 1
nsclc_pd1_msk_2018_mutations 1
escc_ucla_2014_mutations 1
vsc_cuk_2018_mutations 1
@yichaoS @rmadupuri if NA should be '-', maybe we can do this with a script? @yichaoS, can we add Variant_Classification and Variant_Type to the result to make sure these are indeed INS or DEL and not SNPs?
@ritika Just checked with Variant_Type; they are all INS.
(Results posted below.) Yeah, we can definitely use a script, just to replace 'NA' or '--' with '-' in all data files <- @rmadupuri
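A replacement script along these lines might look like the sketch below. It is only an illustration, not an existing datahub tool: the function name `normalize_ref_allele` is made up, and it assumes a tab-separated MAF with `Reference_Allele` and `Variant_Type` columns and optional `#` comment lines. Restricting the change to INS rows is also an assumption based on the query results in this thread.

```python
# Placeholder values seen in place of '-' for an insertion's reference allele.
BAD_REF_ALLELES = {"NA", "--"}


def normalize_ref_allele(maf_text: str) -> str:
    """Return MAF text with 'NA' or '--' replaced by '-' in Reference_Allele for INS rows."""
    out = []
    header = None
    ref_idx = type_idx = -1
    for line in maf_text.splitlines():
        # Pass comment lines (e.g. '#version 2.4') and blank lines through unchanged.
        if not line or line.startswith("#"):
            out.append(line)
            continue
        fields = line.split("\t")
        if header is None:
            # The first non-comment line is the column header.
            header = fields
            ref_idx = header.index("Reference_Allele")
            type_idx = header.index("Variant_Type")
            out.append(line)
            continue
        if fields[type_idx] == "INS" and fields[ref_idx] in BAD_REF_ALLELES:
            fields[ref_idx] = "-"
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"
```

Rows of any other variant type are left alone, so a SNP that happens to carry 'NA' would still need a manual look.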
stes_tcga_pub_mutations INS 290
stad_tcga_mutations INS 234
cellline_ccle_broad_mutations INS 191
brca_tcga_pub2015_mutations INS 184
sclc_cancercell_gardner_2017_mutations INS 159
ucec_tcga_mutations INS 128
ucec_tcga_pub_mutations INS 128
stad_tcga_pan_can_atlas_2018_mutations INS 127
brca_tcga_mutations INS 109
kirc_tcga_mutations INS 105
kirc_tcga_pub_mutations INS 102
hnsc_tcga_mutations INS 95
esca_tcga_mutations INS 66
acc_tcga_mutations INS 62
ucec_tcga_pan_can_atlas_2018_mutations INS 62
laml_tcga_mutations INS 57
laml_tcga_pan_can_atlas_2018_mutations INS 57
laml_tcga_pub_mutations INS 57
msk_impact_2017_mutations INS 53
brca_tcga_pub_mutations INS 50
brca_tcga_pan_can_atlas_2018_mutations INS 47
prad_tcga_mutations INS 47
hnsc_tcga_pub_mutations INS 46
coadread_tcga_pub_mutations INS 39
coadread_tcga_mutations INS 39
ov_tcga_mutations INS 37
hnsc_tcga_pan_can_atlas_2018_mutations INS 34
esca_tcga_pan_can_atlas_2018_mutations INS 33
ov_tcga_pub_mutations INS 33
blca_tcga_mutations INS 32
kirc_tcga_pan_can_atlas_2018_mutations INS 30
blca_tcga_pub_mutations INS 29
nsclc_tcga_broad_2016_mutations INS 29
luad_tcga_pub_mutations INS 27
blca_tcga_pub_2017_mutations INS 26
stad_tcga_pub_mutations INS 25
skcm_tcga_mutations INS 25
luad_tcga_mutations INS 24
skcm_broad_brafresist_2012_mutations INS 23
prad_p1000_mutations INS 21
blca_tcga_pan_can_atlas_2018_mutations INS 20
coadread_tcga_pan_can_atlas_2018_mutations INS 18
sarc_tcga_mutations INS 17
luad_tcga_pan_can_atlas_2018_mutations INS 16
mel_tsam_liang_2017_mutations INS 15
gbm_tcga_pub2013_mutations INS 15
pcpg_tcga_mutations INS 15
gbm_tcga_mutations INS 15
sarc_tcga_pub_mutations INS 12
prad_tcga_pan_can_atlas_2018_mutations INS 12
lihc_tcga_mutations INS 12
kirp_tcga_mutations INS 12
lgggbm_tcga_pub_mutations INS 12
gbm_tcga_pan_can_atlas_2018_mutations INS 11
lung_msk_2017_mutations INS 11
lusc_tcga_pub_mutations INS 10
prad_tcga_pub_mutations INS 10
dlbc_tcga_mutations INS 10
lusc_tcga_mutations INS 10
summit_2018_mutations INS 9
cesc_tcga_mutations INS 9
ampca_bcm_2016_mutations INS 9
blca_bgi_mutations INS 9
sarc_tcga_pan_can_atlas_2018_mutations INS 9
tgct_tcga_mutations INS 8
lgg_tcga_mutations INS 7
hnsc_broad_mutations INS 7
mixed_pipseq_2017_mutations INS 7
kich_tcga_pub_mutations INS 6
coadread_dfci_2016_mutations INS 6
skcm_broad_mutations INS 6
lgg_tcga_pan_can_atlas_2018_mutations INS 6
tmb_mskcc_2018_mutations INS 6
lihc_amc_prv_mutations INS 6
blca_dfarber_mskcc_2014_mutations INS 5
thca_tcga_mutations INS 5
lcll_broad_2013_mutations INS 5
brca_broad_mutations INS 5
prad_su2c_2015_mutations INS 5
thca_tcga_pub_mutations INS 5
kich_tcga_mutations INS 5
es_dfarber_broad_2014_mutations INS 5
tgct_tcga_pan_can_atlas_2018_mutations INS 5
pediatric_dkfz_2017_mutations INS 5
mbn_mdacc_2013_mutations INS 4
ucs_tcga_mutations INS 4
stad_pfizer_uhongkong_mutations INS 4
ucs_tcga_pan_can_atlas_2018_mutations INS 4
prad_broad_mutations INS 4
cscc_hgsc_bcm_2014_mutations INS 4
ov_tcga_pan_can_atlas_2018_mutations INS 4
prad_fhcrc_mutations INS 4
prad_broad_2013_mutations INS 3
luad_broad_mutations INS 3
mixed_allen_2018_mutations INS 3
wt_target_2018_pub_mutations INS 3
prad_su2c_2019_mutations INS 3
thym_tcga_mutations INS 3
paad_utsw_2015_mutations INS 2
lusc_tcga_pan_can_atlas_2018_mutations INS 2
thca_tcga_pan_can_atlas_2018_mutations INS 2
pcnsl_mayo_2015_mutations INS 2
mbl_broad_2012_mutations INS 2
bcc_unige_2016_mutations INS 2
prad_eururol_2017_mutations INS 2
lung_msk_pdx_mutations INS 2
brca_mbcproject_wagle_2017_mutations INS 2
paad_qcmg_uq_2016_mutations INS 2
breast_msk_2018_mutations INS 2
kich_tcga_pan_can_atlas_2018_mutations INS 2
prad_mskcc_2017_mutations INS 1
mpnst_mskcc_mutations INS 1
ucec_msk_2018_mutations INS 1
blca_nmibc_2017_mutations INS 1
past_dkfz_heidelberg_2013_mutations INS 1
kirc_bgi_mutations INS 1
lihc_tcga_pan_can_atlas_2018_mutations INS 1
tet_nci_2014_mutations INS 1
meso_tcga_mutations INS 1
escc_ucla_2014_mutations INS 1
vsc_cuk_2018_mutations INS 1
nbl_broad_2013_mutations INS 1
brca_igr_2015_mutations INS 1
desm_broad_2015_mutations INS 1
nsclc_pd1_msk_2018_mutations INS 1
hnsc_mdanderson_2013_mutations INS 1
mbl_sickkids_2016_mutations INS 1
After @rmadupuri checked the stes_tcga_pub MAF and verified that it looks correct (the reference allele is '-', not NA), I looked closer at the code.
What is happening is that during import of a MAF record, if a matching mutation event is found in the database, the mutation event is reused (by matching I mean an event with the same entrez, chr, start, stop, protein change, tumor seq allele, and mutation type).
This behavior is by design: mutation events are shared across MAFs. One side effect is that if an incorrect mutation event enters the database, it cannot be corrected by fixing a record in a single MAF and reimporting, because the event may be linked to MAFs across many studies. Unless all studies that contain the event are deleted and reimported, the mutation event has to be updated in the database directly.
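To make the reuse behavior concrete, here is a minimal sketch of it. It is not the importer's actual code: `EventKey` and `MutationEventStore` are hypothetical names, and the only detail taken from the description above is the list of fields used for matching. The reference allele is not among those fields, which is why a corrected MAF does not overwrite it.

```python
from typing import Dict, NamedTuple


class EventKey(NamedTuple):
    """Fields used to decide whether two MAF records describe the same event."""
    entrez_gene_id: int
    chromosome: str
    start: int
    end: int
    protein_change: str
    tumor_seq_allele: str
    mutation_type: str


class MutationEventStore:
    """Toy stand-in for the shared mutation event table."""

    def __init__(self) -> None:
        self._events: Dict[EventKey, dict] = {}

    def get_or_create(self, key: EventKey, reference_allele: str) -> dict:
        # Reuse the stored event when the key matches; the incoming
        # reference allele is ignored in that case.
        event = self._events.get(key)
        if event is None:
            event = {"reference_allele": reference_allele}
            self._events[key] = event
        return event
```

If the first import of an event stores 'NA', a later call with the same key and '-' returns the original event unchanged, matching what was observed on reimport.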
cc: @yichaoS @ritikakundra @jjgao
Thank you @n1zea144. I surveyed all the MAFs on datahub, and none had 'NA' as Reference_Allele for the INS variant type. The data files are correct. Since reimporting does not help, how should we go about this? Should we fix the events directly in the database? @jjgao @cBioPortal/curation
Updating the database directly is an option if there is a clear set of rules that can be applied. For example, is it true that all Frame_Shift_Ins events should have '-' in the Reference_Allele column?
@n1zea144 @jjgao There are 882 mutation_events in the database where Ref_allele is NA (excluding fusion type). I think we should update the reference_allele of all these to '-'. These events have reference_allele as '-' on datahub (I compared the database and datahub on the following columns: Entrez_Gene_Id, Chromosome, Start_Position, End_Position, mutation_type, Tumor_Seq_Allele and Protein_change).
Should we use the below logic?
UPDATE mutation_event SET reference_allele = '-'
WHERE reference_allele = 'NA' AND mutation_type <> 'Fusion';
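Before touching the real database, the rule can be tried on a throwaway copy. The sketch below uses Python's built-in `sqlite3` purely as a stand-in (the production database is not SQLite), and the function name `fix_reference_alleles` is hypothetical. The table and column names follow the ones used in this thread; the actual schema may differ in case or naming.

```python
import sqlite3

# Rows the proposed rule applies to: 'NA' reference allele, not a fusion.
WHERE_CLAUSE = "reference_allele = 'NA' AND mutation_type <> 'Fusion'"


def fix_reference_alleles(conn: sqlite3.Connection) -> int:
    """Set reference_allele to '-' for matching events; return how many rows matched."""
    # Count first so the number can be compared with the expected total.
    (count,) = conn.execute(
        f"SELECT COUNT(*) FROM mutation_event WHERE {WHERE_CLAUSE}"
    ).fetchone()
    conn.execute(
        f"UPDATE mutation_event SET reference_allele = '-' WHERE {WHERE_CLAUSE}"
    )
    conn.commit()
    return count
```

Comparing the returned count against the 882 events found by the survey would be a quick sanity check before and after the update.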
@jjgao @n1zea144 I need your comments on the above. Could you confirm whether it is correct to update every 'NA' to '-' for Ref_allele in the mutation_event table?
@yichaoS fixed it in the database. Closing the issue.
We did not observe this issue in the private database for any public study (not sure why).
I see this issue in the private database for some reason at the moment when running the query above.
EDIT: it might be good to run a cronjob or a CI test to check that we don't reintroduce the issue.
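One form such a check could take, at the data file level, is sketched below. It is an assumption about how a CI test might be written, not an existing validator rule: the function name `find_bad_reference_alleles` is invented, and it assumes a tab-separated MAF with a `Reference_Allele` column. It flags 'NA' and any run of more than one dash.

```python
from typing import List, Tuple


def find_bad_reference_alleles(maf_text: str) -> List[Tuple[int, str]]:
    """Return (line number, value) pairs where Reference_Allele is 'NA' or multiple dashes."""
    problems = []
    header = None
    ref_idx = -1
    for lineno, line in enumerate(maf_text.splitlines(), start=1):
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if header is None:
            # The first non-comment line is the column header.
            header = fields
            ref_idx = header.index("Reference_Allele")
            continue
        value = fields[ref_idx]
        is_multi_dash = len(value) > 1 and set(value) == {"-"}
        if value == "NA" or is_multi_dash:
            problems.append((lineno, value))
    return problems
```

A CI job could fail whenever the returned list is non-empty and print the offending line numbers.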
After the database and private study files have been corrected (verified to be free of NA and '--'), the special cases in the notation converter in the Genome Nexus source code should be removed.
BEFORE CLOSING THIS ISSUE, create an issue in the genome nexus code repository to remove those cases and reference https://github.com/genome-nexus/genome-nexus/pull/466/files
Reference_allele should be '-', but in many studies it is 'NA' or multiple '-'.