cBioPortal / datahub

A centralized location for storing curated data from cBioPortal

reference_allele is NA for some INS #621

Closed jjgao closed 3 years ago

jjgao commented 5 years ago

For insertions, Reference_allele should be -, but in many studies it is NA or a run of multiple dashes (for example --).

select gp.stable_id, count(*) as mut_count
from mutation_event me
join mutation m on me.mutation_event_id = m.mutation_event_id
join genetic_profile gp on m.genetic_profile_id = gp.genetic_profile_id
where (me.reference_allele = 'NA' or me.reference_allele like '--%')
and me.mutation_type <> 'fusion'
group by gp.stable_id
order by mut_count desc;

stable_id   mut_count
stes_tcga_pub_mutations 290
stad_tcga_mutations 234
cellline_ccle_broad_mutations   191
brca_tcga_pub2015_mutations 184
sclc_cancercell_gardner_2017_mutations  159
ucec_tcga_mutations 128
ucec_tcga_pub_mutations 128
stad_tcga_pan_can_atlas_2018_mutations  127
brca_tcga_mutations 109
kirc_tcga_mutations 105
kirc_tcga_pub_mutations 102
sclc_ucologne_2015_mutations    98
hnsc_tcga_mutations 95
esca_tcga_mutations 66
ucec_tcga_pan_can_atlas_2018_mutations  62
acc_tcga_mutations  62
laml_tcga_mutations 57
laml_tcga_pan_can_atlas_2018_mutations  57
laml_tcga_pub_mutations 57
msk_impact_2017_mutations   53
brca_tcga_pub_mutations 50
prad_tcga_mutations 47
brca_tcga_pan_can_atlas_2018_mutations  47
hnsc_tcga_pub_mutations 46
sarc_mskcc_mutations    45
coadread_tcga_mutations 39
coadread_tcga_pub_mutations 39
ov_tcga_mutations   37
lusc_tcga_pub_mutations 37
hnsc_tcga_pan_can_atlas_2018_mutations  34
esca_tcga_pan_can_atlas_2018_mutations  33
ov_tcga_pub_mutations   33
blca_tcga_mutations 32
kirc_tcga_pan_can_atlas_2018_mutations  30
nsclc_tcga_broad_2016_mutations 29
blca_tcga_pub_mutations 28
luad_tcga_pub_mutations 27
blca_tcga_pub_2017_mutations    26
skcm_tcga_mutations 25
stad_tcga_pub_mutations 25
luad_tcga_mutations 24
skcm_broad_brafresist_2012_mutations    23
panet_arcnet_2017_mutations 22
blca_tcga_pan_can_atlas_2018_mutations  20
prad_p1000_mutations    20
coadread_tcga_pan_can_atlas_2018_mutations  18
sarc_tcga_mutations 17
luad_tcga_pan_can_atlas_2018_mutations  16
mel_tsam_liang_2017_mutations   15
gbm_tcga_mutations  15
gbm_tcga_pub2013_mutations  15
pcpg_tcga_mutations 15
lihc_tcga_mutations 12
prad_tcga_pan_can_atlas_2018_mutations  12
kirp_tcga_mutations 12
lgggbm_tcga_pub_mutations   12
lung_msk_2017_mutations 11
gbm_tcga_pan_can_atlas_2018_mutations   11
dlbc_tcga_mutations 10
prad_tcga_pub_mutations 10
lusc_tcga_mutations 10
ampca_bcm_2016_mutations    9
cesc_tcga_mutations 9
blca_bgi_mutations  9
summit_2018_mutations   9
sarc_tcga_pan_can_atlas_2018_mutations  9
tgct_tcga_mutations 8
lgg_tcga_mutations  7
hnsc_broad_mutations    7
tmb_mskcc_2018_mutations    6
pediatric_dkfz_2017_mutations   6
lgg_tcga_pan_can_atlas_2018_mutations   6
skcm_broad_mutations    6
coadread_dfci_2016_mutations    6
lihc_amc_prv_mutations  6
mixed_pipseq_2017_mutations 6
kich_tcga_pub_mutations 6
blca_dfarber_mskcc_2014_mutations   5
kich_tcga_mutations 5
prad_su2c_2015_mutations    5
brca_broad_mutations    5
es_dfarber_broad_2014_mutations 5
tgct_tcga_pan_can_atlas_2018_mutations  5
thca_tcga_mutations 5
lcll_broad_2013_mutations   5
thca_tcga_pub_mutations 5
prad_broad_mutations    4
stad_pfizer_uhongkong_mutations 4
prad_fhcrc_mutations    4
ucs_tcga_mutations  4
ucs_tcga_pan_can_atlas_2018_mutations   4
ov_tcga_pan_can_atlas_2018_mutations    4
prad_broad_2013_mutations   3
luad_broad_mutations    3
wt_target_2018_pub_mutations    3
thym_tcga_mutations 3
prad_eururol_2017_mutations 2
kich_tcga_pan_can_atlas_2018_mutations  2
lung_msk_pdx_mutations  2
paad_utsw_2015_mutations    2
lusc_tcga_pan_can_atlas_2018_mutations  2
thca_tcga_pan_can_atlas_2018_mutations  2
mbl_broad_2012_mutations    2
pcnsl_mayo_2015_mutations   2
breast_msk_2018_mutations   2
ucec_msk_2018_mutations 1
meso_tcga_mutations 1
mpnst_mskcc_mutations   1
blca_nmibc_2017_mutations   1
lihc_tcga_pan_can_atlas_2018_mutations  1
nbl_broad_2013_mutations    1
prad_mskcc_2017_mutations   1
desm_broad_2015_mutations   1
kirc_bgi_mutations  1
nsclc_pd1_msk_2018_mutations    1
vsc_cuk_2018_mutations  1
brca_igr_2015_mutations 1
tet_nci_2014_mutations  1
escc_ucla_2014_mutations    1
past_dkfz_heidelberg_2013_mutations 1
hnsc_mdanderson_2013_mutations  1
mbl_sickkids_2016_mutations 1

jjgao commented 5 years ago

@rmadupuri it would also be useful to add a rule in the validator.
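
A validator rule along these lines could be sketched as follows. This is only an illustration: the function name `check_reference_allele` and the dict-per-row input are hypothetical and are not taken from the cBioPortal validator, and the rule assumes that every INS record should carry `-` in `Reference_Allele`.

```python
def check_reference_allele(row):
    """Return an error message for one MAF row, or None if the row passes.

    `row` is a dict keyed by MAF column name (hypothetical input shape).
    Assumption: every record with Variant_Type INS must have '-' as
    Reference_Allele.
    """
    variant_type = row.get("Variant_Type", "")
    ref = row.get("Reference_Allele", "")
    if variant_type == "INS" and ref != "-":
        return f"INS record has Reference_Allele {ref!r}; expected '-'"
    return None


if __name__ == "__main__":
    # A correct INS row yields None; an NA reference allele yields a message.
    print(check_reference_allele({"Variant_Type": "INS", "Reference_Allele": "-"}))
    print(check_reference_allele({"Variant_Type": "INS", "Reference_Allele": "NA"}))
```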

yichaoS commented 5 years ago

@ritikakundra @rmadupuri Query result update (cgds_public)

stes_tcga_pub_mutations 290
stad_tcga_mutations 234
cellline_ccle_broad_mutations 191
brca_tcga_pub2015_mutations 184
sclc_cancercell_gardner_2017_mutations 159
ucec_tcga_mutations 128
ucec_tcga_pub_mutations 128
stad_tcga_pan_can_atlas_2018_mutations 127
brca_tcga_mutations 109
kirc_tcga_mutations 105
kirc_tcga_pub_mutations 102
hnsc_tcga_mutations 95
esca_tcga_mutations 66
acc_tcga_mutations 62
ucec_tcga_pan_can_atlas_2018_mutations 62
laml_tcga_mutations 57
laml_tcga_pan_can_atlas_2018_mutations 57
laml_tcga_pub_mutations 57
msk_impact_2017_mutations 53
brca_tcga_pub_mutations 50
brca_tcga_pan_can_atlas_2018_mutations 47
prad_tcga_mutations 47
hnsc_tcga_pub_mutations 46
coadread_tcga_mutations 39
coadread_tcga_pub_mutations 39
ov_tcga_mutations 37
hnsc_tcga_pan_can_atlas_2018_mutations 34
ov_tcga_pub_mutations 33
esca_tcga_pan_can_atlas_2018_mutations 33
blca_tcga_mutations 32
kirc_tcga_pan_can_atlas_2018_mutations 30
blca_tcga_pub_mutations 29
nsclc_tcga_broad_2016_mutations 29
luad_tcga_pub_mutations 27
blca_tcga_pub_2017_mutations 26
skcm_tcga_mutations 25
stad_tcga_pub_mutations 25
luad_tcga_mutations 24
skcm_broad_brafresist_2012_mutations 23
prad_p1000_mutations 21
blca_tcga_pan_can_atlas_2018_mutations 20
coadread_tcga_pan_can_atlas_2018_mutations 18
sarc_tcga_mutations 17
luad_tcga_pan_can_atlas_2018_mutations 16
gbm_tcga_mutations 15
gbm_tcga_pub2013_mutations 15
mel_tsam_liang_2017_mutations 15
pcpg_tcga_mutations 15
kirp_tcga_mutations 12
prad_tcga_pan_can_atlas_2018_mutations 12
lgggbm_tcga_pub_mutations 12
sarc_tcga_pub_mutations 12
lihc_tcga_mutations 12
gbm_tcga_pan_can_atlas_2018_mutations 11
lung_msk_2017_mutations 11
lusc_tcga_mutations 10
lusc_tcga_pub_mutations 10
prad_tcga_pub_mutations 10
dlbc_tcga_mutations 10
summit_2018_mutations 9
ampca_bcm_2016_mutations 9
blca_bgi_mutations 9
sarc_tcga_pan_can_atlas_2018_mutations 9
cesc_tcga_mutations 9
tgct_tcga_mutations 8
hnsc_broad_mutations 7
lgg_tcga_mutations 7
mixed_pipseq_2017_mutations 7
coadread_dfci_2016_mutations 6
lgg_tcga_pan_can_atlas_2018_mutations 6
tmb_mskcc_2018_mutations 6
lihc_amc_prv_mutations 6
skcm_broad_mutations 6
kich_tcga_pub_mutations 6
es_dfarber_broad_2014_mutations 5
tgct_tcga_pan_can_atlas_2018_mutations 5
thca_tcga_mutations 5
lcll_broad_2013_mutations 5
blca_dfarber_mskcc_2014_mutations 5
thca_tcga_pub_mutations 5
pediatric_dkfz_2017_mutations 5
kich_tcga_mutations 5
brca_broad_mutations 5
prad_su2c_2015_mutations 5
ov_tcga_pan_can_atlas_2018_mutations 4
mbn_mdacc_2013_mutations 4
prad_broad_mutations 4
cscc_hgsc_bcm_2014_mutations 4
stad_pfizer_uhongkong_mutations 4
prad_fhcrc_mutations 4
ucs_tcga_mutations 4
ucs_tcga_pan_can_atlas_2018_mutations 4
prad_su2c_2019_mutations 3
wt_target_2018_pub_mutations 3
thym_tcga_mutations 3
mixed_allen_2018_mutations 3
prad_broad_2013_mutations 3
luad_broad_mutations 3
lung_msk_pdx_mutations 2
lusc_tcga_pan_can_atlas_2018_mutations 2
paad_qcmg_uq_2016_mutations 2
breast_msk_2018_mutations 2
mbl_broad_2012_mutations 2
paad_utsw_2015_mutations 2
bcc_unige_2016_mutations 2
thca_tcga_pan_can_atlas_2018_mutations 2
pcnsl_mayo_2015_mutations 2
kich_tcga_pan_can_atlas_2018_mutations 2
prad_eururol_2017_mutations 2
brca_mbcproject_wagle_2017_mutations 2
tet_nci_2014_mutations 1
mbl_sickkids_2016_mutations 1
past_dkfz_heidelberg_2013_mutations 1
hnsc_mdanderson_2013_mutations 1
blca_nmibc_2017_mutations 1
meso_tcga_mutations 1
ucec_msk_2018_mutations 1
lihc_tcga_pan_can_atlas_2018_mutations 1
mpnst_mskcc_mutations 1
desm_broad_2015_mutations 1
kirc_bgi_mutations 1
nbl_broad_2013_mutations 1
prad_mskcc_2017_mutations 1
brca_igr_2015_mutations 1
nsclc_pd1_msk_2018_mutations 1
escc_ucla_2014_mutations 1
vsc_cuk_2018_mutations 1

ritikakundra commented 5 years ago

@yichaoS @rmadupuri If NA should be -, maybe we can do this with a script? @yichaoS can we add Variant classification and Variant type to the result, to make sure each record is indeed an INS or DEL and not a SNP?

yichaoS commented 5 years ago

@ritika Just checked with variant type: they are all INS (results posted below). Yes, we can definitely use a script to replace NA or -- with - in all data files. <- @rmadupuri

stes_tcga_pub_mutations INS 290
stad_tcga_mutations INS 234
cellline_ccle_broad_mutations INS 191
brca_tcga_pub2015_mutations INS 184
sclc_cancercell_gardner_2017_mutations INS 159
ucec_tcga_mutations INS 128
ucec_tcga_pub_mutations INS 128
stad_tcga_pan_can_atlas_2018_mutations INS 127
brca_tcga_mutations INS 109
kirc_tcga_mutations INS 105
kirc_tcga_pub_mutations INS 102
hnsc_tcga_mutations INS 95
esca_tcga_mutations INS 66
acc_tcga_mutations INS 62
ucec_tcga_pan_can_atlas_2018_mutations INS 62
laml_tcga_mutations INS 57
laml_tcga_pan_can_atlas_2018_mutations INS 57
laml_tcga_pub_mutations INS 57
msk_impact_2017_mutations INS 53
brca_tcga_pub_mutations INS 50
brca_tcga_pan_can_atlas_2018_mutations INS 47
prad_tcga_mutations INS 47
hnsc_tcga_pub_mutations INS 46
coadread_tcga_pub_mutations INS 39
coadread_tcga_mutations INS 39
ov_tcga_mutations INS 37
hnsc_tcga_pan_can_atlas_2018_mutations INS 34
esca_tcga_pan_can_atlas_2018_mutations INS 33
ov_tcga_pub_mutations INS 33
blca_tcga_mutations INS 32
kirc_tcga_pan_can_atlas_2018_mutations INS 30
blca_tcga_pub_mutations INS 29
nsclc_tcga_broad_2016_mutations INS 29
luad_tcga_pub_mutations INS 27
blca_tcga_pub_2017_mutations INS 26
stad_tcga_pub_mutations INS 25
skcm_tcga_mutations INS 25
luad_tcga_mutations INS 24
skcm_broad_brafresist_2012_mutations INS 23
prad_p1000_mutations INS 21
blca_tcga_pan_can_atlas_2018_mutations INS 20
coadread_tcga_pan_can_atlas_2018_mutations INS 18
sarc_tcga_mutations INS 17
luad_tcga_pan_can_atlas_2018_mutations INS 16
mel_tsam_liang_2017_mutations INS 15
gbm_tcga_pub2013_mutations INS 15
pcpg_tcga_mutations INS 15
gbm_tcga_mutations INS 15
sarc_tcga_pub_mutations INS 12
prad_tcga_pan_can_atlas_2018_mutations INS 12
lihc_tcga_mutations INS 12
kirp_tcga_mutations INS 12
lgggbm_tcga_pub_mutations INS 12
gbm_tcga_pan_can_atlas_2018_mutations INS 11
lung_msk_2017_mutations INS 11
lusc_tcga_pub_mutations INS 10
prad_tcga_pub_mutations INS 10
dlbc_tcga_mutations INS 10
lusc_tcga_mutations INS 10
summit_2018_mutations INS 9
cesc_tcga_mutations INS 9
ampca_bcm_2016_mutations INS 9
blca_bgi_mutations INS 9
sarc_tcga_pan_can_atlas_2018_mutations INS 9
tgct_tcga_mutations INS 8
lgg_tcga_mutations INS 7
hnsc_broad_mutations INS 7
mixed_pipseq_2017_mutations INS 7
kich_tcga_pub_mutations INS 6
coadread_dfci_2016_mutations INS 6
skcm_broad_mutations INS 6
lgg_tcga_pan_can_atlas_2018_mutations INS 6
tmb_mskcc_2018_mutations INS 6
lihc_amc_prv_mutations INS 6
blca_dfarber_mskcc_2014_mutations INS 5
thca_tcga_mutations INS 5
lcll_broad_2013_mutations INS 5
brca_broad_mutations INS 5
prad_su2c_2015_mutations INS 5
thca_tcga_pub_mutations INS 5
kich_tcga_mutations INS 5
es_dfarber_broad_2014_mutations INS 5
tgct_tcga_pan_can_atlas_2018_mutations INS 5
pediatric_dkfz_2017_mutations INS 5
mbn_mdacc_2013_mutations INS 4
ucs_tcga_mutations INS 4
stad_pfizer_uhongkong_mutations INS 4
ucs_tcga_pan_can_atlas_2018_mutations INS 4
prad_broad_mutations INS 4
cscc_hgsc_bcm_2014_mutations INS 4
ov_tcga_pan_can_atlas_2018_mutations INS 4
prad_fhcrc_mutations INS 4
prad_broad_2013_mutations INS 3
luad_broad_mutations INS 3
mixed_allen_2018_mutations INS 3
wt_target_2018_pub_mutations INS 3
prad_su2c_2019_mutations INS 3
thym_tcga_mutations INS 3
paad_utsw_2015_mutations INS 2
lusc_tcga_pan_can_atlas_2018_mutations INS 2
thca_tcga_pan_can_atlas_2018_mutations INS 2
pcnsl_mayo_2015_mutations INS 2
mbl_broad_2012_mutations INS 2
bcc_unige_2016_mutations INS 2
prad_eururol_2017_mutations INS 2
lung_msk_pdx_mutations INS 2
brca_mbcproject_wagle_2017_mutations INS 2
paad_qcmg_uq_2016_mutations INS 2
breast_msk_2018_mutations INS 2
kich_tcga_pan_can_atlas_2018_mutations INS 2
prad_mskcc_2017_mutations INS 1
mpnst_mskcc_mutations INS 1
ucec_msk_2018_mutations INS 1
blca_nmibc_2017_mutations INS 1
past_dkfz_heidelberg_2013_mutations INS 1
kirc_bgi_mutations INS 1
lihc_tcga_pan_can_atlas_2018_mutations INS 1
tet_nci_2014_mutations INS 1
meso_tcga_mutations INS 1
escc_ucla_2014_mutations INS 1
vsc_cuk_2018_mutations INS 1
nbl_broad_2013_mutations INS 1
brca_igr_2015_mutations INS 1
desm_broad_2015_mutations INS 1
nsclc_pd1_msk_2018_mutations INS 1
hnsc_mdanderson_2013_mutations INS 1
mbl_sickkids_2016_mutations INS 1
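
The clean-up script discussed in this comment might look roughly like the sketch below. It is not an existing datahub tool: `fix_maf_lines` is a hypothetical name, the input is assumed to be a tab-separated MAF with `Reference_Allele` and `Variant_Type` columns, and only values that are exactly `NA` or made up entirely of two or more dashes are rewritten (a slightly narrower match than the `like "--%"` in the query).

```python
import re

# Values treated as a broken '-' (assumption: NA, or only dashes, 2 or more).
BAD_REF = re.compile(r"^(NA|-{2,})$")


def fix_maf_lines(lines):
    """Yield MAF lines, setting bad Reference_Allele values to '-' on INS rows."""
    ref_idx = type_idx = None
    for line in lines:
        if line.startswith("#"):
            # Leading comment/metadata lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if ref_idx is None:
            # The first non-comment line is the column header.
            ref_idx = fields.index("Reference_Allele")
            type_idx = fields.index("Variant_Type")
            yield line
            continue
        if fields[type_idx] == "INS" and BAD_REF.match(fields[ref_idx]):
            fields[ref_idx] = "-"
        yield "\t".join(fields) + "\n"
```

Restricting the rewrite to INS rows keeps SNP or DEL records with an unexpected `NA` untouched, so they can still be reviewed by hand.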

n1zea144 commented 4 years ago

After @rmadupuri checked the stes_tcga_pub MAF and verified that it looks correct (ref allele is '-', not NA), I looked more closely at the code.

What is happening is that during import of a MAF record, if a matching mutation event is found in the database, the mutation event is reused (by matching I mean an event with the same entrez, chr, start, stop, protein change, tumor seq allele, and mutation type).

This behavior is by design: mutation events are shared across MAFs. One side effect is that if an incorrect mutation event enters the database, it cannot be corrected by fixing a record in a single MAF and reimporting, because the event may be linked to MAFs across many studies. Unless all studies that contain the event are deleted and reimported, the mutation event has to be updated in the database directly.
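
The reuse behavior described above can be illustrated with a toy model. This is not the importer's code: `EventStore`, `import_record`, and the field names are hypothetical, chosen only to mirror the matching fields listed in this comment.

```python
# Fields that identify a mutation event (mirrors the list in the comment).
EVENT_KEY = ("entrez", "chr", "start", "stop",
             "protein_change", "tumor_seq_allele", "mutation_type")


class EventStore:
    """Toy stand-in for the shared mutation_event table."""

    def __init__(self):
        self.events = {}

    def import_record(self, record):
        """Reuse an existing event with the same key, else store a new one."""
        key = tuple(record[field] for field in EVENT_KEY)
        if key not in self.events:
            self.events[key] = dict(record)
        # Note: reference_allele is not part of the key, so a corrected
        # value in a later record never reaches the stored event.
        return self.events[key]


if __name__ == "__main__":
    store = EventStore()
    bad = {"entrez": 7157, "chr": "17", "start": 100, "stop": 101,
           "protein_change": "X1fs", "tumor_seq_allele": "A",
           "mutation_type": "Frame_Shift_Ins", "reference_allele": "NA"}
    store.import_record(bad)
    fixed = dict(bad, reference_allele="-")
    print(store.import_record(fixed)["reference_allele"])  # still the stored value
```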

cc: @yichaoS @ritikakundra @jjgao

rmadupuri commented 4 years ago

Thank you @n1zea144. I surveyed all the MAFs on datahub, and none had NA as Reference_Allele for the INS variant type. The data files are correct. Since reimporting does not help, how should we go about this? Should we fix them directly in the database? @jjgao @cBioPortal/curation

n1zea144 commented 4 years ago

Updating the database directly is an option if there is a clear set of rules that can be applied. For example, is it true that all Frame_Shift_Ins events should have '-' in the Reference_Allele column?

rmadupuri commented 4 years ago

@n1zea144 @jjgao There are 882 mutation_events in the database where Ref_allele is NA (excluding fusion type). I think we should update the reference_allele of all of these to -. These events have reference_allele as - on datahub. (I compared the database and datahub on the following columns: Entrez_Gene_Id, Chromosome, Start_Position, End_Position, mutation_type, Tumor_Seq_Allele and Protein_change.)

Should we use the below logic?

update mutation_event
set reference_allele = '-'
where reference_allele = 'NA' and mutation_type <> 'Fusion';
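
Before running such an update on the real database, the logic above could be rehearsed with a dry run that reports how many rows would change. The sketch uses Python's built-in `sqlite3` purely as a stand-in; the helper name `apply_fix` and the case-insensitive comparison on `mutation_type` are assumptions, not part of the original proposal.

```python
import sqlite3

UPDATE_SQL = """
UPDATE mutation_event
SET reference_allele = '-'
WHERE reference_allele = 'NA'
  AND LOWER(mutation_type) <> 'fusion'
"""


def apply_fix(conn, dry_run=True):
    """Run the update and return the affected row count.

    With dry_run=True the change is rolled back, so the count can be
    compared with the expected number of events before committing.
    """
    cur = conn.execute(UPDATE_SQL)
    changed = cur.rowcount
    if dry_run:
        conn.rollback()
    else:
        conn.commit()
    return changed
```

Comparing the dry-run count with the number of affected events found earlier is a cheap sanity check before the change is committed.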

rmadupuri commented 4 years ago

@jjgao @n1zea144 I need your comments on the above. Is it correct to update every NA Ref_allele in the mutation_event table to -?

rmadupuri commented 4 years ago

@yichaoS fixed it in the database. Closing the issue.

We did not observe this issue in the private database for any public study (not sure why).

inodb commented 3 years ago

For some reason, I currently see this issue in the private database when running the query above.

EDIT: it might be good to run a cron job or a CI test to check that we don't reintroduce the issue.
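
A recurring check of this kind could be sketched as below. It reruns the query from the top of this issue and returns one row per affected profile, so a cron job or CI test can fail when the list is non-empty. The sketch uses `sqlite3` with a tiny demo schema as a stand-in; the function names, the demo data, and the way a real job would connect to the database are all assumptions.

```python
import sqlite3

CHECK_SQL = """
SELECT gp.stable_id, COUNT(*) AS mut_count
FROM mutation_event me
JOIN mutation m ON me.mutation_event_id = m.mutation_event_id
JOIN genetic_profile gp ON m.genetic_profile_id = gp.genetic_profile_id
WHERE (me.reference_allele = 'NA' OR me.reference_allele LIKE '--%')
  AND LOWER(me.mutation_type) <> 'fusion'
GROUP BY gp.stable_id
ORDER BY mut_count DESC
"""


def find_bad_reference_alleles(conn):
    """Return (stable_id, count) rows for profiles with bad reference alleles."""
    return conn.execute(CHECK_SQL).fetchall()


def make_demo_db():
    """Build an in-memory database with just the columns the check needs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE genetic_profile (genetic_profile_id INTEGER, stable_id TEXT);
    CREATE TABLE mutation_event (mutation_event_id INTEGER,
                                 reference_allele TEXT, mutation_type TEXT);
    CREATE TABLE mutation (mutation_event_id INTEGER, genetic_profile_id INTEGER);
    INSERT INTO genetic_profile VALUES (1, 'demo_study_mutations');
    INSERT INTO mutation_event VALUES (1, 'NA', 'Frame_Shift_Ins');
    INSERT INTO mutation_event VALUES (2, '-', 'Frame_Shift_Ins');
    INSERT INTO mutation_event VALUES (3, 'NA', 'Fusion');
    INSERT INTO mutation VALUES (1, 1);
    INSERT INTO mutation VALUES (2, 1);
    INSERT INTO mutation VALUES (3, 1);
    """)
    return conn


if __name__ == "__main__":
    offenders = find_bad_reference_alleles(make_demo_db())
    # A CI job would exit non-zero here when offenders is non-empty.
    print(offenders)
```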

sheridancbio commented 3 years ago

After the database and private study files have been corrected (verified to be free of NA and '--'), those special cases in the notation converter in the genome nexus source code should be removed.

BEFORE CLOSING THIS ISSUE, create an issue in the genome nexus code repository to remove those cases and reference https://github.com/genome-nexus/genome-nexus/pull/466/files