cBioPortal / datahub

A centralized location for storing curated data from cBioPortal

Normalizing ncbi_build in MAF #571

Closed jjgao closed 4 years ago

jjgao commented 5 years ago

In the public portal database, there are a few different values for NCBI_BUILD. But we should only have one version (GRCh37/hg19) in our database. Let's:

Note: it's related to multi-genome support too: https://github.com/cBioPortal/cbioportal/issues/5652.

notifying @rmadupuri @ritikakundra @yichaoS @sandertan @pieterlukasse @n1zea144 @khzhu

| ncbi_build | count(distinct gp.stable_id) | group_concat(distinct gp.stable_id) |
| --- | --- | --- |
| 36 | 1 | sarc_mskcc_mutations |
| 37 | 113 | all_stjude_2013_mutations,all_stjude_2016_mutations,aml_target_2018_pub_mutations,ampca_bcm_2016_mutations,angs_project_painter_2018_mutations,... |
| GRCh37 | 237 | acbc_mskcc_2015_mutations,acc_tcga_mutations,acc_tcga_pan_can_atlas_2018_mutations,... |
| hg19 | 27 | acyc_sanger_2013_mutations,all_stjude_2015_mutations,blca_nmibc_2017_mutations,... |
| NA | 86 | acbc_mskcc_2015_mutations,acc_tcga_pan_can_atlas_2018_mutations,acyc_mda_2015_mutations,... |

select ncbi_build, count(distinct gp.stable_id), group_concat(distinct gp.stable_id)
from mutation_event e, mutation m, genetic_profile gp
where e.mutation_event_id=m.mutation_event_id
and m.genetic_profile_id=gp.genetic_profile_id
group by ncbi_build

fedde-s commented 5 years ago

For reference: the current validation rules specifically allow a study to pass down the loading pipeline if all mutations that would be loaded have one of these formats:

Or the corresponding strings for the (human or mouse) genome configured in a custom cBioPortal installation's portal.properties file, if applicable.

fedde-s commented 5 years ago

Correction: they also allow blank genome fields and files that lack the column altogether, which will presumably end up being represented in the database as NA.

yichaoS commented 5 years ago

@jj sarc_mskcc has been lacking coordinates (so its MAF is not in datahub at all), and we've been waiting for Barry to get back to us. Without them, I don't think we can do a liftover?

JJ commented 5 years ago

I keep unsubscribing myself from these issues, but really, I would appreciate if you stopped including me in them...

yichaoS commented 5 years ago

I'm so sorry @JJ !! The right @jjgao ^

rmadupuri commented 5 years ago

@jjgao from what I understand, if NCBI_Build is 37 or hg19 instead of GRCh37, the corresponding rows did not get annotated (HGVSp_Short is empty). When a row is annotated, the annotator replaces the value with GRCh37.

We have recently re-annotated all the public studies but they were not pushed to production yet since the public-rebuild is happening soon.

Here's the update from the re-annotated files:

| NCBI_Build | Count(studies) | Studies_list |
| --- | --- | --- |
| 37 | 31 | acc_tcga,blca_bgi,blca_mskcc_solit_2012,cesc_tcga,chol_nus_2012,chol_tcga,cll_iuopa_2015,coadread_tcga,dlbc_broad_2012,es_dfarber_broad_2014,esca_tcga,laml_tcga,lcll_broad_2013,lusc_tcga,mcl_idibips_2013,mixed_allen_2018,mpnst_mskcc,ov_tcga,paad_icgc,paad_qcmg_uq_2016,panet_arcnet_2017,prad_fhcrc,prad_mskcc_cheny1_organoids_2014,skcm_ucla_2016,skcm_yale,stad_tcga,stad_uhongkong,stes_tcga_pub,tgct_tcga,thca_tcga,thym_tcga |
| hg19 | 6 | brca_igr_2015,brca_sanger,ctcl_columbia_2015,hcc_inserm_fr_2015,lihc_amc_prv,pact_jhu_2011 |
| NA | 2 | lgggbm_tcga_pub,urcc_mskcc_2016 |

sandertan commented 5 years ago

@jjgao the latest MAF file format specification (currently unavailable, but GDC says it will be added back soon) states that "hg18, hg19, GRCh37, GRCh37-lite, 36, 36.1, 37," are valid in this field.

But I think in practice, everyone uses GRCh37 and GRCh38, and perhaps GRCm38 for mouse.

We could either reject values other than GRCh37, GRCh38, and GRCm38 in the validator, or convert all other values to one of these in the importer.
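To make the "convert in the importer" option concrete, here is a minimal sketch of an alias lookup. The names `BUILD_ALIASES` and `normalize_ncbi_build` are hypothetical and not part of the cBioPortal code base; the aliases are limited to values mentioned in this thread. Build 36 values are left unmapped on purpose, since they describe a different assembly and would need a liftover rather than a rename.

```python
# Hypothetical alias table (an assumption for illustration, not cBioPortal code).
# Keys are lower-cased so the lookup is case-insensitive.
BUILD_ALIASES = {
    "grch37": "GRCh37",
    "hg19": "GRCh37",
    "37": "GRCh37",
    "grch38": "GRCh38",
    "grcm38": "GRCm38",
}


def normalize_ncbi_build(value):
    """Return the canonical build name, or None if the value is not recognized."""
    if value is None:
        return None
    return BUILD_ALIASES.get(value.strip().lower())
```

Returning `None` for unrecognized values leaves the decision (reject, warn, or default) to the caller.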

rmadupuri commented 5 years ago

Normalized build (hg19/37/NA) to GRCh37 in files from all public studies.
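For anyone who needs to do the same on their own files, a file-level normalization along these lines could look roughly like the sketch below. This is not the script that was actually run; `normalize_maf_text` and `TO_GRCH37` are hypothetical names, and it assumes a tab-separated MAF whose first non-comment line is the header. Rewriting NA or blank values is only safe when the study is known to be GRCh37.

```python
# Values from this thread that were rewritten to GRCh37 (assumption: the
# study is known to use GRCh37, otherwise NA/blank must not be rewritten).
TO_GRCH37 = {"hg19", "37", "NA", ""}


def normalize_maf_text(maf_text, column="NCBI_Build"):
    """Return the MAF text with legacy build values in `column` set to GRCh37."""
    out = []
    col_index = None  # unknown until the header line has been seen
    for line in maf_text.splitlines():
        if line.startswith("#"):
            out.append(line)  # keep comment lines untouched
            continue
        fields = line.split("\t")
        if col_index is None:
            # First non-comment line is the header; -1 means the column is absent.
            col_index = fields.index(column) if column in fields else -1
            out.append(line)
            continue
        if 0 <= col_index < len(fields) and fields[col_index].strip() in TO_GRCH37:
            fields[col_index] = "GRCh37"
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"
```

Files without the column pass through unchanged, so a missing column would still need separate handling.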

jjgao commented 5 years ago

@sandertan maybe add it as a warning?

For us, we should normalize them into the same value, definitely in the database, and probably in the files too.

pieterlukasse commented 5 years ago

@jjgao I'm in favor of having a strict validation at some point, for both human and mouse. Also being discussed in https://github.com/cBioPortal/cbioportal/pull/5891
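The difference between the warning approach and strict validation can be sketched as a single check with a `strict` switch. This is an illustration only, not the cBioPortal validator: `check_build`, `CANONICAL_BUILDS`, and `TOLERATED_VALUES` are hypothetical names, and treating blank or NA as tolerated in non-strict mode is an assumption based on the current rules described earlier in this thread.

```python
# Hypothetical value sets, taken from the values discussed in this thread.
CANONICAL_BUILDS = {"GRCh37", "GRCh38", "GRCm38"}
TOLERATED_VALUES = {"hg19", "37", "NA", ""}  # accepted with a warning unless strict


def check_build(value, strict=False):
    """Return (level, message), where level is 'ok', 'warning', or 'error'."""
    text = (value or "").strip()
    if text in CANONICAL_BUILDS:
        return ("ok", "")
    if text in TOLERATED_VALUES and not strict:
        return ("warning", f"NCBI_Build '{text}' should be written as a canonical build name")
    return ("error", f"NCBI_Build '{text}' is not an accepted value")
```

With `strict=False`, legacy values only raise a warning, which would let existing studies keep loading while files are cleaned up; switching to `strict=True` later turns the same values into errors.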