cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0

case list _cnaseq is missing from many studies #3378

Closed jjgao closed 6 years ago

jjgao commented 6 years ago

It's missing from many studies, e.g. blca_tcga, brca_tcga_pub2015, thca_tcga, and more.

We used to auto-generate the case list _cnaseq ("Tumor Samples with sequencing and CNA data") when importing.

jjgao commented 6 years ago

It looks like brca_tcga_pub2015 is missing a lot of case lists.

Can we generate the data-related case lists (where they do not already exist) based on the case_lists sheet in portal_importer_configuration for all public studies and check them in?

jjgao commented 6 years ago

@yichaoS @n1zea144 we may want to have another script to update case lists only. Is that possible?

jjgao commented 6 years ago

Another way to address it is to generate them in the API or frontend from either the sample_profile or the genetic_profile_samples table.

Maybe it's time to link profile to case list as we discussed earlier. @n1zea144

jjgao commented 6 years ago

from @angelicaochoa:

The data files need to be checked to confirm that their TCGA sample IDs are in the truncated form. The auto-generated case lists do not truncate the sample IDs loaded from each data file, so when the case lists are generated and imported, the importer can't "find" the sample IDs it put in the case list. Example: TCGA-XX-XXXX-01A in the data files is truncated to just TCGA-XX-XXXX-01 on import, but the temporary case lists still use TCGA-XX-XXXX-01A, which does not exist in the database, so the sample list is not populated correctly.
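The truncation described above can be sketched as follows. This is a minimal illustration, not the importer's actual code: the function names `truncate_tcga_sample_id` and `normalize_case_list` are hypothetical, and the regular expression assumes the usual TCGA barcode layout (two-character site code, four-character participant code, two-digit sample type, optional vial letter and further fields).

```python
import re

# Assumed barcode layout: TCGA-<site>-<participant>-<sample type>[vial][-...].
# The capture group keeps everything up to the two-digit sample type.
_TCGA_SAMPLE_RE = re.compile(r"^(TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4}-\d{2})")


def truncate_tcga_sample_id(sample_id: str) -> str:
    """Return the sample-level ID; non-TCGA IDs are returned unchanged."""
    cleaned = sample_id.strip()
    match = _TCGA_SAMPLE_RE.match(cleaned)
    return match.group(1) if match else cleaned


def normalize_case_list(sample_ids):
    """Truncate every ID and drop duplicates, keeping first-seen order."""
    seen = {}
    for sample_id in sample_ids:
        seen.setdefault(truncate_tcga_sample_id(sample_id), None)
    return list(seen)
```

Applying `normalize_case_list` to the IDs read from a data file before writing the case list would keep the case list consistent with the truncated IDs stored in the database.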

jjgao commented 6 years ago

THIS IS FIXED.

I ran a SQL query:

select `STABLE_ID`, NAME
from sample_list l
where l.`LIST_ID` not in (
 select ll.`LIST_ID`
 from sample_list_list ll
)

Here are the case lists without any samples in the database:

brca_tcga_pub2015_sequenced      Sequenced Tumors
brca_tcga_pub2015_rppa           Tumor Samples with RPPA data
gbm_tcga_pub2013_cna             Tumor Samples with CNA data
gbm_tcga_pub2013_rppa            Tumor Samples with RPPA data
prad_tcga_pub_rna_seq_v2_mrna    Tumor Samples with mRNA data (RNA Seq V2)
thca_tcga_pub_sequenced          Sequenced Tumors
thca_tcga_pub_rppa               Tumor Samples with RPPA data

jjgao commented 6 years ago

I ran another query to find studies that have both mutation and CNA data but no _cnaseq case list:

select cs.`CANCER_STUDY_IDENTIFIER`, cs.`NAME`
from cancer_study cs
where cs.`CANCER_STUDY_ID` in (
 # has mutation profile
 select cs.`CANCER_STUDY_ID`
 from genetic_profile p, cancer_study cs
 where p.`GENETIC_ALTERATION_TYPE`="MUTATION_EXTENDED" and p.`CANCER_STUDY_ID`=cs.`CANCER_STUDY_ID`
) and cs.`CANCER_STUDY_ID` in (
 # and cna profile
 select cs.`CANCER_STUDY_ID`
 from genetic_profile p, cancer_study cs
 where p.`GENETIC_ALTERATION_TYPE`="COPY_NUMBER_ALTERATION" and p.`DATATYPE`='DISCRETE' and p.`CANCER_STUDY_ID`=cs.`CANCER_STUDY_ID`
) and cs.`CANCER_STUDY_ID` not in (
 # but no _cnaseq case list
 select cs.`CANCER_STUDY_ID`
 from sample_list l, cancer_study cs
 where (l.`STABLE_ID` like "%_cnaseq" OR l.`STABLE_ID` like "%_cna_seq") and l.`CANCER_STUDY_ID`=cs.`CANCER_STUDY_ID`
)
order by cs.`CANCER_STUDY_IDENTIFIER`;

Here are the studies:

blca_tcga        Bladder Urothelial Carcinoma (TCGA, Provisional)
coadread_tcga    Colorectal Adenocarcinoma (TCGA, Provisional)
esca_tcga        Esophageal Carcinoma (TCGA, Provisional)
hnsc_tcga        Head and Neck Squamous Cell Carcinoma (TCGA, Provisional)
kich_tcga        Kidney Chromophobe (TCGA, Provisional)
kirp_tcga        Kidney Renal Papillary Cell Carcinoma (TCGA, Provisional)
lihc_tcga        Liver Hepatocellular Carcinoma (TCGA, Provisional)
lusc_tcga        Lung Squamous Cell Carcinoma (TCGA, Provisional)
meso_tcga        Mesothelioma (TCGA, Provisional)
ov_tcga          Ovarian Serous Cystadenocarcinoma (TCGA, Provisional)
paad_tcga        Pancreatic Adenocarcinoma (TCGA, Provisional)
pcpg_tcga        Pheochromocytoma and Paraganglioma (TCGA, Provisional)
prad_tcga        Prostate Adenocarcinoma (TCGA, Provisional)
sarc_tcga        Sarcoma (TCGA, Provisional)
skcm_tcga        Skin Cutaneous Melanoma (TCGA, Provisional)
stad_tcga        Stomach Adenocarcinoma (TCGA, Provisional)
stad_tcga_pub    Stomach Adenocarcinoma (TCGA, Nature 2014)
tgct_tcga        Testicular Germ Cell Cancer (TCGA, Provisional)
thca_tcga        Thyroid Carcinoma (TCGA, Provisional)
thym_tcga        Thymoma (TCGA, Provisional)
ucec_tcga        Uterine Corpus Endometrial Carcinoma (TCGA, Provisional)
ucs_tcga         Uterine Carcinosarcoma (TCGA, Provisional)
uvm_tcga         Uveal Melanoma (TCGA, Provisional)

I also noticed that the following studies have a _cna_seq case list instead of _cnaseq. We should make them consistent. @ritikakundra logged it here: https://github.com/cBioPortal/datahub/issues/116

acyc_mskcc_2013
blca_mskcc_solit_2012
coadread_tcga_pub
luad_broad
ov_tcga_pub
prad_broad
prad_broad_2013
prad_mich
prad_mskcc
sarc_mskcc
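
Making the suffix consistent for the studies listed above could look like the sketch below. The helper name `canonical_case_list_id` is hypothetical, and the sketch only covers the stable ID string itself, not the file renames or re-imports that would also be needed.

```python
def canonical_case_list_id(stable_id: str) -> str:
    """Rewrite a trailing '_cna_seq' to '_cnaseq'; leave other IDs untouched."""
    old_suffix = "_cna_seq"
    if stable_id.endswith(old_suffix):
        return stable_id[: -len(old_suffix)] + "_cnaseq"
    return stable_id
```

For example, `canonical_case_list_id("prad_mskcc_cna_seq")` returns `"prad_mskcc_cnaseq"`, while an ID that already ends in `_cnaseq` or `_cna` is returned unchanged.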

jjgao commented 6 years ago

Is there any validation code to catch missing case lists? @pieterlukasse

ao508 commented 6 years ago

I am going to make a checklist here for resolved missing cnaseq case lists:

ao508 commented 6 years ago

@jjgao there are some studies listed above that do not have overlapping sample ids in cases_sequenced and cases_cna. I'll update this list as I go along.
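
For reference, the overlap check amounts to intersecting the two sample sets. The sketch below is an assumption about how that could be expressed, not the script actually used: `build_cnaseq_case_list` and the dictionary keys are hypothetical names, and the inputs are assumed to be sample IDs already in the same (truncated) format.

```python
def build_cnaseq_case_list(study_id, sequenced_ids, cna_ids):
    """Return a _cnaseq case list as a dict, or None if the sets do not overlap."""
    sequenced = set(sequenced_ids)
    overlap = {sample_id for sample_id in cna_ids if sample_id in sequenced}
    if not overlap:
        # Nothing to generate, e.g. when the two lists use different ID formats.
        return None
    return {
        "stable_id": f"{study_id}_cnaseq",
        "name": "Tumor Samples with sequencing and CNA data",
        "sample_ids": sorted(overlap),
    }
```

An empty overlap is worth treating as a warning rather than a silent skip, since it may point to mismatched sample ID formats in cases_sequenced and cases_cna rather than a genuine absence of shared samples.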

sandertan commented 6 years ago

@jjgao I believe there's currently no validation for missing specific case lists, except for the _all case list: ERROR: -: No case list found for stable_id 'teststudy_all', consider adding 'add_global_case_list: true' to the study metadata file

Could we also document the required case lists in File Formats and align them with the default selections on the query page?

https://github.com/cBioPortal/cbioportal-frontend/blob/b6577560fa2a07e8a6d4b5ea881327283dcccb70/src/shared/components/query/QueryStore.ts#L598
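
A check for the missing case lists discussed in this thread might be sketched like this. It is not existing validator code: `missing_case_lists` is a hypothetical function, and the mapping from alteration types to the `_sequenced`, `_cna`, and `_cnaseq` suffixes is an assumption drawn from the case list names mentioned above.

```python
def missing_case_lists(study_id, alteration_types, case_list_ids):
    """List the expected case list stable IDs that the study does not define."""
    types = set(alteration_types)
    expected = [f"{study_id}_all"]
    if "MUTATION_EXTENDED" in types:
        expected.append(f"{study_id}_sequenced")
    if "COPY_NUMBER_ALTERATION" in types:
        expected.append(f"{study_id}_cna")
    # _cnaseq is only expected when both profile types are present.
    if {"MUTATION_EXTENDED", "COPY_NUMBER_ALTERATION"} <= types:
        expected.append(f"{study_id}_cnaseq")
    present = set(case_list_ids)
    return [stable_id for stable_id in expected if stable_id not in present]
```

A validator could report each returned ID as a warning, in the same spirit as the existing error for the `_all` case list.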

sandertan commented 6 years ago

Also, are the categories such as all_cases_with_cna_data still used? The documentation is a bit vague about it.

jjgao commented 6 years ago

@angelicaochoa re: no overlapping sample ids in cases_sequenced and cases_cna for ov_tcga, there must be something wrong in the case lists. This query shows mutations and CNAs on the same samples.

jjgao commented 6 years ago

@sandertan: I am thinking of re-implementing case lists: https://docs.google.com/document/d/1aBbkTAFv5nCqBv66BOgwvlt5kymgt0lQ7l_pqaFrupc/edit?usp=sharing

ao508 commented 6 years ago

@jjgao missing cnaseq case lists are resolved. The only study that did not actually have overlapping CNA and Mutations samples is ov_tcga.

jjgao commented 6 years ago

Thanks, @angelicaochoa!

But in this query (http://www.cbioportal.org/index.do?session_id=5a1eb215498eb8b3d560ef6a) (all ov_tcga samples selected), there are a lot of samples with both mutations and CNAs. Could you double check if cases_sequenced and cases_cna were generated correctly?

jjgao commented 6 years ago

@angelicaochoa I tried two queries, one for sequenced tumors (http://www.cbioportal.org/index.do?session_id=5a1eb2af498eb8b3d560ef77), and the other for CNA tumors (http://www.cbioportal.org/index.do?session_id=5a1eb2bd498eb8b3d560ef79), and looked at the Download tab. There are a lot of overlapping samples between the two queries.

jjgao commented 6 years ago

@angelicaochoa did you have a chance to look into the ovarian study? There are definitely many overlapping samples between the mutations and cna profiles.

ao508 commented 6 years ago

@jjgao There might have been an issue where the case list sample IDs were not in a standardized format. That should be addressed now, however, with the latest datahub updates to the provisional TCGA studies.

ao508 commented 6 years ago

@yichaoS @ritikakundra have the TCGA provisional studies been updated in the public portal yet?

ao508 commented 6 years ago

@jjgao There might have been an issue where the TCGA sample IDs were not in the standardized/truncated format but I think this should be resolved in the latest TCGA provisional updates to datahub.

@yichaoS @ritikakundra can you confirm whether the missing cnaseq case list for ov_tcga is resolved?