Closed jjgao closed 6 years ago
It looks like brca_tcga_pub2015 is missing a lot of case lists.
Can we generate the data-related case lists (if they do not already exist) based on the case_lists sheet in portal_importer_configuration for all public studies and check them in?
@yichaoS @n1zea144 we may want to have another script to update case lists only. Is that possible?
Another way to address it is to generate them in the API or frontend from either the sample_profile or genetic_profile_samples table.
Maybe it's time to link profile to case list as we discussed earlier. @n1zea144
from @angelicaochoa:
The data files need to be checked to confirm that their TCGA sample IDs are in the truncated form. The auto-generated case lists do not truncate the sample IDs loaded from each data file, so when the case lists are generated and imported, the importer can't "find" the sample IDs it put in the case list. Example: TCGA-XX-XXXX-01A in the data files will be truncated to just TCGA-XX-XXXX-01, but the temp case lists will still use TCGA-XX-XXXX-01A, which does not exist in the database, so the sample list will not be populated correctly in the database itself.
THIS IS FIXED.
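A minimal sketch of the truncation described above, assuming the standardized form keeps the first four hyphen-separated fields and a two-digit sample type code. The function name and the regular expression are illustrative assumptions, not the importer's actual code:

```python
import re

# Assumption: a TCGA sample barcode begins with
# TCGA-<2 chars>-<4 chars>-<2-digit sample type>, optionally followed by
# a vial letter and further fields (e.g. TCGA-XX-XXXX-01A-...).
_TCGA_SAMPLE = re.compile(r"^(TCGA-[A-Za-z0-9]{2}-[A-Za-z0-9]{4}-\d{2})")


def truncate_tcga_sample_id(sample_id: str) -> str:
    """Return the truncated TCGA sample ID; leave other IDs unchanged."""
    cleaned = sample_id.strip()
    match = _TCGA_SAMPLE.match(cleaned)
    return match.group(1) if match else cleaned
```

Applying the same function both to the IDs read from the data files and to the IDs written into a generated case list would keep the two in the same form.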
I ran a SQL query:
select `STABLE_ID`, NAME
from sample_list l
where l.`LIST_ID` not in (
select ll.`LIST_ID`
from sample_list_list ll
)
Here are the case lists without any samples in the database:
brca_tcga_pub2015_sequenced Sequenced Tumors
brca_tcga_pub2015_rppa Tumor Samples with RPPA data
gbm_tcga_pub2013_cna Tumor Samples with CNA data
gbm_tcga_pub2013_rppa Tumor Samples with RPPA data
prad_tcga_pub_rna_seq_v2_mrna Tumor Samples with mRNA data (RNA Seq V2)
thca_tcga_pub_sequenced Sequenced Tumors
thca_tcga_pub_rppa Tumor Samples with RPPA data
I ran another query to find studies that have both mutation and CNA data but no _cnaseq case list:
select cs.`CANCER_STUDY_IDENTIFIER`, cs.`NAME`
from cancer_study cs
where cs.`CANCER_STUDY_ID` in (
# has mutation profile
select cs.`CANCER_STUDY_ID`
from genetic_profile p, cancer_study cs
where p.`GENETIC_ALTERATION_TYPE`="MUTATION_EXTENDED" and p.`CANCER_STUDY_ID`=cs.`CANCER_STUDY_ID`
) and cs.`CANCER_STUDY_ID` in (
# and cna profile
select cs.`CANCER_STUDY_ID`
from genetic_profile p, cancer_study cs
where p.`GENETIC_ALTERATION_TYPE`="COPY_NUMBER_ALTERATION" and p.`DATATYPE`='DISCRETE' and p.`CANCER_STUDY_ID`=cs.`CANCER_STUDY_ID`
) and cs.`CANCER_STUDY_ID` not in (
# but no _cnaseq case list
select cs.`CANCER_STUDY_ID`
from sample_list l, cancer_study cs
where (l.`STABLE_ID` like "%_cnaseq" OR l.`STABLE_ID` like "%_cna_seq") and l.`CANCER_STUDY_ID`=cs.`CANCER_STUDY_ID`
)
order by cs.`CANCER_STUDY_IDENTIFIER`;
Here are the studies:
blca_tcga Bladder Urothelial Carcinoma (TCGA, Provisional)
coadread_tcga Colorectal Adenocarcinoma (TCGA, Provisional)
esca_tcga Esophageal Carcinoma (TCGA, Provisional)
hnsc_tcga Head and Neck Squamous Cell Carcinoma (TCGA, Provisional)
kich_tcga Kidney Chromophobe (TCGA, Provisional)
kirp_tcga Kidney Renal Papillary Cell Carcinoma (TCGA, Provisional)
lihc_tcga Liver Hepatocellular Carcinoma (TCGA, Provisional)
lusc_tcga Lung Squamous Cell Carcinoma (TCGA, Provisional)
meso_tcga Mesothelioma (TCGA, Provisional)
ov_tcga Ovarian Serous Cystadenocarcinoma (TCGA, Provisional)
paad_tcga Pancreatic Adenocarcinoma (TCGA, Provisional)
pcpg_tcga Pheochromocytoma and Paraganglioma (TCGA, Provisional)
prad_tcga Prostate Adenocarcinoma (TCGA, Provisional)
sarc_tcga Sarcoma (TCGA, Provisional)
skcm_tcga Skin Cutaneous Melanoma (TCGA, Provisional)
stad_tcga Stomach Adenocarcinoma (TCGA, Provisional)
stad_tcga_pub Stomach Adenocarcinoma (TCGA, Nature 2014)
tgct_tcga Testicular Germ Cell Cancer (TCGA, Provisional)
thca_tcga Thyroid Carcinoma (TCGA, Provisional)
thym_tcga Thymoma (TCGA, Provisional)
ucec_tcga Uterine Corpus Endometrial Carcinoma (TCGA, Provisional)
ucs_tcga Uterine Carcinosarcoma (TCGA, Provisional)
uvm_tcga Uveal Melanoma (TCGA, Provisional)
Also noticed that the following studies have a _cna_seq case list instead of _cnaseq. We should make them consistent. @ritikakundra logged here: https://github.com/cBioPortal/datahub/issues/116
acyc_mskcc_2013
blca_mskcc_solit_2012
coadread_tcga_pub
luad_broad
ov_tcga_pub
prad_broad
prad_broad_2013
prad_mich
prad_mskcc
sarc_mskcc
Is there any validation code to catch missing case lists? @pieterlukasse
I am going to make a checklist here for resolved missing cnaseq case lists:
@jjgao there are some studies listed above that do not have overlapping sample IDs in cases_sequenced and cases_cna. I'll update this list as I go along.
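As a rough illustration of the overlap check, a _cnaseq list can be thought of as the intersection of the sequenced and CNA sample IDs. The sketch below uses made-up sample IDs and a hypothetical helper name. It also shows how an ID with a trailing vial letter fails to match its truncated counterpart:

```python
def build_cnaseq_case_list(sequenced_ids, cna_ids):
    """Return the sorted sample IDs present in both input lists."""
    return sorted(set(sequenced_ids) & set(cna_ids))


# Hypothetical sample IDs for illustration only.
sequenced = ["TCGA-XX-0001-01", "TCGA-XX-0002-01", "TCGA-XX-0003-01"]
cna = ["TCGA-XX-0002-01", "TCGA-XX-0003-01A", "TCGA-XX-0004-01"]

# Only TCGA-XX-0002-01 matches exactly; TCGA-XX-0003-01A does not match
# TCGA-XX-0003-01 because the comparison is on the literal string.
overlap = build_cnaseq_case_list(sequenced, cna)
```

An empty or unexpectedly small intersection can therefore mean inconsistent ID formats rather than a genuine lack of shared samples.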
@jjgao I believe there's currently no validation for missing specific case lists, except for the _all case list:
ERROR: -: No case list found for stable_id 'teststudy_all', consider adding 'add_global_case_list: true' to the study metadata file
Could we also document the required case lists in File Formats and align them with the default selections on the query page?
Also, are the categories such as all_cases_with_cna_data still used? The documentation is a bit vague about it.
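A check for missing specific case lists could look roughly like the sketch below. It is not the existing validator code: the function name is hypothetical, and the mapping from alteration types to the _sequenced, _cna, and _cnaseq suffixes is an assumption based on the case list names mentioned in this thread.

```python
def missing_case_lists(study_id, alteration_types, existing_stable_ids):
    """Return expected case list stable IDs that are not present.

    Assumed rules (illustrative only):
      - every study should have <study>_all
      - MUTATION_EXTENDED implies <study>_sequenced
      - COPY_NUMBER_ALTERATION implies <study>_cna
      - both together imply <study>_cnaseq
    """
    types = set(alteration_types)
    expected = {f"{study_id}_all"}
    if "MUTATION_EXTENDED" in types:
        expected.add(f"{study_id}_sequenced")
    if "COPY_NUMBER_ALTERATION" in types:
        expected.add(f"{study_id}_cna")
    if {"MUTATION_EXTENDED", "COPY_NUMBER_ALTERATION"} <= types:
        expected.add(f"{study_id}_cnaseq")
    return sorted(expected - set(existing_stable_ids))


# Example: a study with both profile types but no _cnaseq case list.
missing = missing_case_lists(
    "blca_tcga",
    ["MUTATION_EXTENDED", "COPY_NUMBER_ALTERATION"],
    ["blca_tcga_all", "blca_tcga_sequenced", "blca_tcga_cna"],
)
```

Whether such a finding should be an error or only a warning would depend on which case lists end up documented as required.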
@angelicaochoa re: no overlapping sample IDs in cases_sequenced and cases_cna for ov_tcga, there must be something wrong in the case lists. This query shows mutations and CNAs on the same samples.
@sandertan: I am thinking of re-implementing case lists: https://docs.google.com/document/d/1aBbkTAFv5nCqBv66BOgwvlt5kymgt0lQ7l_pqaFrupc/edit?usp=sharing
@jjgao the missing cnaseq case lists are resolved. The only study that did not actually have overlapping CNA and mutation samples is ov_tcga.
Thanks, @angelicaochoa!
But in this query (http://www.cbioportal.org/index.do?session_id=5a1eb215498eb8b3d560ef6a) (all ov_tcga samples selected), there are a lot of samples with both mutations and CNAs. Could you double check if cases_sequenced and cases_cna were generated correctly?
@angelicaochoa I tried two queries, one for sequenced tumors (http://www.cbioportal.org/index.do?session_id=5a1eb2af498eb8b3d560ef77), and the other for CNA tumors (http://www.cbioportal.org/index.do?session_id=5a1eb2bd498eb8b3d560ef79), and looked at the Download tab. There are a lot of overlapping samples between the two queries.
@angelicaochoa did you have a chance to look into the ovarian study? There are definitely many overlapping samples between the mutation and CNA profiles.
@jjgao there might have been an issue where the case list sample IDs were not in standardized formats? That should be addressed now, however, with the latest datahub updates to the provisional TCGA studies.
@yichaoS @ritikakundra were the TCGA provisional studies updated in the public portal yet?
@jjgao There might have been an issue where the TCGA sample IDs were not in the standardized/truncated format, but I think this should be resolved in the latest TCGA provisional updates to datahub.
@yichaoS @ritikakundra can you confirm whether the missing cnaseq case list for ov_tcga is resolved?
It's missing from studies such as blca_tcga, brca_tcga_pub2015, thca_tcga, and more.
We used to auto-generate the case list _cnaseq ("Tumor Samples with sequencing and CNA data") when importing.