RMolania / TCGA_PanCancer_UnwantedVariation

Incomplete TCGA Cancer Subtypes #2

Open DarioS opened 2 years ago

DarioS commented 2 years ago

At present, TCGA Pan Cancer Datasets supports Cancer Biology for only four cancer types. These four cancer types (TCGA datasets) are Breast Cancer (BRCA), Lung Cancer (LUAD), Colon Cancer (COAD) and Rectum Cancer (READ). This implies that RUV-III analysis can only be performed for these four cancer types, because the RUV-III approach here requires at least one roughly known, biologically homogeneous subclass of samples shared across sources of unwanted variation.

I notice that more TCGA projects have subtype information. For example, Genomic Classification of Cutaneous Melanoma (Cell, 2015) defines:

BRAF Subtype: The largest genomic subtype is defined by the presence of BRAF hot-spot mutations.
RAS Subtype: The second major subtype is defined by the presence of RAS hot-spot mutations, including known amino acid changes with functional consequences, in all three RAS family members (N-, K- and H-RAS).
NF1 Subtype: The third most frequently observed SMG in the MAPK pathway was NF1, which was mutated in 14% of samples.
Triple Wild-Type Subtype: We defined the Triple-WT subtype (n = 46) as a heterogeneous subgroup characterized by a lack of hot-spot BRAF, N/H/K-RAS, or NF1 mutations.

and this is reflected in Bioconductor's curatedTCGAData package.

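library(curatedTCGAData)
# Download the SKCM mutation assay (dry.run = FALSE fetches the data instead of only listing it)
cutaneousMelanoma <- curatedTCGAData("SKCM", "Mutation", version = "2.0.1", dry.run = FALSE)
# Show the mutation-based subtype recorded for the first few patients
head(colData(cutaneousMelanoma)[, c("patientID", "MUTATIONSUBTYPES")])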
DataFrame with 6 rows and 2 columns
                patientID     MUTATIONSUBTYPES
              <character>          <character>
TCGA-BF-A1PU TCGA-BF-A1PU BRAF_Hotspot_Mutants
TCGA-BF-A1PV TCGA-BF-A1PV  RAS_Hotspot_Mutants
TCGA-BF-A1PX TCGA-BF-A1PX BRAF_Hotspot_Mutants
TCGA-BF-A1PZ TCGA-BF-A1PZ  RAS_Hotspot_Mutants
TCGA-BF-A1Q0 TCGA-BF-A1Q0            Triple_WT
TCGA-BF-A3DJ TCGA-BF-A3DJ BRAF_Hotspot_Mutants
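
As a rough illustration of the requirement quoted above (a subclass shared across sources of unwanted variation), one could cross-tabulate subtype labels against a batch variable. This is only a sketch on toy data: the `plate` column and its values are hypothetical, and the subtype labels are simply copied from the output above.

```r
# Toy annotation table; `plate` is a hypothetical batch variable,
# not a column known to exist in the curatedTCGAData colData.
annot <- data.frame(
  subtype = c("BRAF_Hotspot_Mutants", "BRAF_Hotspot_Mutants",
              "RAS_Hotspot_Mutants", "RAS_Hotspot_Mutants",
              "Triple_WT", "BRAF_Hotspot_Mutants"),
  plate   = c("P1", "P2", "P1", "P1", "P2", "P2")
)

# Count samples for every subtype-by-plate combination
counts <- table(annot$subtype, annot$plate)

# A subtype can only link batches if it is observed on at least two plates
sharedAcrossPlates <- rowSums(counts > 0) >= 2
sharedSubtypes <- names(sharedAcrossPlates)[sharedAcrossPlates]

print(counts)
print(sharedSubtypes)
```

In this toy table only the BRAF subtype appears on both plates, so it would be the only candidate for linking the two batches.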

Could the preprocessed data provided be more comprehensive or is there something special that I am overlooking which means that a data set such as melanoma can't actually be processed using the PRPS method?

RMolania commented 2 years ago

Hi Dario, thanks for your questions. To answer your question regarding the Pan Cancer RNA-seq datasets: I have created SummarizedExperiment objects for all the TCGA RNA-seq studies, including SKCM. I have collected many possible sample annotations and batch details for each cancer type, which can help TCGA users to better understand the data, particularly the different sources of unwanted variation. However, it was almost impossible to accurately identify "gene expression based" subtypes for all cancer types, as this requires careful analysis and prior knowledge about each cancer type. We are currently working on some other TCGA cancer types, including SKCM, to find major biological subtypes in order to be able to use RUV-III-PRPS.
For the TCGA BRCA, LUAD, COAD and READ RNA-seq studies, we either identified the cancer subtypes ourselves or contacted the TCGA Research Network to obtain those details.
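
To make concrete why subtype labels matter for PRPS, here is a minimal sketch of the pseudo-sample idea as generally described for the method: samples sharing the same biology and the same batch are averaged, and pseudo-samples of the same biology from different batches are then treated as pseudo-replicates. It uses simulated data; the variable names, group labels and the averaging helper are illustrative assumptions, not the code used in this repository.

```r
set.seed(1)

# Toy log-expression matrix: 4 genes x 6 samples (simulated values)
expr <- matrix(rnorm(24), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("s", 1:6)))

# Hypothetical biological subtype and batch label for each sample
subtype <- c("A", "A", "A", "A", "B", "B")
batch   <- c("b1", "b1", "b2", "b2", "b1", "b1")
group   <- paste(subtype, batch, sep = "_")

# Average the samples within each subtype-by-batch group to form pseudo-samples
pseudoSamples <- sapply(split(seq_along(group), group),
                        function(idx) rowMeans(expr[, idx, drop = FALSE]))

# Pseudo-samples of one subtype from two or more batches form a pseudo-replicate set
pseudoSubtype <- sub("_.*", "", colnames(pseudoSamples))
batchesPerSubtype <- table(pseudoSubtype)
usable <- names(batchesPerSubtype)[batchesPerSubtype >= 2]

print(colnames(pseudoSamples))
print(usable)
```

Here subtype "A" yields pseudo-samples in both batches and can serve as a pseudo-replicate set, while subtype "B" is confined to one batch and cannot, which is why reliable subtype labels are needed before the method can be applied.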