SKCM analysis - Githubissues

caloto commented 6 years ago

Hello! First of all, thank you so much for sharing this app with the community!! It is very easy to use and really, really powerful.

During the analysis of SKCM, an error is reported, as shown below:

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,samplesNT],
                            mat2 = dataFilt[,samplesTP],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01,
                            logFC.cut = 1,
                            method = "glmLRT")

Batch correction skipped since no factors provided
----------------------- DEA -------------------------------
there are Cond1 type Normal in samples
there are Cond2 type Tumor in 103 samples
there are 14893 features as miRNA or genes
I Need about 52 seconds for this DEA. [Processing 30k elements /s]
Error in edgeR::DGEList(counts = TOC, group = tumorType) : Length of 'group' must equal number of columns in 'counts'

The full code is here:

query <- GDCquery(project = "TCGA-SKCM",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  experimental.strategy = "RNA-Seq",
                  platform = "Illumina HiSeq",
                  file.type = "results",
                  legacy = TRUE)

GDCdownload(query)

RnaseqSE <- GDCprepare(query)

Rnaseq_CorOutliers <- TCGAanalyze_Preprocessing(RnaseqSE)

dataNorm <- TCGAanalyze_Normalization(tabDF = RnaseqSE, geneInfo = geneInfo)

dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile",
                                  qnt.cut = 0.25)

samplesNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), typesample = c("NT"))

samplesTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), typesample = c("TP"))

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,samplesNT],
                            mat2 = dataFilt[,samplesTP],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01,
                            logFC.cut = 1,
                            method = "glmLRT")

And, finally, I would like to ask you a brief question:

During the previous normalisation step, we are using the FPKM-UQ method. I am interested in the differential expression of a subset of genes. Should I use another approach, or is it fine to extract them from the final 'dataDEGs' table?
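For readers with the same question, here is a minimal base R sketch of pulling a gene subset out of a DEA result table. It uses a mock data frame in place of the real 'dataDEGs'; the gene names, values, and the assumption that genes are stored as row names are all hypothetical. Since the call above passes fdr.cut and logFC.cut, the returned table is presumably already thresholded, so a gene of interest that did not pass those cut-offs would simply be absent. The sketch reports such genes separately.

```r
# Hypothetical stand-in for the table returned by TCGAanalyze_DEA();
# gene names and values are invented for illustration only.
dataDEGs_mock <- data.frame(
  logFC = c(2.4, -1.7, 3.1),
  FDR   = c(0.001, 0.004, 0.0002),
  row.names = c("GENE_A", "GENE_B", "GENE_C")
)

# Genes of interest (hypothetical); GENE_X is deliberately not in the table.
genes_of_interest <- c("GENE_B", "GENE_X")

# Split the request into genes present in the table and genes that are absent.
found   <- intersect(genes_of_interest, rownames(dataDEGs_mock))
missing <- setdiff(genes_of_interest, rownames(dataDEGs_mock))

# drop = FALSE keeps a data frame even when only one gene is found.
subset_degs <- dataDEGs_mock[found, , drop = FALSE]

print(subset_degs)
print(missing)
```

If statistics are needed for every gene of interest regardless of significance, one option might be to rerun the comparison with looser cut-offs and filter afterwards.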

Thank you for your help, and congrats once more!

torongs82 commented 6 years ago

Hi @caloto, thank you for using our package TCGAbiolinks. We are here to help you and the community perform analyses in a better way. I looked at your workflow and it seems correct, but it is possible that you are missing normal samples (samplesNT) for that comparison when using SKCM as the cancer type. I suggest you check the samplesNT variable. In the meantime, if you are interested, you could also compare your 103 SKCM TP samples with the 369 available TCGA SKCM TM (metastatic) samples, or integrate the 607 GTEx skin normal samples and compare them with the 103 TCGA SKCM TP samples. For instance, you can use the following code:

dataGTEx_skin <- TCGAquery_recount2(project = "gtex", tissue = "skin")

For data with the FPKM-UQ normalization method, you need to retrieve the data aligned to the hg38 reference genome:

query.exp.hg38 <- GDCquery(project = "TCGA-SKCM",
                           data.category = "Transcriptome Profiling",
                           data.type = "Gene Expression Quantification",
                           workflow.type = "HTSeq - FPKM-UQ")

P.S. You can also consider molecular subtypes for your comparisons:

dataSubt <- TCGAquery_subtype(tumor = "SKCM")

SKCM subtype information from: doi:10.1016/j.cell.2015.05.044

table(dataSubt$MUTATIONSUBTYPES)

     BRAF_Hotspot_Mutants NF1_Any_Mutants RAS_Hotspot_Mutants Triple_WT
  17                  150              28                  92        46

For everything else please write us back. Thank you.

caloto commented 6 years ago

Hello @torongs82 , thank you for your quick response!

As far as I know, TCGA SKCM should have 'Solid Tissue Normal' samples, since the following code does not return an empty 'nl_ge_files' variable:

library(GenomicDataCommons)

nl_ge_files = files() %>%
  GenomicDataCommons::filter(~ cases.samples.sample_type == 'Solid Tissue Normal' &
                               cases.project.project_id == 'TCGA-SKCM' &
                               analysis.workflow_type == "HTSeq - Counts") %>%
  expand(c('cases','cases.samples')) %>%
  results_all() %>%
  as_tibble()

The same problem occurs with 'TCGA_STAD'.

GTEx integration is a great idea, but unfortunately my pipeline is restricted to TCGA data only.

Please correct me if I am wrong, but I think the 'TCGAanalyze_Normalization' function normalises the data by gene length by default, so that would be the FPKM method, wouldn't it? Then we filter by quantile only, hence FPKM-UQ. I am not sure whether I have understood this correctly...
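To make the distinction in this question concrete, here is a small base R sketch of a quantile filter on invented data. The rule used (keep genes whose mean across samples exceeds the qnt.cut quantile of all gene means) is an assumption about how such a filter typically behaves, not something taken from the TCGAbiolinks source. The point it illustrates is that a filter of this kind removes rows and leaves the remaining values untouched, which is a different operation from rescaling each sample by its upper quartile.

```r
# Toy expression matrix: 8 hypothetical genes x 3 hypothetical samples.
expr <- matrix(
  c(  1,  2,  3,
     10, 12, 11,
      0,  1,  0,
     50, 55, 60,
      5,  4,  6,
     20, 22, 21,
      3,  3,  3,
    100, 90, 95),
  nrow = 8, byrow = TRUE,
  dimnames = list(paste0("gene", 1:8), paste0("sample", 1:3))
)

# Assumed filtering rule: drop genes whose mean expression is at or below
# the 25% quantile of all gene means.
qnt_cut    <- 0.25
gene_means <- rowMeans(expr)
threshold  <- quantile(gene_means, probs = qnt_cut)

expr_filt <- expr[gene_means > threshold, , drop = FALSE]

# Fewer rows, but the surviving values are unchanged (no rescaling happened).
print(dim(expr))
print(dim(expr_filt))
print(identical(expr_filt["gene4", ], expr["gene4", ]))
```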

Finally, thank you for the advice. In fact, I was planning to perform the differential analysis over all possible cancer subtypes.

Thank you again for your amazing tool!

torongs82 commented 6 years ago

Hi @caloto, thank you again for your interest in TCGAbiolinks.

If you need samples with HTSeq-Counts you can follow the pipeline:

query.exp.hg38 <- GDCquery(project = "TCGA-SKCM",
                           data.category = "Transcriptome Profiling",
                           data.type = "Gene Expression Quantification",
                           workflow.type = "HTSeq - FPKM-UQ")

dataSKCM_barcodes <- query.exp.hg38$results[[1]]$cases

Looking in the different sample types:

table(substr(dataSKCM_barcodes,14,15))

 01  06  07  11
103 367   1   1
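The same counting step can be tried without downloading anything. The sketch below uses invented barcodes that only mimic the TCGA layout (they are not real samples) and shows how characters 14 and 15 carry the sample-type code.

```r
# Hypothetical barcodes, invented for illustration (TCGA-style layout only).
barcodes <- c(
  "TCGA-AA-0001-01A-11R-0001-07",
  "TCGA-AA-0002-01A-11R-0002-07",
  "TCGA-AA-0003-06A-11R-0003-07",
  "TCGA-AA-0004-11A-11R-0004-07"
)

# Characters 14-15 hold the two-digit sample-type code.
sample_codes <- substr(barcodes, 14, 15)
counts <- table(sample_codes)
print(counts)

# Code "11" corresponds to NT (Solid Tissue Normal).
n_normal <- sum(sample_codes == "11")
print(n_normal)
```

Checking n_normal before running a tumor-versus-normal comparison is a cheap way to notice that a project has too few normal samples.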

As you can see, there is only 1 normal sample (NT | Solid Tissue Normal), which you can easily detect with the function TCGAquery_SampleTypes(barcode, typesample).

I also looked at your example with GenomicDataCommons, and there is only one NT sample there as well.
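A single normal sample may also explain the exact wording of the original error, although this is only a plausible reading and has not been checked against the TCGAanalyze_DEA source. In base R, selecting one column from a matrix drops the result to a plain vector, so ncol() returns NULL. If group labels are then built from the two column counts (an assumption), their length no longer matches the combined matrix. The toy objects below are invented for illustration.

```r
# Toy matrix: 3 hypothetical genes x 4 hypothetical samples (1 normal, 3 tumor).
dataFilt_mock <- matrix(
  1:12, nrow = 3,
  dimnames = list(paste0("gene", 1:3), c("N1", "T1", "T2", "T3"))
)
samplesNT_mock <- "N1"
samplesTP_mock <- c("T1", "T2", "T3")

mat1_vec <- dataFilt_mock[, samplesNT_mock]                # dimension dropped: a vector
mat1_mat <- dataFilt_mock[, samplesNT_mock, drop = FALSE]  # still a 3 x 1 matrix
mat2     <- dataFilt_mock[, samplesTP_mock]

print(is.null(ncol(mat1_vec)))  # TRUE: a vector has no column count
print(ncol(mat1_mat))           # 1

# Assumed labelling scheme: repeat each condition by its column count.
# With the vector, c(NULL, 3) collapses to 3, so both labels repeat 3 times.
group_bad <- rep(c("Normal", "Tumor"), c(ncol(mat1_vec), ncol(mat2)))
group_ok  <- rep(c("Normal", "Tumor"), c(ncol(mat1_mat), ncol(mat2)))

print(length(group_bad))              # 6 labels
print(length(group_ok))               # 4 labels
print(ncol(cbind(mat1_mat, mat2)))    # 4 columns
```

Even if drop = FALSE avoided the length mismatch, one normal sample provides no replication for the normal group, so the comparisons suggested earlier in this thread remain the more useful route.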

table.code
   TP   TR   TB TRBM  TAP   TM  TAM THOC  TBM   NB   NT  NBC NEBV  NBM CELLC  TRB CELL   XP  XCL
 "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14"  "20" "40" "50" "60" "61"
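For convenience, the code table above can be turned into a lookup in base R. This is a small sketch that only re-types the values listed above; the variable names are illustrative.

```r
# Sample-type abbreviations and their two-digit codes, as listed above.
table_code <- c(
  TP = "01", TR = "02", TB = "03", TRBM = "04", TAP = "05",
  TM = "06", TAM = "07", THOC = "08", TBM = "09", NB = "10",
  NT = "11", NBC = "12", NEBV = "13", NBM = "14", CELLC = "20",
  TRB = "40", CELL = "50", XP = "60", XCL = "61"
)

# Invert the vector so that a code can be looked up to get its abbreviation.
code_to_abbrev <- setNames(names(table_code), table_code)

# The four codes seen in the SKCM barcode table.
skcm_codes <- c("01", "06", "07", "11")
print(code_to_abbrev[skcm_codes])
```

Applied to the SKCM counts, this reads as 103 TP, 367 TM, 1 TAM, and 1 NT.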

Thank you, best.

caloto commented 6 years ago

@torongs82 You are totally right, I have seen it from GDC directly. Thank you for your help!!

BioinformaticsFMRP / TCGAbiolinks

SKCM analysis #268