BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html
289 stars 110 forks source link

Error and weird result of TCGAanalyze_Normalization function using TCGAbiolinks for normalization of raw HTSeq counts #164

Open Jasonmbg opened 6 years ago

Jasonmbg commented 6 years ago

Dear BioCommunity,

i recently implemented the R package TCGAbiolinks, to download raw HTSEQ counts for a provinsional cancer TCGA dataset (COAD). My code used was the following:

query.exp.hg38 <- GDCquery(project = "TCGA-COAD", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "HTSeq - Counts")

GDCdownload(query.exp.hg38,files.per.chunk = 50)

exp.hg38 <- GDCprepare(query = query.exp.hg38, save=TRUE,directory=mypath,save.filename = "exp.coad.htseqcounts.rda")

exp.hg38 class: RangedSummarizedExperiment dim: 56963 521 metadata(1): data_release assays(1): HTSeq - Counts rownames(56963): ENSG00000000003 ENSG00000000005 ... ENSG00000281912 ENSG00000281920 rowData names(3): ensembl_gene_id external_gene_name original_ensembl_gene_id colnames(521): TCGA-3L-AA1B-01A-11R-A37K-07 TCGA-DM-A1D8-01A-11R-A155-07 ... TCGA-AA-3675-01A-02R-0905-07 TCGA-G4-6323-01A-11R-1723-07 colData names(101): sample patient ... subtype_vascular_invasion_present subtype_vital_status

test <- TCGAanalyze_Normalization(tabDF=exp.hg38,geneInfo) I Need about 0 seconds for this Complete Normalization Upper Quantile [Processing 80k elements /s]
Step 1 of 4: newSeqExpressionSet ... Step 2 of 4: withinLaneNormalization ... Error in names(y) <- 1:length(y) : 'names' attribute [2] must be the same length as the vector [0] Timing stopped at: 0 0 0.04

test2 <- TCGAbiolinks::TCGAanalyze_Normalization(tabDF = exp.hg38,geneInfo=TCGAbiolinks::geneInfoHT,method="gcContent") I Need about 371 seconds for this Complete Normalization Upper Quantile [Processing 80k elements /s]
Step 1 of 4: newSeqExpressionSet ... Step 2 of 4: withinLaneNormalization ... Step 3 of 4: betweenLaneNormalization ... Step 4 of 4: .quantileNormalization ...

dim(test3) [1] 23238 521

Any ideas about the first above error ? as also why the second approach with gcContent worked, but the different number of features in the second approach is reduced to more than the half of the 56963 initial number ? From a small search in the function, in mentions in the geneInfo argument:

geneInfo= Information matrix of 20531 genes about geneLength and gcContent. Two objects are provided: TCGAbiolinks::geneInfoHT,TCGAbiolinks::geneInfo

However, how it explains the above issue ? perhaps it has to do something with the EDASeq R package, which is mentioned for implementing the above normalization function ?

Briefly, my main goals here, is to perform downstream DE analysis between cancer and normal samples, as also a survival analysis with a specific subset of genes.

Thank you in advance,

Efstathios

torongs82 commented 6 years ago

Hi @Jasonmbg TCGAanalyze_Normalization follows the EDASeq package to normalize RNAseq 's data. You have two strategies:

(i) use data from Gene expression aligned against hg19 in that case you can use geneInfo object in TCGAanalyze_Normalization

(ii) use data from Gene expression aligned against hg38 in that case you can use geneInfoHT object in TCGAanalyze_Normalization

If you are interested to use HTSEQ counts of course you need to consider the (ii) strategy.

For the difference about the size of those objects I am working to provide a function to update them especially for the geneInfoHT object.

But for the moment if you follow this part of our vignette

http://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html#htseq_data:_downstream_analysis_brca

you can perform your downstream DE analysis between cancer and normal that you need.

You had that error because the name of the genes can be as GeneSymbol for hg19 or Ensemble ID for hg38.

geneInfo[100:105,] geneLength gcContent chr
ABCD4 "3649" "0.520416552480132" "chr14" ABCE1 "4531" "0.380710659898477" "chr4" ABCF1 "3594" "0.525319977740679" "chr6" ABCF2 "4308" "0.527623026926648" "chr7" ABCF3 "2790" "0.57741935483871" "chr3" ABCG1 "4972" "0.551689460981496" "chr21" head(geneInfoHT) geneLength gcContent ENSG00000000003 4535 0.39956 ENSG00000000005 1610 0.42484 ENSG00000000419 1207 0.41591 ENSG00000000457 6883 0.41174 ENSG00000000460 5967 0.42986 ENSG00000000938 3474 0.57340

If you need anything else you can write me back, Thank you. Antonio.

Jasonmbg commented 6 years ago

Dear Antonio, thank you for your valuable answer and considerations on this matter.

i actually followed your vignette and to proceed i used the second option you mentioned, as i have downloaded the harmonized data with the raw HTSEq counts:

test3 <- TCGAbiolinks::TCGAanalyze_Normalization(tabDF = exp.hg38,geneInfo=TCGAbiolinks::geneInfoHT,method="gcContent")

dataFilt <- TCGAanalyze_Filtering(tabDF = test3, method = "quantile", qnt.cut =  0.25)

samplesNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))

samplesTP <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))

dataDEGs <- TCGAanalyze_DEA(dataFilt[,samplesNT],
dataFilt[,samplesTP],"Normal", "Tumor")

dataDEGs.filt <- TCGAanalyze_DEA(dataFilt[,samplesNT],
dataFilt[,samplesTP],"Normal", "Tumor",fdr.cut = 0.01,
logFC.cut = 1,method = "glmLRT")

dataDEGsFiltLevel <- TCGAanalyze_LevelTab(dataDEGs,"Tumor","Normal",                  dataFilt[,samplesTP],dataFilt[,samplesNT])

Thus, you believe at the current moment this discrepancy in the number of features returned using the geneInfo=TCGAbiolinks::geneInfoHT with the method="gcContent" would not have a profound effect on DE analysis, correct ?

My second issue, was that afterwards i tried to implement a quick survival analysis using the TCGAanalyze_SurvivalKM:

groupNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))
groupC <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))

tabSurvKM<-TCGAanalyze_SurvivalKM(clinical_patient=clinical_patient_Cancer,dataGE=dataFilt,Genelist = genes.sel,Survresult = F,ThreshTop=0.67,ThreshDown=0.33,group1=groupNT, group2=groupC)
5549.
Error in survdiff.fit(y, groups, strata.keep, rho) : 
  There is only 1 group

Any idea about this weird error ?

sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7600)

locale: [1] LC_COLLATE=Greek_Greece.1253 LC_CTYPE=Greek_Greece.1253
[3] LC_MONETARY=Greek_Greece.1253 LC_NUMERIC=C
[5] LC_TIME=Greek_Greece.1253

attached base packages: [1] parallel stats4 stats graphics grDevices utils
[7] datasets methods base

other attached packages: [1] edgeR_3.16.5 limma_3.30.13
[3] org.Hs.eg.db_3.4.0 AnnotationDbi_1.36.2
[5] EDASeq_2.8.0 ShortRead_1.32.1
[7] GenomicAlignments_1.10.1 Rsamtools_1.26.2
[9] Biostrings_2.42.1 XVector_0.14.1
[11] BiocParallel_1.8.2 biomaRt_2.30.0
[13] SummarizedExperiment_1.4.0 Biobase_2.34.0
[15] GenomicRanges_1.26.4 GenomeInfoDb_1.10.3
[17] IRanges_2.8.2 S4Vectors_0.12.2
[19] BiocGenerics_0.20.0 TCGAbiolinks_2.7.4
[21] devtools_1.13.2

Best Regards,

Efstathios