Open Jasonmbg opened 6 years ago
Hi @Jasonmbg TCGAanalyze_Normalization follows the EDASeq package to normalize RNAseq 's data. You have two strategies:
(i) use data from Gene expression aligned against hg19 in that case you can use geneInfo object in TCGAanalyze_Normalization
(ii) use data from Gene expression aligned against hg38 in that case you can use geneInfoHT object in TCGAanalyze_Normalization
If you are interested to use HTSEQ counts of course you need to consider the (ii) strategy.
For the difference about the size of those objects I am working to provide a function to update them especially for the geneInfoHT object.
But for the moment if you follow this part of our vignette
you can perform your downstream DE analysis between cancer and normal that you need.
You had that error because the name of the genes can be as GeneSymbol for hg19 or Ensemble ID for hg38.
geneInfo[100:105,] geneLength gcContent chr
ABCD4 "3649" "0.520416552480132" "chr14" ABCE1 "4531" "0.380710659898477" "chr4" ABCF1 "3594" "0.525319977740679" "chr6" ABCF2 "4308" "0.527623026926648" "chr7" ABCF3 "2790" "0.57741935483871" "chr3" ABCG1 "4972" "0.551689460981496" "chr21" head(geneInfoHT) geneLength gcContent ENSG00000000003 4535 0.39956 ENSG00000000005 1610 0.42484 ENSG00000000419 1207 0.41591 ENSG00000000457 6883 0.41174 ENSG00000000460 5967 0.42986 ENSG00000000938 3474 0.57340
If you need anything else you can write me back, Thank you. Antonio.
Dear Antonio, thank you for your valuable answer and considerations on this matter.
i actually followed your vignette and to proceed i used the second option you mentioned, as i have downloaded the harmonized data with the raw HTSEq counts:
test3 <- TCGAbiolinks::TCGAanalyze_Normalization(tabDF = exp.hg38,geneInfo=TCGAbiolinks::geneInfoHT,method="gcContent")
dataFilt <- TCGAanalyze_Filtering(tabDF = test3, method = "quantile", qnt.cut = 0.25)
samplesNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))
samplesTP <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))
dataDEGs <- TCGAanalyze_DEA(dataFilt[,samplesNT],
dataFilt[,samplesTP],"Normal", "Tumor")
dataDEGs.filt <- TCGAanalyze_DEA(dataFilt[,samplesNT],
dataFilt[,samplesTP],"Normal", "Tumor",fdr.cut = 0.01,
logFC.cut = 1,method = "glmLRT")
dataDEGsFiltLevel <- TCGAanalyze_LevelTab(dataDEGs,"Tumor","Normal", dataFilt[,samplesTP],dataFilt[,samplesNT])
Thus, you believe at the current moment this discrepancy in the number of features returned using the geneInfo=TCGAbiolinks::geneInfoHT with the method="gcContent" would not have a profound effect on DE analysis, correct ?
My second issue, was that afterwards i tried to implement a quick survival analysis using the TCGAanalyze_SurvivalKM:
groupNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT"))
groupC <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP"))
tabSurvKM<-TCGAanalyze_SurvivalKM(clinical_patient=clinical_patient_Cancer,dataGE=dataFilt,Genelist = genes.sel,Survresult = F,ThreshTop=0.67,ThreshDown=0.33,group1=groupNT, group2=groupC)
5549.
Error in survdiff.fit(y, groups, strata.keep, rho) :
There is only 1 group
Any idea about this weird error ?
sessionInfo() R version 3.3.2 (2016-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7600)
locale:
[1] LC_COLLATE=Greek_Greece.1253 LC_CTYPE=Greek_Greece.1253
[3] LC_MONETARY=Greek_Greece.1253 LC_NUMERIC=C
[5] LC_TIME=Greek_Greece.1253
attached base packages:
[1] parallel stats4 stats graphics grDevices utils
[7] datasets methods base
other attached packages:
[1] edgeR_3.16.5 limma_3.30.13
[3] org.Hs.eg.db_3.4.0 AnnotationDbi_1.36.2
[5] EDASeq_2.8.0 ShortRead_1.32.1
[7] GenomicAlignments_1.10.1 Rsamtools_1.26.2
[9] Biostrings_2.42.1 XVector_0.14.1
[11] BiocParallel_1.8.2 biomaRt_2.30.0
[13] SummarizedExperiment_1.4.0 Biobase_2.34.0
[15] GenomicRanges_1.26.4 GenomeInfoDb_1.10.3
[17] IRanges_2.8.2 S4Vectors_0.12.2
[19] BiocGenerics_0.20.0 TCGAbiolinks_2.7.4
[21] devtools_1.13.2
Best Regards,
Efstathios
Dear BioCommunity,
i recently implemented the R package TCGAbiolinks, to download raw HTSEQ counts for a provinsional cancer TCGA dataset (COAD). My code used was the following:
Any ideas about the first above error ? as also why the second approach with gcContent worked, but the different number of features in the second approach is reduced to more than the half of the 56963 initial number ? From a small search in the function, in mentions in the geneInfo argument:
geneInfo= Information matrix of 20531 genes about geneLength and gcContent. Two objects are provided: TCGAbiolinks::geneInfoHT,TCGAbiolinks::geneInfo
However, how it explains the above issue ? perhaps it has to do something with the EDASeq R package, which is mentioned for implementing the above normalization function ?
Briefly, my main goals here, is to perform downstream DE analysis between cancer and normal samples, as also a survival analysis with a specific subset of genes.
Thank you in advance,
Efstathios