BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

Can't do COAD data normalization #515

Open git-jrwang opened 2 years ago

git-jrwang commented 2 years ago

Hi:

I want to use TCGAbiolinks to analyze COAD expression data. Here is what I did.

# Query platform Illumina HiSeq with a list of barcodes
query_Data <- GDCquery(
  project = paste0("TCGA-",cancerType),
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)

DataDir <- c("F:/Colon Cancer Project Data/R_data")

GDCdownload(query=query_Data,
            method = "api",
            directory = DataDir)

dataCOAD <- GDCprepare(query=query_Data,
                       directory = DataDir,
                       save = TRUE,
                       save.filename = "RNAseq_COAD_data.rda")

COAD_RNAseq <- TCGAanalyze_Preprocessing(data)

dataNorm <- TCGAanalyze_Normalization(
  tabDF = COAD_RNAseq, 
  geneInfo =  geneInfoHT
)

In the normalization process, I get the following error. 
I Need about  215 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]  
Step 1 of 4: newSeqExpressionSet ...
Step 2 of 4: withinLaneNormalization ...
Step 3 of 4: betweenLaneNormalization ...
Error in quantile.default(newX[, i], ...) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE

I also checked COAD_RNAseq, and it contains no NaN or missing values.

Since there is no option to omit NaN values, do you have any suggestions?
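
For readers who want to repeat the check described above, here is a minimal base R sketch. The matrix values and names are made up for illustration and only stand in for the preprocessed expression matrix; they are not taken from the COAD data.

```r
# Toy count matrix standing in for a preprocessed expression matrix
# (hypothetical values; genes in rows, samples in columns).
counts <- matrix(
  c(10, 0, 5, NA, 3, 8, 7, 2, NaN),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2", "s3"))
)

# anyNA() returns TRUE if the matrix holds at least one NA or NaN.
has_missing <- anyNA(counts)

# Row and column positions of every missing cell (is.na() is TRUE for NaN too).
bad_cells <- which(is.na(counts), arr.ind = TRUE)

# Number of non-finite values (NA, NaN, Inf, -Inf) per gene and per sample.
bad_per_gene <- rowSums(!is.finite(counts))
bad_per_sample <- colSums(!is.finite(counts))

print(has_missing)
print(bad_cells)
print(bad_per_gene)
print(bad_per_sample)
```

Running the same three checks on the object that is actually passed to the normalization step shows whether missing values are already present in the input.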

sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Traditional)_Taiwan.utf8 [2] LC_CTYPE=Chinese (Traditional)_Taiwan.utf8
[3] LC_MONETARY=Chinese (Traditional)_Taiwan.utf8 [4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Traditional)_Taiwan.utf8

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] SummarizedExperiment_1.26.1 Biobase_2.56.0 GenomicRanges_1.48.0
[4] GenomeInfoDb_1.32.1 IRanges_2.30.0 S4Vectors_0.34.0
[7] BiocGenerics_0.42.0 MatrixGenerics_1.8.0 matrixStats_0.62.0
[10] TCGAbiolinks_2.24.0

loaded via a namespace (and not attached):
[1] bitops_1.0-7 bit64_4.0.5 RColorBrewer_1.1-3
[4] filelock_1.0.2 progress_1.2.2 httr_1.4.3
[7] tools_4.2.0 utf8_1.2.2 R6_2.5.1
[10] DBI_1.1.2 colorspace_2.0-3 tidyselect_1.1.2
[13] prettyunits_1.1.1 bit_4.0.4 curl_4.3.2
[16] compiler_4.2.0 cli_3.3.0 rvest_1.0.2
[19] xml2_1.3.3 DelayedArray_0.22.0 rtracklayer_1.56.0
[22] scales_1.2.0 readr_2.1.2 rappdirs_0.3.3
[25] stringr_1.4.0 digest_0.6.29 Rsamtools_2.12.0
[28] R.utils_2.11.0 XVector_0.36.0 jpeg_0.1-9
[31] pkgconfig_2.0.3 dbplyr_2.1.1 fastmap_1.1.0
[34] rlang_1.0.2 RSQLite_2.2.14 BiocIO_1.6.0
[37] generics_0.1.2 hwriter_1.3.2.1 jsonlite_1.8.0
[40] BiocParallel_1.30.0 dplyr_1.0.9 R.oo_1.24.0
[43] RCurl_1.98-1.6 magrittr_2.0.3 GenomeInfoDbData_1.2.8
[46] Matrix_1.4-1 Rcpp_1.0.8.3 munsell_0.5.0
[49] fansi_1.0.3 lifecycle_1.0.1 R.methodsS3_1.8.1
[52] yaml_2.3.5 stringi_1.7.6 zlibbioc_1.42.0
[55] plyr_1.8.7 BiocFileCache_2.4.0 grid_4.2.0
[58] blob_1.2.3 parallel_4.2.0 crayon_1.5.1
[61] lattice_0.20-45 Biostrings_2.64.0 GenomicFeatures_1.48.0
[64] hms_1.1.1 KEGGREST_1.36.0 EDASeq_2.30.0
[67] knitr_1.39 pillar_1.7.0 rjson_0.2.21
[70] TCGAbiolinksGUI.data_1.16.0 biomaRt_2.52.0 XML_3.99-0.9
[73] glue_1.6.2 ShortRead_1.54.0 latticeExtra_0.6-29
[76] downloader_0.4 data.table_1.14.2 BiocManager_1.30.17
[79] png_0.1-7 vctrs_0.4.1 tzdb_0.3.0
[82] gtable_0.3.0 purrr_0.3.4 tidyr_1.2.0
[85] assertthat_0.2.1 cachem_1.0.6 ggplot2_3.3.6
[88] xfun_0.30 aroma.light_3.26.0 restfulr_0.0.13
[91] tibble_3.1.7 GenomicAlignments_1.32.0 AnnotationDbi_1.58.0
[94] memoise_2.0.1 ellipsis_0.3.2

tiagochst commented 2 years ago

In your code, you are passing data to TCGAanalyze_Preprocessing instead of dataCOAD, the object returned by GDCprepare:

dataCOAD <- GDCprepare(query=query_Data,
                       directory = DataDir,
                       save = TRUE,
                       save.filename = "RNAseq_COAD_data.rda")

COAD_RNAseq <- TCGAanalyze_Preprocessing(data)

should be

dataCOAD <- GDCprepare(query=query_Data,
                       directory = DataDir,
                       save = TRUE,
                       save.filename = "RNAseq_COAD_data.rda")

COAD_RNAseq <- TCGAanalyze_Preprocessing(dataCOAD)
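
A mix-up like this is easy to miss because, unless an object named data was assigned earlier in the session, the name data refers to the utils::data() function. The guard below is a minimal base R sketch of one way to catch it early; check_prepared is a hypothetical helper, not part of TCGAbiolinks.

```r
# With no earlier assignment, `data` is the dataset-loading function from utils.
print(is.function(utils::data))

# Hypothetical guard: stop early if a function was passed instead of a data object.
check_prepared <- function(x) {
  if (is.function(x)) {
    stop("Expected a prepared data object but got a function; was the wrong name passed?")
  }
  invisible(x)
}

# Passing the function is rejected ...
result <- tryCatch(
  {
    check_prepared(utils::data)
    "passed"
  },
  error = function(e) "rejected"
)
print(result)

# ... while an ordinary object (a toy list here) passes through unchanged.
toy <- list(counts = matrix(1:4, nrow = 2))
ok <- identical(check_prepared(toy), toy)
print(ok)
```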

snashraf commented 2 years ago

Hi Tiago,

I had a similar issue, and your answer solved it.

Is there a way to reuse the "RNAseq_COAD_data.rda" file? I have already downloaded the data and now want to run the analysis again. What is the best way to do that?

Regards,
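
For reference, an .rda file written with save() can be read back with base R's load(), which restores the stored objects and returns their names. The sketch below uses a temporary file and a toy data frame as stand-ins; the name of the object stored in "RNAseq_COAD_data.rda" is not stated in this thread, so the sketch reads it from load()'s return value instead of assuming it.

```r
# Write a toy object to an .rda file (stand-in for a previously saved dataset).
rda_file <- file.path(tempdir(), "example_data.rda")
prepared <- data.frame(sample = c("s1", "s2"), count = c(10, 20))
save(prepared, file = rda_file)
rm(prepared)

# load() restores the saved objects and returns their names as a character vector.
# Loading into a separate environment avoids overwriting objects in the workspace.
env <- new.env()
loaded_names <- load(rda_file, envir = env)

# Fetch the first restored object by name.
restored <- get(loaded_names[1], envir = env)

print(loaded_names)
print(restored)
```

The restored object can then be passed to the analysis functions without repeating the download.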

snashraf commented 2 years ago

Hi

I get an error when I run the normalization with method = "gcContent". It runs successfully without "gcContent". What do you think could be the reason for this?

data <- get(load("scripts/TCGA-ACC.Rda"))

dataPrep <- TCGAanalyze_Preprocessing(
  object = data, 
  cor.cut = 0.6
)

dataNorm <- TCGAanalyze_Normalization(
  tabDF = dataPrep,
  geneInfo = geneInfoHT,
  method = "gcContent"
)

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 12.3.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] grid stats4 parallel stats graphics grDevices utils datasets methods
[10] base

other attached packages:
[1] preprocessCore_1.52.1 ComplexHeatmap_2.6.2 EDASeq_2.24.0
[4] ShortRead_1.48.0 GenomicAlignments_1.26.0 SummarizedExperiment_1.20.0
[7] MatrixGenerics_1.2.1 matrixStats_0.62.0 Rsamtools_2.6.0
[10] GenomicRanges_1.42.0 GenomeInfoDb_1.26.7 Biostrings_2.58.0
[13] XVector_0.30.0 IRanges_2.24.1 S4Vectors_0.28.1
[16] BiocParallel_1.24.1 Biobase_2.50.0 BiocGenerics_0.36.1
[19] TCGAbiolinks_2.25.0