BioinformaticsFMRP / TCGAbiolinks

GDCprepare() does not work with last update of GDC v32.0 for RNA-Seq #493

Open g27182818 opened 2 years ago

g27182818 commented 2 years ago

Since the new GDC release on March 29, 2022, the GDCdownload() function still works, but the GDCprepare() function gives an error when the query is for RNA-Seq data. Here is minimal code to reproduce the issue:

library('TCGAbiolinks')
project_name <- "TCGA-ACC"
# Defines the query to the GDC
query <- GDCquery(project = project_name,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  experimental.strategy = "RNA-Seq",
                  workflow.type = "STAR - Counts")

# Download data using api
GDCdownload(query, method = "api")
# Read the downloaded data into a single SummarizedExperiment object
data <- GDCprepare(query,
                   summarizedExperiment = TRUE)

Which produces the following error:

> data <- GDCprepare(query)
|===========================================================================================================|100%                      Completed after 13 s 
Error in `stop_subscript()`:
! Can't subset columns that don't exist.
x Locations 2, 3, and 4 don't exist.
i There are only 1 column.
Run `rlang::last_error()` to see where the error occurred.
There were 50 or more warnings (use warnings() to see the first 50)
guohout commented 2 years ago

Have you solved this problem?

g27182818 commented 2 years ago

As I understand it, the problem is that the STAR - Counts files now come with much more information, so the GDCprepare() function is unable to read the new format. As a workaround, I decided to open each downloaded file individually and append the needed columns to a data frame. The code I'm using now is this:

library('TCGAbiolinks')
project_name <- "TCGA-ACC"

# Defines the query to the GDC
query <- GDCquery(project = project_name,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  experimental.strategy = "RNA-Seq",
                  workflow.type = "STAR - Counts")

# Get metadata matrix
metadata <- query[[1]][[1]]

# Download data using api
GDCdownload(query, method = "api")

# Get main directory where data is stored
main_dir <- file.path("GDCdata", project_name)
# Get the full paths of all downloaded files
file_list <- list.files(main_dir, recursive = TRUE, full.names = TRUE)

# Read the first downloaded file to get the gene names
test_tab <- read.table(file = file_list[1], sep = '\t', header = TRUE)
# Delete header lines that don't contain useful information
test_tab <- test_tab[-c(1:4),]
# STAR counts and tpm datasets, both starting with the gene name column
tpm_data_frame <- data.frame(test_tab[,1])
count_data_frame <- data.frame(test_tab[,1])

# Append cycle to get the complete matrix
for (i in seq_along(file_list)) {
  # Read table
  test_tab <- read.table(file = file_list[i], sep = '\t', header = TRUE)
  # Delete lines that are not useful
  test_tab <- test_tab[-c(1:4),]
  # Column bind of tpm (column 7) and counts (column 4) data
  tpm_data_frame <- cbind(tpm_data_frame, test_tab[,7])
  count_data_frame <- cbind(count_data_frame, test_tab[,4])
  # Print progress from 0 to 1
  print(i/length(file_list))
}

This works and gets the data but is much slower than the original GDCprepare() function.
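One likely reason the loop is slow is that each cbind() call copies the growing data frame. A minimal sketch of an alternative reads every file once and binds the columns in a single step. It assumes the same file layout the workaround relies on (four leading rows to drop, counts in column 4, TPM in column 7); the helper names read_star_file, build_matrices, and write_toy_file are hypothetical, and the toy files contain made-up values rather than real GDC data.

```r
# Hypothetical helper: read one file and keep only the columns the workaround
# uses (column 1 = gene ID, column 4 = counts, column 7 = TPM).
read_star_file <- function(path, skip_rows = 4) {
  tab <- read.table(path, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
  tab <- tab[-seq_len(skip_rows), ]
  list(gene = tab[[1]], count = tab[[4]], tpm = tab[[7]])
}

# Read every file once, then bind all columns in a single step.
build_matrices <- function(file_list) {
  parsed <- lapply(file_list, read_star_file)
  count <- do.call(cbind, lapply(parsed, `[[`, "count"))
  tpm <- do.call(cbind, lapply(parsed, `[[`, "tpm"))
  rownames(count) <- parsed[[1]]$gene
  rownames(tpm) <- parsed[[1]]$gene
  list(count = count, tpm = tpm)
}

# Toy files that mimic the assumed layout (made-up values, not real GDC data).
write_toy_file <- function(path, offset) {
  toy <- data.frame(
    gene_id = c("N_1", "N_2", "N_3", "N_4", "geneA", "geneB"),
    gene_name = c("", "", "", "", "A", "B"),
    gene_type = c("", "", "", "", "protein_coding", "lncRNA"),
    unstranded = c(1, 2, 3, 4, 10, 20) + offset,
    stranded_first = 0,
    stranded_second = 0,
    tpm_unstranded = c(NA, NA, NA, NA, 1.5, 2.5) + offset
  )
  write.table(toy, path, sep = "\t", quote = FALSE, row.names = FALSE)
}

toy_files <- file.path(tempdir(), c("toy_a.tsv", "toy_b.tsv"))
write_toy_file(toy_files[1], offset = 0)
write_toy_file(toy_files[2], offset = 1)

res <- build_matrices(toy_files)
print(res$count)
```

With real downloads, file_list from the workaround would replace toy_files. The sketch takes the gene order from the first file and assumes every file lists the genes in that same order.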

guohout commented 2 years ago

Thanks for your approach

t-carroll commented 2 years ago

I also had a similar issue, but it is now fixed with the update to 2.23.6, i.e. BiocManager::install("BioinformaticsFMRP/TCGAbiolinks"). Thanks for the workaround @g27182818 and the quick update @tiagochst!

g27182818 commented 2 years ago

Just in case someone has the same problem as me, BiocManager::install("BioinformaticsFMRP/TCGAbiolinks") was showing the following error:

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  namespace 'TCGAbiolinksGUI.data' 1.14.0 is being loaded, but >= 1.15.1 is required
Calls: <Anonymous> ... withCallingHandlers -> loadNamespace -> namespaceImport -> loadNamespace
Execution halted
ERROR: lazy loading failed for package 'TCGAbiolinks'
* removing 'C:/Users/Usuario/OneDrive/Documentos/R/win-library/4.1/TCGAbiolinks'
* restoring previous 'C:/Users/Usuario/OneDrive/Documentos/R/win-library/4.1/TCGAbiolinks'
Installation paths not writeable, unable to update packages
  path: C:/Program Files/R/R-4.1.2/library
  packages:
    class, cluster, foreign, MASS, Matrix, mgcv, nlme, nnet, rpart, spatial, survival
Warning message:
In i.p(...) :
  installation of package ‘C:/Users/Usuario/AppData/Local/Temp/RtmpKWbI9z/filec681ec526c7/TCGAbiolinks_2.23.7.tar.gz’ had non-zero exit status

This happened because the TCGAbiolinksGUI.data package also had to be installed directly from GitHub. So the way to get the new GDCprepare() function is:

BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")

This first updates TCGAbiolinksGUI.data to the latest version (1.15.1) and then installs the fixed version of TCGAbiolinks.
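To see in advance whether the dependency needs updating, the installed version can be compared with the one named in the error message. This is a minimal sketch using only base R; the helper name needs_update is hypothetical, and the install call is left as a comment so that nothing is downloaded when the snippet runs.

```r
# Hypothetical helper: TRUE when `pkg` is missing or older than `min_version`.
needs_update <- function(pkg, min_version) {
  # system.file() returns "" when the package is not installed
  if (!nzchar(system.file(package = pkg))) {
    return(TRUE)
  }
  utils::packageVersion(pkg) < min_version
}

# Intended use (commented out so nothing is installed here):
# if (needs_update("TCGAbiolinksGUI.data", "1.15.1")) {
#   BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
# }

# Quick check against a package that ships with every R installation
print(needs_update("stats", "0.0.1"))
```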

tiagochst commented 2 years ago

Yes, I am still updating the package. It should be stable in the next few days. I updated the gene information to use GENCODE v36, which GDC now uses. That is why I needed to update TCGAbiolinksGUI.data.

hyjforesight commented 2 years ago

In my case, this works:

BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("ExperimentHub")

Restart R, then:

BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")

aysenuroner commented 2 years ago

I've tried all week! @hyjforesight saved me

snijesh commented 2 years ago

This is a good step, but I think the sample names are missing from the matrix.
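One way to add the sample names is to match each file name against the query metadata. The sketch below assumes the metadata table has a file_name column and a cases column holding the sample barcode; that is an assumption about the table's layout, not something this thread confirms. The helper name label_samples and the toy values are hypothetical. It expects a matrix with exactly one column per file, so the leading gene name column of the workaround's data frames would have to be set aside first.

```r
# Hypothetical helper: name matrix columns after the sample each file belongs to.
# Assumes `metadata` has columns `file_name` and `cases` (an assumption).
label_samples <- function(mat, file_list, metadata) {
  stopifnot(ncol(mat) == length(file_list))
  idx <- match(basename(file_list), metadata$file_name)
  if (anyNA(idx)) {
    stop("Some files have no matching row in the metadata")
  }
  colnames(mat) <- metadata$cases[idx]
  mat
}

# Toy example with made-up file names and barcodes
metadata <- data.frame(
  file_name = c("b.tsv", "a.tsv"),
  cases = c("SAMPLE-B", "SAMPLE-A"),
  stringsAsFactors = FALSE
)
file_list <- c("GDCdata/x/a.tsv", "GDCdata/y/b.tsv")
mat <- matrix(1:4, nrow = 2)

labelled <- label_samples(mat, file_list, metadata)
print(colnames(labelled))
```

Matching by file name rather than by position matters because list.files() returns paths sorted alphabetically, which need not be the row order of the metadata.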

ShixiangWang commented 2 years ago

It takes a very long time after the progress bar reaches 100%. My console is still busy; is that normal? Should a notice be added for this case?

> library(TCGAbiolinks)
> proj <- "TCGA-STAD"
> query <- GDCquery(
+   project = proj,
+   data.category = "Transcriptome Profiling",
+   data.type = "Gene Expression Quantification",
+   workflow.type = "STAR - Counts"
+ )
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-STAD
--------------------
oo Filtering results
--------------------
ooo By data.type
ooo By workflow.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
> GDCdownload(query)
Downloading data for project TCGA-STAD
Of the 407 files for download 407 already exist.
All samples have been already downloaded
> data <- GDCprepare(query)
|==================================================================================================================|100%                      Completed after 50 s 
ShixiangWang commented 2 years ago

I just found that the code below significantly slows down the process.

https://github.com/BioinformaticsFMRP/TCGAbiolinks/blob/6cd187eb10b27b260e16c7cb25216fdef919d43d/R/prepare.R#L1448-L1451

Using data.table instead speeds it up:

df <- data.table::rbindlist(x, use.names = TRUE, idcol = "case_barcode")
data.table::dcast(df, gene_id + gene_name + gene_type ~ case_barcode, value.var = colnames(df)[-c(1:4)])
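To make the reshaping step concrete, here is a small sketch of the same two calls on toy data. It assumes x is a named list with one table per sample, each holding gene_id, gene_name, and gene_type plus the value columns; the barcodes and values are made up, and the data.table package must be installed.

```r
library(data.table)

# Toy input: one table per sample, named by a made-up barcode
x <- list(
  "SAMPLE-A" = data.table(
    gene_id = c("g1", "g2"),
    gene_name = c("A", "B"),
    gene_type = "protein_coding",
    unstranded = c(10L, 20L),
    tpm_unstranded = c(1.5, 2.5)
  ),
  "SAMPLE-B" = data.table(
    gene_id = c("g1", "g2"),
    gene_name = c("A", "B"),
    gene_type = "protein_coding",
    unstranded = c(11L, 21L),
    tpm_unstranded = c(2.5, 3.5)
  )
)

# Stack all samples into one long table; the list names become `case_barcode`
df <- rbindlist(x, use.names = TRUE, idcol = "case_barcode")

# Spread every value column (all columns after the first four) across samples
wide <- dcast(
  df,
  gene_id + gene_name + gene_type ~ case_barcode,
  value.var = colnames(df)[-c(1:4)]
)
print(wide)
```

The result has one row per gene and one column per combination of value column and sample.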
sciencepeak commented 2 years ago

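Could you please update the tutorials (https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/download_prepare.html#Search_and_download_data_from_legacy_database_using_GDC_api_method) accordingly?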

Thanks.

tiagochst commented 2 years ago

They are being updated in the devel version on Bioconductor.

https://bioconductor.org/packages/3.15/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

You also need to update the package with the GitHub version.

PearlLiu-Dev commented 2 years ago

I also ran into this problem!

git-jrwang commented 2 years ago

I ran into a similar problem, this time with SNP data. It is not an error, however, but a large number of warnings.

The code:

query_snp <- GDCquery(project = paste0("TCGA-", cancerType),
                      data.category = "Simple Nucleotide Variation",
                      data.type = "Masked Somatic Mutation",
                      access = "open")

GDCdownload(query = query_snp, method = "api", directory = DataDir)

maf <- GDCprepare(query = query_snp,
                  directory = DataDir,
                  save = TRUE,
                  save.filename = "SNP_COAD_data.rda")

There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
2: One or more parsing issues, see problems() for details
3: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
4: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
5: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
6: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
7: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
8: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
9: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
10: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
11: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
12: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
13: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
14: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
15: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
16: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
17: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
18: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED

sessionInfo()

R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Traditional)_Taiwan.950 [2] LC_CTYPE=Chinese (Traditional)_Taiwan.950
[3] LC_MONETARY=Chinese (Traditional)_Taiwan.950 [4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Traditional)_Taiwan.950

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] MoonlightR_1.20.0 doParallel_1.0.17
[3] iterators_1.0.14 foreach_1.5.2
[5] SummarizedExperiment_1.24.0 Biobase_2.54.0
[7] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
[9] IRanges_2.28.0 S4Vectors_0.32.4
[11] BiocGenerics_0.40.0 MatrixGenerics_1.6.0
[13] matrixStats_0.62.0 TCGAbiolinks_2.25.0

loaded via a namespace (and not attached):
[1] shadowtext_0.1.2 circlize_0.4.14
[3] fastmatch_1.1-3 BiocFileCache_2.2.1
[5] plyr_1.8.7 igraph_1.3.1
[7] lazyeval_0.2.2 splines_4.1.3
[9] BiocParallel_1.28.3 ggplot2_3.3.6
[11] digest_0.6.29 yulab.utils_0.0.4
[13] htmltools_0.5.2 GOSemSim_2.20.0
[15] viridis_0.6.2 GO.db_3.14.0
[17] fansi_1.0.3 magrittr_2.0.3
[19] memoise_2.0.1 tzdb_0.3.0
[21] limma_3.50.3 Biostrings_2.62.0
[23] readr_2.1.2 graphlayouts_0.8.0
[25] vroom_1.5.7 R.utils_2.11.0
[27] enrichplot_1.14.2 prettyunits_1.1.1
[29] jpeg_0.1-9 colorspace_2.0-3
[31] blob_1.2.3 rvest_1.0.2
[33] rappdirs_0.3.3 ggrepel_0.9.1
[35] xfun_0.30 dplyr_1.0.9
[37] tcltk_4.1.3 crayon_1.5.1
[39] RCurl_1.98-1.6 jsonlite_1.8.0
[41] scatterpie_0.1.7 GEOquery_2.62.2
[43] ape_5.6-2 glue_1.6.2
[45] polyclip_1.10-0 gtable_0.3.0
[47] zlibbioc_1.40.0 XVector_0.34.0
[49] DelayedArray_0.20.0 shape_1.4.6
[51] scales_1.2.0 DOSE_3.20.1
[53] HiveR_0.3.63 DBI_1.1.2
[55] Rcpp_1.0.8.3 viridisLite_0.4.0
[57] progress_1.2.2 gridGraphics_0.5-1
[59] tidytree_0.3.9 bit_4.0.4
[61] htmlwidgets_1.5.4 httr_1.4.3
[63] fgsea_1.20.0 gplots_3.1.3
[65] RColorBrewer_1.1-3 ellipsis_0.3.2
[67] R.methodsS3_1.8.1 pkgconfig_2.0.3
[69] XML_3.99-0.9 farver_2.1.0
[71] dbplyr_2.1.1 utf8_1.2.2
[73] RISmed_2.3.0 ggplotify_0.1.0
[75] tidyselect_1.1.2 rlang_1.0.2
[77] reshape2_1.4.4 AnnotationDbi_1.56.2
[79] munsell_0.5.0 tools_4.1.3
[81] cachem_1.0.6 downloader_0.4
[83] cli_3.3.0 generics_0.1.2
[85] RSQLite_2.2.13 stringr_1.4.0
[87] fastmap_1.1.0 ggtree_3.2.1
[89] knitr_1.39 bit64_4.0.5
[91] tidygraph_1.2.1 caTools_1.18.2
[93] rgl_0.108.3 randomForest_4.7-1
[95] purrr_0.3.4 KEGGREST_1.34.0
[97] ggraph_2.0.5 nlme_3.1-157
[99] R.oo_1.24.0 aplot_0.1.4
[101] DO.db_2.9 xml2_1.3.3
[103] biomaRt_2.50.3 compiler_4.1.3
[105] filelock_1.0.2 curl_4.3.2
[107] png_0.1-7 treeio_1.18.1
[109] tibble_3.1.7 tweenr_1.0.2
[111] stringi_1.7.6 TCGAbiolinksGUI.data_1.15.1
[113] lattice_0.20-45 Matrix_1.4-1
[115] vctrs_0.4.1 pillar_1.7.0
[117] lifecycle_1.0.1 GlobalOptions_0.1.2
[119] parmigene_1.1.0 data.table_1.14.2
[121] bitops_1.0-7 patchwork_1.1.1
[123] qvalue_2.26.0 R6_2.5.1
[125] KernSmooth_2.23-20 gridExtra_2.3
[127] codetools_0.2-18 gtools_3.9.2
[129] MASS_7.3-57 assertthat_0.2.1
[131] withr_2.5.0 GenomeInfoDbData_1.2.7
[133] hms_1.1.1 clusterProfiler_4.2.2
[135] grid_4.1.3 ggfun_0.0.6
[137] tidyr_1.2.0 ggforce_0.3.3

qiz218591 commented 2 years ago

@g27182818, hi, are you using R 4.1.2 or R 4.2.0?