Open g27182818 opened 2 years ago
Have you solved this problem?
As I understand it, the problem is that the STAR-Counts files now come with much more information, so the GDCprepare() function is unable to read the new format. As a workaround, I decided to open each downloaded file individually and append each needed column to a data frame. The code I'm using now is this:
library('TCGAbiolinks')
project_name <- "TCGA-ACC"
# Defines the query to the GDC
query <- GDCquery(project = project_name,
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
experimental.strategy = "RNA-Seq",
workflow.type = "STAR - Counts")
# Get metadata matrix
metadata <- query[[1]][[1]]
# Download data using api
GDCdownload(query, method = "api")
# Get main directory where data is stored
main_dir <- file.path("GDCdata", project_name)
# Get file list of downloaded files
file_list <- file.path("GDCdata", project_name,list.files(main_dir,recursive = TRUE))
# Read first downloaded to get gene names
test_tab <- read.table(file = file_list[1], sep = '\t', header = TRUE)
# Delete the first four rows, which don't contain useful information
test_tab <- test_tab[-c(1:4),]
# STAR counts and tpm datasets
tpm_data_frame <- data.frame(test_tab[,1])
count_data_frame <- data.frame(test_tab[,1])
# Append cycle to get the complete matrix
for (i in c(1:length(file_list))) {
# Read table
test_tab <- read.table(file = file_list[i], sep = '\t', header = TRUE)
# Delete not useful lines
test_tab <- test_tab[-c(1:4),]
# Column bind of tpm and counts data
tpm_data_frame <- cbind(tpm_data_frame, test_tab[,7])
count_data_frame <- cbind(count_data_frame, test_tab[,4])
# Print progress from 0 to 1
print(i/length(file_list))
}
This works and gets the data, but it is much slower than the original GDCprepare() function.
Thanks for your approach
I also had a similar issue, but it is now fixed with the update to 2.23.6, i.e. BiocManager::install("BioinformaticsFMRP/TCGAbiolinks"). Thanks for the workaround @g27182818 and the quick update @tiagochst!
Just in case someone has the same problem as me, BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
was showing the following error:
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
namespace 'TCGAbiolinksGUI.data' 1.14.0 is being loaded, but >= 1.15.1 is required
Calls: <Anonymous> ... withCallingHandlers -> loadNamespace -> namespaceImport -> loadNamespace
Execution halted
ERROR: lazy loading failed for package 'TCGAbiolinks'
* removing 'C:/Users/Usuario/OneDrive/Documentos/R/win-library/4.1/TCGAbiolinks'
* restoring previous 'C:/Users/Usuario/OneDrive/Documentos/R/win-library/4.1/TCGAbiolinks'
Installation paths not writeable, unable to update packages
path: C:/Program Files/R/R-4.1.2/library
packages:
class, cluster, foreign, MASS, Matrix, mgcv, nlme, nnet, rpart, spatial, survival
Warning message:
In i.p(...) :
installation of package ‘C:/Users/Usuario/AppData/Local/Temp/RtmpKWbI9z/filec681ec526c7/TCGAbiolinks_2.23.7.tar.gz’ had non-zero exit status
This happened because the TCGAbiolinksGUI.data package also had to be installed directly from GitHub. So the way to get the new GDCprepare() function is:
BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
This will first update TCGAbiolinksGUI.data to the latest version (1.15.1) and then install the fixed version of TCGAbiolinks.
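Before reinstalling, it may help to check whether the installed version already meets the requirement named in the error. This is a minimal sketch in base R; `needs_update()` is a hypothetical helper name (not part of TCGAbiolinks or BiocManager), and the 1.15.1 threshold is taken from the error message quoted above.

```r
# Hypothetical helper: TRUE when `pkg` is missing or older than `min_version`.
needs_update <- function(pkg, min_version) {
  # packageVersion() raises an error if the package is not installed
  installed <- tryCatch(utils::packageVersion(pkg), error = function(e) NULL)
  is.null(installed) || installed < package_version(min_version)
}

# The threshold comes from the loadNamespace() error quoted above.
if (needs_update("TCGAbiolinksGUI.data", "1.15.1")) {
  message("TCGAbiolinksGUI.data is missing or older than 1.15.1; ",
          "install it from GitHub before TCGAbiolinks.")
}
```

If the message appears, running the two install commands in the order shown above should resolve the namespace error.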
Yes, I am still updating the package. It might be stable in the next few days. I updated the gene information to use GENCODE v36, which GDC now uses. That is why I need to update TCGAbiolinksGUI.data.
In my case, this sequence works:
BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("ExperimentHub")
# Restart R, then:
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
I've tried all week! @hyjforesight saved me
this is a good step, but I think sample names are missing in the matrix
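On the missing sample names: one option is to name each column after its sample barcode while reading the files. This is a minimal sketch in base R with two assumptions. First, each STAR-Counts file has a `#` comment line, a header row, and `N_*` summary rows before the genes, as the workaround above implies. Second, the query metadata has `file_name` and `cases` columns. `read_star_matrix()` is a hypothetical helper, not a TCGAbiolinks function.

```r
# Hypothetical helper: build a gene x sample matrix from STAR-Counts files.
# `files`    : paths to the downloaded .tsv files
# `metadata` : data frame with `file_name` and `cases` columns (assumed layout)
# `column`   : column to extract, e.g. "unstranded" or "tpm_unstranded"
read_star_matrix <- function(files, metadata, column = "unstranded") {
  read_one <- function(path) {
    # Lines starting with "#" are skipped by read.table's default comment.char
    tab <- read.table(path, sep = "\t", header = TRUE, quote = "",
                      stringsAsFactors = FALSE)
    # Drop the summary rows (N_unmapped, N_multimapping, ...)
    tab[!startsWith(tab$gene_id, "N_"), c("gene_id", column)]
  }
  tabs <- lapply(files, read_one)
  # Bind all columns in one call instead of growing a data frame in a loop.
  # This assumes every file lists the same genes in the same order.
  mat <- do.call(cbind, lapply(tabs, function(tab) tab[[column]]))
  rownames(mat) <- tabs[[1]]$gene_id
  # Map each file to its sample barcode through the file name
  colnames(mat) <- metadata$cases[match(basename(files), metadata$file_name)]
  mat
}
```

With the objects from the workaround above, this would be called as `read_star_matrix(file_list, metadata, "unstranded")`, giving a matrix whose column names are the barcodes in `metadata$cases`.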
It takes a very long time after the prepare step reaches 100%. My console is still busy; is that normal? Should a note be added for this case?
> library(TCGAbiolinks)
> proj <- "TCGA-STAD"
> query <- GDCquery(
+ project = proj,
+ data.category = "Transcriptome Profiling",
+ data.type = "Gene Expression Quantification",
+ workflow.type = "STAR - Counts"
+ )
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-STAD
--------------------
oo Filtering results
--------------------
ooo By data.type
ooo By workflow.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
> GDCdownload(query)
Downloading data for project TCGA-STAD
Of the 407 files for download 407 already exist.
All samples have been already downloaded
> data <- GDCprepare(query)
|==================================================================================================================|100% Completed after 50 s
I just found the code that significantly slows the process. Using data.table instead will speed it up:
df <- data.table::rbindlist(x, use.names = TRUE, idcol = "case_barcode")
data.table::dcast(df, gene_id + gene_name + gene_type ~ case_barcode, value.var = colnames(df)[-c(1:4)])
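To make the suggestion concrete, here is a toy illustration of the stack-then-reshape step. The list `x`, the barcodes, and the values are all made up for illustration; the real per-sample tables carry more value columns.

```r
library(data.table)

# Toy stand-in for `x`: a named list with one table per sample (made-up values)
x <- list(
  "SAMPLE-A" = data.table(gene_id = c("g1", "g2"), gene_name = c("A", "B"),
                          gene_type = "protein_coding",
                          unstranded = c(10L, 20L)),
  "SAMPLE-B" = data.table(gene_id = c("g1", "g2"), gene_name = c("A", "B"),
                          gene_type = "protein_coding",
                          unstranded = c(30L, 40L))
)

# Stack the tables; the list names are stored in the `case_barcode` column
df <- rbindlist(x, use.names = TRUE, idcol = "case_barcode")

# Spread back to one row per gene and one column per sample
wide <- dcast(df, gene_id + gene_name + gene_type ~ case_barcode,
              value.var = "unstranded")
print(wide)
```

When several value columns are passed in `value.var`, as in the snippet above, `dcast()` prefixes each output column with the name of the value column it came from.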
After the new GDC release on March 29, 2022, the GDCdownload() function still works but the GDCprepare() function gives an error when the query is for RNA-Seq data. Here is the minimal code to reproduce the issue:
library('TCGAbiolinks')
project_name <- "TCGA-ACC"
# Defines the query to the GDC
query <- GDCquery(project = project_name,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  experimental.strategy = "RNA-Seq",
                  workflow.type = "STAR - Counts")
# Download data using api
GDCdownload(query, method = "api")
# Read downloaded data and get a single summarized experiment object
data <- GDCprepare(query, summarizedExperiment = TRUE)
Which produces the following error:
> data <- GDCprepare(query)
|===========================================================================================================|100% Completed after 13 s
Error in `stop_subscript()`:
! Can't subset columns that don't exist.
x Locations 2, 3, and 4 don't exist.
i There are only 1 column.
Run `rlang::last_error()` to see where the error occurred.
There were 50 or more warnings (use warnings() to see the first 50)
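Could you please update the tutorials (https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/download_prepare.html#Search_and_download_data_from_legacy_database_using_GDC_api_method) accordingly?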
Thanks.
They are being updated in the devel version at Bioconductor.
https://bioconductor.org/packages/3.15/bioc/vignettes/TCGAbiolinks/inst/doc/index.html
You also need to update the package with the GitHub version.
I also ran into this problem!
I ran into a similar problem with SNP data. However, it is not an error; it is a warning, and there are a lot of them.
The code:
query_snp <- GDCquery(
  project = paste0("TCGA-", cancerType),
  data.category = "Simple Nucleotide Variation",
  data.type = "Masked Somatic Mutation",
  access = "open"
)
GDCdownload(query=query_snp, method = "api", directory = DataDir)
maf <- GDCprepare(query = query_snp, directory = DataDir, save = TRUE, save.filename = "SNP_COAD_data.rda")
There were 50 or more warnings (use warnings() to see the first 50)
warnings()
Warning messages:
1: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
2: One or more parsing issues, see problems() for details
3: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
4: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
5: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
6: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
7: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
8: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
9: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
10: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
11: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
12: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
13: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
14: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
15: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
16: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
17: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
18: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Traditional)_Taiwan.950
[2] LC_CTYPE=Chinese (Traditional)_Taiwan.950
[3] LC_MONETARY=Chinese (Traditional)_Taiwan.950
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Traditional)_Taiwan.950
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] MoonlightR_1.20.0 doParallel_1.0.17
[3] iterators_1.0.14 foreach_1.5.2
[5] SummarizedExperiment_1.24.0 Biobase_2.54.0
[7] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
[9] IRanges_2.28.0 S4Vectors_0.32.4
[11] BiocGenerics_0.40.0 MatrixGenerics_1.6.0
[13] matrixStats_0.62.0 TCGAbiolinks_2.25.0
loaded via a namespace (and not attached):
[1] shadowtext_0.1.2 circlize_0.4.14
[3] fastmatch_1.1-3 BiocFileCache_2.2.1
[5] plyr_1.8.7 igraph_1.3.1
[7] lazyeval_0.2.2 splines_4.1.3
[9] BiocParallel_1.28.3 ggplot2_3.3.6
[11] digest_0.6.29 yulab.utils_0.0.4
[13] htmltools_0.5.2 GOSemSim_2.20.0
[15] viridis_0.6.2 GO.db_3.14.0
[17] fansi_1.0.3 magrittr_2.0.3
[19] memoise_2.0.1 tzdb_0.3.0
[21] limma_3.50.3 Biostrings_2.62.0
[23] readr_2.1.2 graphlayouts_0.8.0
[25] vroom_1.5.7 R.utils_2.11.0
[27] enrichplot_1.14.2 prettyunits_1.1.1
[29] jpeg_0.1-9 colorspace_2.0-3
[31] blob_1.2.3 rvest_1.0.2
[33] rappdirs_0.3.3 ggrepel_0.9.1
[35] xfun_0.30 dplyr_1.0.9
[37] tcltk_4.1.3 crayon_1.5.1
[39] RCurl_1.98-1.6 jsonlite_1.8.0
[41] scatterpie_0.1.7 GEOquery_2.62.2
[43] ape_5.6-2 glue_1.6.2
[45] polyclip_1.10-0 gtable_0.3.0
[47] zlibbioc_1.40.0 XVector_0.34.0
[49] DelayedArray_0.20.0 shape_1.4.6
[51] scales_1.2.0 DOSE_3.20.1
[53] HiveR_0.3.63 DBI_1.1.2
[55] Rcpp_1.0.8.3 viridisLite_0.4.0
[57] progress_1.2.2 gridGraphics_0.5-1
[59] tidytree_0.3.9 bit_4.0.4
[61] htmlwidgets_1.5.4 httr_1.4.3
[63] fgsea_1.20.0 gplots_3.1.3
[65] RColorBrewer_1.1-3 ellipsis_0.3.2
[67] R.methodsS3_1.8.1 pkgconfig_2.0.3
[69] XML_3.99-0.9 farver_2.1.0
[71] dbplyr_2.1.1 utf8_1.2.2
[73] RISmed_2.3.0 ggplotify_0.1.0
[75] tidyselect_1.1.2 rlang_1.0.2
[77] reshape2_1.4.4 AnnotationDbi_1.56.2
[79] munsell_0.5.0 tools_4.1.3
[81] cachem_1.0.6 downloader_0.4
[83] cli_3.3.0 generics_0.1.2
[85] RSQLite_2.2.13 stringr_1.4.0
[87] fastmap_1.1.0 ggtree_3.2.1
[89] knitr_1.39 bit64_4.0.5
[91] tidygraph_1.2.1 caTools_1.18.2
[93] rgl_0.108.3 randomForest_4.7-1
[95] purrr_0.3.4 KEGGREST_1.34.0
[97] ggraph_2.0.5 nlme_3.1-157
[99] R.oo_1.24.0 aplot_0.1.4
[101] DO.db_2.9 xml2_1.3.3
[103] biomaRt_2.50.3 compiler_4.1.3
[105] filelock_1.0.2 curl_4.3.2
[107] png_0.1-7 treeio_1.18.1
[109] tibble_3.1.7 tweenr_1.0.2
[111] stringi_1.7.6 TCGAbiolinksGUI.data_1.15.1
[113] lattice_0.20-45 Matrix_1.4-1
[115] vctrs_0.4.1 pillar_1.7.0
[117] lifecycle_1.0.1 GlobalOptions_0.1.2
[119] parmigene_1.1.0 data.table_1.14.2
[121] bitops_1.0-7 patchwork_1.1.1
[123] qvalue_2.26.0 R6_2.5.1
[125] KernSmooth_2.23-20 gridExtra_2.3
[127] codetools_0.2-18 gtools_3.9.2
[129] MASS_7.3-57 assertthat_0.2.1
[131] withr_2.5.0 GenomeInfoDbData_1.2.7
[133] hms_1.1.1 clusterProfiler_4.2.2
[135] grid_4.1.3 ggfun_0.0.6
[137] tidyr_1.2.0 ggforce_0.3.3
@g27182818, hi, are you using R 4.1.2 or R 4.2.0?