HHHit closed this issue 6 years ago
This data is controlled, so you need to set the token.file argument.
And you should use the GDC client tools as suggested by GDC when the data is too big.
GDCdownload(query, method = "client", token.file = file)
TCGAbiolinks is also using gdc-client for those cases. I'm quite sure the API method will not work.
Is there any way I could bypass the gdc-client? It cannot be installed on the system I am required to run on.
I don't think so.
Ok, thanks for the reply.
Hi, I'm running into the same issue, no matter how big the chunks are:
Downloading data for project TCGA-LUAD
GDCdownload will download 2167 files. A total of 214.444281679 GB
Downloading chunk 1 of 434 (5 files, size = 404.137405 MB) as Wed_Jan_24_13_36_06_2018_0.tar.gz
Downloading: 350 MB
At least one of the chunks download was not correct. We will retry
Downloading chunk 1 of 434 (5 files, size = 404.137405 MB) as Wed_Jan_24_13_36_06_2018_0.tar.gz
Downloading: 350 MB
Error in GDCdownload.aux(server, manifest.aux, name.aux, path) :
There was an error in the download process (we might had a connection problem with GDC server).
Please run this function it again.
Try using method = `client` or setting files.per.chunk to a small number.
Of course I've tried with method='client', but then I get the following error:
Downloading data for project TCGA-LUAD
trying URL 'https://gdc.cancer.gov/system/files/authenticated%20user/0/gdc-client_v1.3.0_Ubuntu14.04_x64.zip'
Content type 'application/zip' length 24647158 bytes (23.5 MB)
==================================================
downloaded 23.5 MB
GDCdownload will download: 214.444281679 GB
Executing GDC client with the following command:
./gdc-client download -m gdc_manifest.txt
Traceback (most recent call last):
File "gdc-client", line 7, in <module>
File "/tmp/pip-build-TioGfr/PyInstaller/PyInstaller/loader/pyimod03_importers.py", line 389, in load_module
File "build/bdist.linux-x86_64/egg/gdc_client/upload/__init__.py", line 1, in <module>
File "/tmp/pip-build-TioGfr/PyInstaller/PyInstaller/loader/pyimod03_importers.py", line 389, in load_module
File "build/bdist.linux-x86_64/egg/gdc_client/upload/parser.py", line 12, in <module>
File "/tmp/pip-build-TioGfr/PyInstaller/PyInstaller/loader/pyimod03_importers.py", line 389, in load_module
File "build/bdist.linux-x86_64/egg/gdc_client/upload/client.py", line 6, in <module>
File "/tmp/pip-build-TioGfr/PyInstaller/PyInstaller/loader/pyimod03_importers.py", line 546, in load_module
ImportError: /usr/lib64/libc.so.6: version `GLIBC_2.18' not found (required by /tmp/_MEI2BYIr4/libstdc++.so.6)
Failed to execute script gdc-client
Error in move(i, file.path(path, i)) :
I could not find the file: a11a4f1e-3b68-4c0e-8259-7ab630658a7c
Any hints on this issue?
Could you please send your sessionInfo() and the query and download code you are using?
Sure. Query and download code:
p <- 'TCGA-LUAD'
clin.query <- GDCquery(project=p, data.category='Clinical', legacy = TRUE)
GDCdownload(clin.query, method = "client")
sessionInfo():
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /mnt/beegfs/soft/R/R-3.4.2/lib64/R/lib/libRblas.so
LAPACK: /mnt/beegfs/soft/R/R-3.4.2/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C
[3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8
[5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8
[7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] TCGAbiolinks_2.7.19
loaded via a namespace (and not attached):
[1] colorspace_1.3-2 selectr_0.3-1
[3] rjson_0.2.15 hwriter_1.3.2
[5] class_7.3-14 modeltools_0.2-21
[7] mclust_5.4 circlize_0.4.3
[9] XVector_0.16.0 GenomicRanges_1.28.6
[11] GlobalOptions_0.0.12 ggpubr_0.1.6
[13] matlab_1.0.2 ggrepel_0.7.0
[15] flexmix_2.3-14 bit64_0.9-7
[17] AnnotationDbi_1.38.2 mvtnorm_1.0-6
[19] xml2_1.1.1 codetools_0.2-15
[21] splines_3.4.2 R.methodsS3_1.7.1
[23] mnormt_1.5-5 doParallel_1.0.11
[25] DESeq_1.28.0 robustbase_0.92-8
[27] knitr_1.18 geneplotter_1.54.0
[29] jsonlite_1.5 Rsamtools_1.28.0
[31] km.ci_0.5-2 broom_0.4.3
[33] annotate_1.54.0 cluster_2.0.6
[35] kernlab_0.9-25 R.oo_1.21.0
[37] readr_1.1.1 compiler_3.4.2
[39] httr_1.3.1 assertthat_0.2.0
[41] Matrix_1.2-12 lazyeval_0.2.1
[43] limma_3.32.10 tools_3.4.2
[45] bindrcpp_0.2 gtable_0.2.0
[47] glue_1.2.0 GenomeInfoDbData_0.99.0
[49] reshape2_1.4.2 dplyr_0.7.4
[51] ggthemes_3.4.0 ShortRead_1.34.2
[53] Rcpp_0.12.14 Biobase_2.36.2
[55] trimcluster_0.1-2 Biostrings_2.44.2
[57] nlme_3.1-131 rtracklayer_1.36.6
[59] iterators_1.0.8 fpc_2.1-10
[61] psych_1.7.8 stringr_1.2.0
[63] rvest_0.3.2 XML_3.98-1.9
[65] dendextend_1.6.0 edgeR_3.18.1
[67] DEoptimR_1.0-8 zoo_1.8-0
[69] zlibbioc_1.22.0 MASS_7.3-47
[71] scales_0.5.0 aroma.light_3.6.0
[73] hms_0.4.0 parallel_3.4.2
[75] SummarizedExperiment_1.6.5 RColorBrewer_1.1-2
[77] curl_3.0 ComplexHeatmap_1.14.0
[79] memoise_1.1.0 gridExtra_2.3
[81] KMsurv_0.1-5 ggplot2_2.2.1
[83] downloader_0.4 biomaRt_2.32.1
[85] latticeExtra_0.6-28 stringi_1.1.6
[87] RSQLite_2.0 genefilter_1.58.1
[89] S4Vectors_0.14.7 foreach_1.4.3
[91] GenomicFeatures_1.28.5 BiocGenerics_0.22.1
[93] BiocParallel_1.10.1 shape_1.4.3
[95] GenomeInfoDb_1.12.3 rlang_0.1.6
[97] pkgconfig_2.0.1 prabclus_2.2-6
[99] matrixStats_0.52.2 bitops_1.0-6
[101] lattice_0.20-35 purrr_0.2.4
[103] bindr_0.1 cmprsk_2.2-7
[105] GenomicAlignments_1.12.2 bit_1.1-12
[107] plyr_1.8.4 magrittr_1.5
[109] R6_2.2.2 IRanges_2.10.5
[111] DelayedArray_0.2.7 DBI_0.7
[113] mgcv_1.8-22 foreign_0.8-69
[115] pillar_1.1.0 whisker_0.3-2
[117] survival_2.41-3 RCurl_1.95-4.8
[119] nnet_7.3-12 tibble_1.4.2
[121] EDASeq_2.10.0 survMisc_0.5.4
[123] viridis_0.4.0 GetoptLong_0.1.6
[125] locfit_1.5-9.1 grid_3.4.2
[127] sva_3.24.4 data.table_1.10.4-3
[129] blob_1.1.0 ConsensusClusterPlus_1.40.0
[131] digest_0.6.14 diptest_0.75-7
[133] xtable_1.8-2 tidyr_0.7.2
[135] R.utils_2.6.0 stats4_3.4.2
[137] munsell_0.4.3 viridisLite_0.2.0
[139] survminer_0.4.1
@tiagochst did you have a glance at the code to check what is failing?
Sorry, are you trying to download all clinical data?
The problem is with the execution of the GDC client program.
It reports: /usr/lib64/libc.so.6: version `GLIBC_2.18' not found
If you run ldd --version in the terminal, you can check the installed version.
Your code is working for me.
Hi @tiagochst , It seems I have a lower version:
$ ldd --version
ldd (GNU libc) 2.17
Copyright (C) 2012 Free Software Foundation, Inc.
However, I'm running it on an HPC cluster, and I don't know whether the admin would agree to update that library, since other applications could fail. Can you think of a solution other than upgrading glibc system-wide? I am on CentOS 7, by the way.
Thanks for the reply.
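The same check can be done from within R before attempting a client download. The following is only a minimal sketch: the helper names parse_glibc_version and glibc_is_sufficient are hypothetical (they are not part of TCGAbiolinks), and it assumes the version number is the last token on the first line of the ldd output, as in the output quoted above.

```r
# Parse the glibc version from the first line of `ldd --version` output.
# Assumption: the version is the last token on that line, as in
# "ldd (GNU libc) 2.17".
parse_glibc_version <- function(ldd_output) {
  first_line <- trimws(ldd_output[1])
  hit <- regmatches(first_line, regexpr("[0-9]+(\\.[0-9]+)+$", first_line))
  if (length(hit) == 0) {
    return(NULL)  # version could not be recognised
  }
  numeric_version(hit)
}

# 2.18 is the version named in the ImportError earlier in the thread.
glibc_is_sufficient <- function(ldd_output, required = "2.18") {
  found <- parse_glibc_version(ldd_output)
  !is.null(found) && found >= numeric_version(required)
}

# On a Linux machine the real output can be captured with:
#   ldd_output <- system2("ldd", "--version", stdout = TRUE)
# Here the output quoted in this thread is used instead.
ldd_output <- c("ldd (GNU libc) 2.17",
                "Copyright (C) 2012 Free Software Foundation, Inc.")
glibc_is_sufficient(ldd_output)
```

With the output from this thread the check fails, which matches the ImportError raised by the Ubuntu build of gdc-client.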
Hi @tiagochst, do you think it's possible to change the code of gdc-client so that it's also compatible with glibc < 2.18?
You'll have to ask at: https://github.com/NCI-GDC/gdc-client
I've downloaded the gdc-client build for CentOS, and it works fine. The problem I see here is that TCGAbiolinks downloads the build for Ubuntu regardless of the Linux distribution you have. Can that be changed?
By the way, when I download the files with gdc-client, it does not organize the files according to their "data type". As a result, I do not get GDCdata/TCGA-LUAD/legacy/Clinical/Tissue_slide_image/folder_with_svs_file, but GDCdata/TCGA-LUAD/legacy/Clinical/folder_with_svs_file/. I therefore have to change the function GDCprepare_clinic so that it leaves out the "data type" directory when reading the files. The following:
files <- file.path(query$project, source,
                   gsub(" ", "_", query$results[[1]]$data_category),
                   gsub(" ", "_", query$results[[1]]$data_type),
                   gsub(" ", "_", query$results[[1]]$file_id),
                   gsub(" ", "_", query$results[[1]]$file_name))
would be converted to:
files <- file.path(query$project, source,
                   gsub(" ", "_", query$results[[1]]$data_category),
                   gsub(" ", "_", query$results[[1]]$file_id),
                   gsub(" ", "_", query$results[[1]]$file_name))
I added code to handle CentOS. For the code above, you should download one data type at a time. I'll add a check that the query contains only one data type before the download.
p <- 'TCGA-LUAD'
for (dtype in c("Clinical Supplement", "Pathology report", "Clinical data", "Tissue slide image")) {
  clin.query <- GDCquery(project = p,
                         data.category = 'Clinical',
                         data.type = dtype,
                         legacy = TRUE,
                         barcode = "TCGA-78-8660")
  GDCdownload(clin.query, method = "client")
}
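For files that were already fetched with a standalone gdc-client, and therefore sit directly under the data-category directory, the path change proposed earlier for GDCprepare_clinic could be made switchable instead of hard-coded. This is a minimal sketch, not the package's actual code: build_file_paths and its include_data_type argument are hypothetical, and the query object below is a hand-made stand-in with only the fields used in the snippets above.

```r
# Build the expected local path of each file in a query result.
# include_data_type = FALSE reproduces the layout without the
# "data type" directory described earlier in the thread.
build_file_paths <- function(query, source, include_data_type = TRUE) {
  res <- query$results[[1]]
  underscore <- function(x) gsub(" ", "_", x)

  parts <- list(query$project, source, underscore(res$data_category))
  if (include_data_type) {
    parts <- c(parts, list(underscore(res$data_type)))
  }
  parts <- c(parts, list(underscore(res$file_id), underscore(res$file_name)))

  do.call(file.path, parts)
}

# Stand-in for a GDCquery() result; the file id and name are made up.
mock_query <- list(
  project = "TCGA-LUAD",
  results = list(data.frame(
    data_category = "Clinical",
    data_type = "Tissue slide image",
    file_id = "example-file-id",
    file_name = "example slide.svs",
    stringsAsFactors = FALSE
  ))
)

build_file_paths(mock_query, "legacy")
build_file_paths(mock_query, "legacy", include_data_type = FALSE)
```

The first call yields a path containing Tissue_slide_image; the second omits that directory, matching what the standalone client produced.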
Hi @tiagochst, I've reinstalled TCGAbiolinks (devtools::install_github(repo = "BioinformaticsFMRP/TCGAbiolinks")) but it still downloads the wrong gdc-client: gdc-client_v1.3.0_Ubuntu14.04_x64.zip. Am I missing something?
By the way, I have inspected the function GDCdownload, and R somehow does not recognise the function GDCclientInstall.
Thanks
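Independently of TCGAbiolinks, one way to confirm which distribution R is running on is to read /etc/os-release. The following is only a sketch: parse_os_release_id is a hypothetical helper, not the logic TCGAbiolinks uses to pick a client build, and the example lines are illustrative rather than copied from the machine in this thread.

```r
# Extract the distribution identifier from the lines of /etc/os-release.
# Returns NA if no ID= line is present.
parse_os_release_id <- function(lines) {
  id_line <- grep("^ID=", lines, value = TRUE)
  if (length(id_line) == 0) {
    return(NA_character_)
  }
  # Drop the key and any surrounding double quotes.
  gsub('^ID=|"', "", id_line[1])
}

# On a real system: lines <- readLines("/etc/os-release")
# Illustrative content for a CentOS 7 machine:
example_lines <- c('NAME="CentOS Linux"',
                   'VERSION="7 (Core)"',
                   'ID="centos"',
                   'ID_LIKE="rhel fedora"')
parse_os_release_id(example_lines)
```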
There might be a bug.
What does the command Sys.info() return?
@tiagochst
> Sys.info()
sysname                                release
"Linux"                                "3.10.0-514.el7.x86_64"
version                                nodename
"#1 SMP Tue Nov 22 16:42:41 UTC 2016"  "nodo00"
machine                                login
"x86_64"                               "root"
user                                   effective_user
"root"                                 "root"
Besides this bug, I found another one. When parsing the clinical information, not all the files downloaded via GDCdownload are in XML format; some of them are txt files. Therefore, using the parseXML function (https://github.com/BioinformaticsFMRP/TCGAbiolinks/blob/32df036da5d0430668fb7508a891c39527a42f3c/R/clinical.R#L285) to read all files stored in the files variable (all of this inside the function GDCprepare_clinic) produces the following error:
> clin <- parseXML(files, xpath, clinical.info)
| | 0%Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Start tag expected, '<' not found [4]
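A possible workaround, until the function handles this itself, would be to keep only the XML files before they reach the parser. This is a minimal sketch under the assumption that the XML files can be recognised by their .xml extension; keep_xml_files is a hypothetical helper and the file names below are made up.

```r
# Keep only paths that end in .xml (case-insensitive), so that txt files
# never reach an XML parser.
keep_xml_files <- function(files) {
  files[grepl("\\.xml$", files, ignore.case = TRUE)]
}

# Made-up example paths mixing both formats.
files <- c("GDCdata/TCGA-LUAD/legacy/Clinical/id-1/clinical_record.xml",
           "GDCdata/TCGA-LUAD/legacy/Clinical/id-2/clinical_table.txt",
           "GDCdata/TCGA-LUAD/legacy/Clinical/id-3/OTHER_RECORD.XML")
xml_files <- keep_xml_files(files)
xml_files
```

The filtered vector could then be passed on in place of the original files variable.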
I want to download two WGS BAM files, which are very big, about 600 GB in total. I tried using files.per.chunk = 100. The script I wrote is:
library(TCGAbiolinks)
library(SummarizedExperiment)
library(dplyr)
library(DT)
GDCdownload(query, method = "api", files.per.chunk = 100)
but I got the following errors:
Is TCGAbiolinks able to download such a large amount of data? I also want to download some controlled data; where should I enter my account and password in this program? Because of system limitations I cannot use gdc-client, so I hope to use this software to download the data. I hope somebody can help. Thanks!
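As the reply at the top of the thread says, controlled data is accessed through the token.file argument rather than an account and password. A small sketch of validating the token file before starting a long download follows; check_token_file and the file name gdc-user-token.txt are hypothetical, and the GDCdownload call is the one quoted in that reply.

```r
# Fail early if the token file is missing or empty, instead of
# discovering the problem partway through a large download.
check_token_file <- function(path) {
  if (!file.exists(path)) {
    stop("Token file not found: ", path)
  }
  if (file.info(path)$size == 0) {
    stop("Token file is empty: ", path)
  }
  normalizePath(path)
}

# Usage (not run here; requires TCGAbiolinks and an existing query):
# token <- check_token_file("gdc-user-token.txt")  # hypothetical file name
# GDCdownload(query, method = "client", token.file = token)
```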