Fail to download WGS bam files

HHHit commented 6 years ago

I want to download two WGS BAM files, which are very large (about 600 GB in total). I tried using files.per.chunk = 100. The script I wrote is:

library(TCGAbiolinks)
library(SummarizedExperiment)
library(dplyr)
library(DT)

query <- GDCquery(project = "TCGA-OV",
                data.category="Raw sequencing data",
                data.type="Aligned reads",
                platform="Illumina HiSeq",
                experimental.strategy = "WGS",
                barcode=c("TCGA-04-1331-01A-01D-A324-10","TCGA-04-1331-10A-01D-A324-10"),
                legacy = TRUE)

GDCdownload(query, method = "api", files.per.chunk = 100)

but I got the following errors:

Downloading data for project TCGA-OV
GDCdownload will download 2 files. A total of 603.790759262 GB
Downloading chunk 1 of 1 (2 files, size = 603.790759262 GB) as Tue_Oct_31_11_10_30_2017_0.tar.gz
  |======================================================================| 100%
/usr/bin/gtar: This does not look like a tar archive

gzip: stdin: not in gzip format
/usr/bin/gtar: Child returned status 1
/usr/bin/gtar: Error is not recoverable: exiting now
Download completed
At least one of the chunks download was not correct. We will retry
Downloading chunk 1 of 1 (2 files, size = 603.790759262 GB) as Tue_Oct_31_11_10_30_2017_0.tar.gz
  |======================================================================| 100%
/usr/bin/gtar: This does not look like a tar archive

gzip: stdin: not in gzip format
/usr/bin/gtar: Child returned status 1
/usr/bin/gtar: Error is not recoverable: exiting now
Download completed
Error in if (ret == 1) break : argument is of length zero
Calls: GDCdownload ... tryCatchList -> tryCatchOne -> <Anonymous> -> GDCdownload.by.chunk
Execution halted

Is TCGAbiolinks able to download such huge data? I also want to download some controlled data; where should I enter my account and password in this program? Because of the limitations of the system I work on, I cannot use gdc-client, so I hope to use this software to download the data. I hope somebody can help me. Thanks!

tiagochst commented 6 years ago

This data is controlled, so you need to set the token.file argument.

You should also use the GDC client tools, as GDC suggests, when the data is too big:

GDCdownload(query, method = "client", token.file = file)
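
A minimal sketch of the call above with a guard around the token file. The wrapper name download_controlled and the token path in the usage comment are hypothetical; GDCdownload, method, and token.file are the function and arguments quoted in this thread. The wrapper only fails early when the token file is missing, before any download starts.

```r
# Hypothetical wrapper (not part of TCGAbiolinks): stop early if the
# token file is missing, then pass the query on to GDCdownload.
download_controlled <- function(query, token_path) {
  if (!file.exists(token_path)) {
    stop("GDC token file not found: ", token_path)
  }
  TCGAbiolinks::GDCdownload(query, method = "client", token.file = token_path)
}

# Usage (assumes `query` was built with GDCquery as in the first comment,
# and that the token path below is replaced with a real one):
# download_controlled(query, "~/gdc-user-token.txt")
```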

tiagochst commented 6 years ago

TCGAbiolinks is also using gdc-client for those cases. I'm quite sure the API method will not work.

HHHit commented 6 years ago

Is there any way I could bypass the gdc-client? It cannot be installed on the system I am required to run on.

tiagochst commented 6 years ago

I don't think so.

HHHit commented 6 years ago

Ok, thanks for the reply.

jacorvar commented 6 years ago

Hi, I'm running into the same issue, no matter how big the chunks are:

Downloading data for project TCGA-LUAD
GDCdownload will download 2167 files. A total of 214.444281679 GB
Downloading chunk 1 of 434 (5 files, size = 404.137405 MB) as Wed_Jan_24_13_36_06_2018_0.tar.gz
Downloading: 350 MB     At least one of the chunks download was not correct. We will retry
Downloading chunk 1 of 434 (5 files, size = 404.137405 MB) as Wed_Jan_24_13_36_06_2018_0.tar.gz
Downloading: 350 MB     Error in GDCdownload.aux(server, manifest.aux, name.aux, path) : 
  There was an error in the download process (we might had a connection problem with GDC server).
Please run this function it again.
Try using method = `client` or setting files.per.chunk to a small number.
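
For reference, the chunk count in the log follows from the file count and the chunk size. The sketch below is illustrative arithmetic only; chunk_indices is a hypothetical helper, and it is an assumption that the package splits files into consecutive groups this way.

```r
# Illustrative only: split file indices into consecutive chunks of a given size.
chunk_indices <- function(n_files, files_per_chunk) {
  stopifnot(n_files > 0, files_per_chunk > 0)
  idx <- seq_len(n_files)
  split(idx, ceiling(idx / files_per_chunk))
}

chunks <- chunk_indices(2167, 5)
length(chunks)          # 434, the chunk count reported in the log
lengths(chunks)[[434]]  # 2 files remain for the last chunk
```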

Of course I've tried with method='client', but then I get the following error:

Downloading data for project TCGA-LUAD
trying URL 'https://gdc.cancer.gov/system/files/authenticated%20user/0/gdc-client_v1.3.0_Ubuntu14.04_x64.zip'
Content type 'application/zip' length 24647158 bytes (23.5 MB)
==================================================
downloaded 23.5 MB

GDCdownload will download: 214.444281679 GB
Executing GDC client with the following command:
./gdc-client download -m gdc_manifest.txt
Traceback (most recent call last):
  File "gdc-client", line 7, in <module>

  File "/tmp/pip-build-TioGfr/PyInstaller/PyInstaller/loader/pyimod03_importers.py", line 389, in load_module
  File "build/bdist.linux-x86_64/egg/gdc_client/upload/__init__.py", line 1, in <module>
  File "/tmp/pip-build-TioGfr/PyInstaller/PyInstaller/loader/pyimod03_importers.py", line 389, in load_module
  File "build/bdist.linux-x86_64/egg/gdc_client/upload/parser.py", line 12, in <module>
  File "/tmp/pip-build-TioGfr/PyInstaller/PyInstaller/loader/pyimod03_importers.py", line 389, in load_module
  File "build/bdist.linux-x86_64/egg/gdc_client/upload/client.py", line 6, in <module>
  File "/tmp/pip-build-TioGfr/PyInstaller/PyInstaller/loader/pyimod03_importers.py", line 546, in load_module
ImportError: /usr/lib64/libc.so.6: version `GLIBC_2.18' not found (required by /tmp/_MEI2BYIr4/libstdc++.so.6)
Failed to execute script gdc-client
Error in move(i, file.path(path, i)) : 
  I could not find the file: a11a4f1e-3b68-4c0e-8259-7ab630658a7c

Any hints on this issue?

tiagochst commented 6 years ago

Please, could you send your sessionInfo() and the query and download code you are using?

jacorvar commented 6 years ago

Sure. Query and download code:

p <- 'TCGA-LUAD'
clin.query <- GDCquery(project=p, data.category='Clinical', legacy = TRUE)
GDCdownload(clin.query, method = "client")

sessionInfo():

R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /mnt/beegfs/soft/R/R-3.4.2/lib64/R/lib/libRblas.so
LAPACK: /mnt/beegfs/soft/R/R-3.4.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] TCGAbiolinks_2.7.19

loaded via a namespace (and not attached):
  [1] colorspace_1.3-2            selectr_0.3-1              
  [3] rjson_0.2.15                hwriter_1.3.2              
  [5] class_7.3-14                modeltools_0.2-21          
  [7] mclust_5.4                  circlize_0.4.3             
  [9] XVector_0.16.0              GenomicRanges_1.28.6       
 [11] GlobalOptions_0.0.12        ggpubr_0.1.6               
 [13] matlab_1.0.2                ggrepel_0.7.0              
 [15] flexmix_2.3-14              bit64_0.9-7                
 [17] AnnotationDbi_1.38.2        mvtnorm_1.0-6              
 [19] xml2_1.1.1                  codetools_0.2-15           
 [21] splines_3.4.2               R.methodsS3_1.7.1          
 [23] mnormt_1.5-5                doParallel_1.0.11          
 [25] DESeq_1.28.0                robustbase_0.92-8          
 [27] knitr_1.18                  geneplotter_1.54.0         
 [29] jsonlite_1.5                Rsamtools_1.28.0           
 [31] km.ci_0.5-2                 broom_0.4.3                
 [33] annotate_1.54.0             cluster_2.0.6              
 [35] kernlab_0.9-25              R.oo_1.21.0                
 [37] readr_1.1.1                 compiler_3.4.2             
 [39] httr_1.3.1                  assertthat_0.2.0           
 [41] Matrix_1.2-12               lazyeval_0.2.1             
 [43] limma_3.32.10               tools_3.4.2                
 [45] bindrcpp_0.2                gtable_0.2.0               
 [47] glue_1.2.0                  GenomeInfoDbData_0.99.0    
 [49] reshape2_1.4.2              dplyr_0.7.4                
 [51] ggthemes_3.4.0              ShortRead_1.34.2           
 [53] Rcpp_0.12.14                Biobase_2.36.2             
 [55] trimcluster_0.1-2           Biostrings_2.44.2          
 [57] nlme_3.1-131                rtracklayer_1.36.6         
 [59] iterators_1.0.8             fpc_2.1-10                 
 [61] psych_1.7.8                 stringr_1.2.0              
 [63] rvest_0.3.2                 XML_3.98-1.9               
 [65] dendextend_1.6.0            edgeR_3.18.1               
 [67] DEoptimR_1.0-8              zoo_1.8-0                  
 [69] zlibbioc_1.22.0             MASS_7.3-47                
 [71] scales_0.5.0                aroma.light_3.6.0          
 [73] hms_0.4.0                   parallel_3.4.2             
 [75] SummarizedExperiment_1.6.5  RColorBrewer_1.1-2         
 [77] curl_3.0                    ComplexHeatmap_1.14.0      
 [79] memoise_1.1.0               gridExtra_2.3              
 [81] KMsurv_0.1-5                ggplot2_2.2.1              
 [83] downloader_0.4              biomaRt_2.32.1             
 [85] latticeExtra_0.6-28         stringi_1.1.6              
 [87] RSQLite_2.0                 genefilter_1.58.1          
 [89] S4Vectors_0.14.7            foreach_1.4.3              
 [91] GenomicFeatures_1.28.5      BiocGenerics_0.22.1        
 [93] BiocParallel_1.10.1         shape_1.4.3                
 [95] GenomeInfoDb_1.12.3         rlang_0.1.6                
 [97] pkgconfig_2.0.1             prabclus_2.2-6             
 [99] matrixStats_0.52.2          bitops_1.0-6               
[101] lattice_0.20-35             purrr_0.2.4                
[103] bindr_0.1                   cmprsk_2.2-7               
[105] GenomicAlignments_1.12.2    bit_1.1-12                 
[107] plyr_1.8.4                  magrittr_1.5               
[109] R6_2.2.2                    IRanges_2.10.5             
[111] DelayedArray_0.2.7          DBI_0.7                    
[113] mgcv_1.8-22                 foreign_0.8-69             
[115] pillar_1.1.0                whisker_0.3-2              
[117] survival_2.41-3             RCurl_1.95-4.8             
[119] nnet_7.3-12                 tibble_1.4.2               
[121] EDASeq_2.10.0               survMisc_0.5.4             
[123] viridis_0.4.0               GetoptLong_0.1.6           
[125] locfit_1.5-9.1              grid_3.4.2                 
[127] sva_3.24.4                  data.table_1.10.4-3        
[129] blob_1.1.0                  ConsensusClusterPlus_1.40.0
[131] digest_0.6.14               diptest_0.75-7             
[133] xtable_1.8-2                tidyr_0.7.2                
[135] R.utils_2.6.0               stats4_3.4.2               
[137] munsell_0.4.3               viridisLite_0.2.0          
[139] survminer_0.4.1

jacorvar commented 6 years ago

@tiagochst did you have a glance at the code to check what is failing?

tiagochst commented 6 years ago

Sorry, are you trying to download all clinical data?

The problem is with the execution of the GDC client program. It says: /usr/lib64/libc.so.6: version `GLIBC_2.18' not found

If you run ldd --version in the terminal, you can check the installed version.

Your code is working for me.

jacorvar commented 6 years ago

Hi @tiagochst, it seems I have a lower version:

$ ldd --version
ldd (GNU libc) 2.17
Copyright (C) 2012 Free Software Foundation, Inc.

However, I'm running it on an HPC cluster, and I don't know whether the admin would agree to update that library, since other apps could fail. Can you think of a solution other than upgrading glibc system-wide? By the way, I have CentOS 7.

Thanks for the reply.
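
The version comparison can also be done from R. This is a minimal sketch: parse_glibc_version is a hypothetical helper, and the required version 2.18 is taken from the ImportError message earlier in this thread.

```r
# Extract the first version number from the first line of `ldd --version` output.
parse_glibc_version <- function(ldd_lines) {
  first <- ldd_lines[1]
  m <- regmatches(first, regexpr("[0-9]+(\\.[0-9]+)+", first))
  if (length(m) == 0) stop("could not find a version number in: ", first)
  numeric_version(m)
}

# Example using the output quoted above.
installed <- parse_glibc_version("ldd (GNU libc) 2.17")
required  <- numeric_version("2.18")  # from the ImportError message
installed >= required                 # FALSE: 2.17 is older than 2.18

# On a live system (assumes `ldd` is on the PATH):
# parse_glibc_version(system2("ldd", "--version", stdout = TRUE))
```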

jacorvar commented 6 years ago

Hi @tiagochst ,

do you think it's possible to change the code of gdc-client so that it's also compatible with glibc < 2.18?

tiagochst commented 6 years ago

You'll have to ask at: https://github.com/NCI-GDC/gdc-client

jacorvar commented 6 years ago

I've downloaded the gdc-client script for CentOS, and it works fine. The problem I see is that TCGAbiolinks downloads the version for Ubuntu regardless of the Linux distribution you have. Can this be changed?

jacorvar commented 6 years ago

By the way, when I download the files with gdc-client, it does not organize them according to their "data type". As a result, I do not see GDCdata/TCGA-LUAD/legacy/Clinical/Tissue_slide_image/folder_with_svs_file, but GDCdata/TCGA-LUAD/legacy/Clinical/folder_with_svs_file/. I therefore have to change the function GDCprepare_clinic so that it leaves out the "data type" directory when reading the files. The following:

files <- file.path(query$project, source, gsub(" ", "_", 
        query$results[[1]]$data_category), gsub(" ", "_", query$results[[1]]$data_type), 
        gsub(" ", "_", query$results[[1]]$file_id), gsub(" ", 
            "_", query$results[[1]]$file_name))

would be converted to:

files <- file.path(query$project, source, gsub(" ", "_", 
        query$results[[1]]$data_category),
        gsub(" ", "_", query$results[[1]]$file_id), gsub(" ", 
            "_", query$results[[1]]$file_name))

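The two variants above differ only in the data-type component, so they can be folded into one helper with an optional argument. This is a sketch under assumptions: build_paths and the example values are hypothetical and are not taken from the TCGAbiolinks source.

```r
# Build file paths; the data-type directory is included only when supplied.
build_paths <- function(project, source, data_category, file_id, file_name,
                        data_type = NULL) {
  clean <- function(x) gsub(" ", "_", x)
  parts <- list(project, source, clean(data_category))
  if (!is.null(data_type)) parts <- c(parts, list(clean(data_type)))
  parts <- c(parts, list(clean(file_id), clean(file_name)))
  do.call(file.path, parts)
}

# Layout produced by gdc-client as described above (no data-type directory).
build_paths("TCGA-LUAD", "legacy", "Clinical", "id1", "slide.svs")

# Layout with the data-type directory.
build_paths("TCGA-LUAD", "legacy", "Clinical", "id1", "slide.svs",
            data_type = "Tissue slide image")
```
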
tiagochst commented 6 years ago

I added code to handle CentOS. For the code above, you should download one data type at a time. I'll add a check that there is only one data type before the download.

p <- 'TCGA-LUAD'
for(dtype in c("Clinical Supplement","Pathology report","Clinical data","Tissue slide image")){
  clin.query <- GDCquery(project=p, 
                         data.category='Clinical',
                         data.type = dtype, 
                         legacy = TRUE, 
                         barcode = "TCGA-78-8660")
  GDCdownload(clin.query, method = "client")
}

jacorvar commented 6 years ago

Hi @tiagochst, I've reinstalled TCGAbiolinks (devtools::install_github(repo = "BioinformaticsFMRP/TCGAbiolinks")) but it still downloads the wrong gdc-client: gdc-client_v1.3.0_Ubuntu14.04_x64.zip. Am I missing something?

By the way, I have inspected the function GDCdownload, and R somehow does not recognise the function GDCclientInstall. Thanks.

tiagochst commented 6 years ago

There might be a bug.

What does the command Sys.info() return?

jacorvar commented 6 years ago

@tiagochst

> Sys.info()
                              sysname                               release 
                              "Linux"               "3.10.0-514.el7.x86_64" 
                              version                              nodename 
"#1 SMP Tue Nov 22 16:42:41 UTC 2016"                              "nodo00" 
                              machine                                 login 
                             "x86_64"                                "root" 
                                 user                        effective_user 
                               "root"                                "root"

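The Sys.info() output above reports the kernel but not the distribution name. One possible way to tell CentOS from Ubuntu is the ID field of /etc/os-release. This is a sketch only: read_os_id is a hypothetical helper, and it is not confirmed to be how TCGAbiolinks chooses the client build.

```r
# Parse the ID field from os-release style lines (KEY=value, optionally quoted).
read_os_id <- function(lines) {
  id_line <- grep("^ID=", lines, value = TRUE)
  if (length(id_line) == 0) return(NA_character_)
  gsub('^ID=|"', "", id_line[1])
}

# Example with lines in the format CentOS 7 uses.
sample_lines <- c('NAME="CentOS Linux"', 'VERSION="7 (Core)"', 'ID="centos"')
read_os_id(sample_lines)

# On a live system (assumes the file exists):
# read_os_id(readLines("/etc/os-release"))
```
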
jacorvar commented 6 years ago

Besides this bug, I found another one. When parsing the clinical information, not all the files downloaded via GDCdownload are in XML format; some of them are txt files. Therefore, using the parseXML function (https://github.com/BioinformaticsFMRP/TCGAbiolinks/blob/32df036da5d0430668fb7508a891c39527a42f3c/R/clinical.R#L285) to read all files stored in the files variable (all of this inside the function GDCprepare_clinic) produces the following error:

> clin <- parseXML(files, xpath, clinical.info)
  |                                                                      |   0%Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : 
  Start tag expected, '<' not found [4]
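
A possible workaround, sketched under assumptions, is to keep only the .xml files before they reach the parser. keep_xml and the file names below are hypothetical; the sketch filters by file extension and does not inspect file contents.

```r
# Keep only files whose name ends in .xml (case-insensitive).
keep_xml <- function(files) {
  files[grepl("\\.xml$", files, ignore.case = TRUE)]
}

# Hypothetical file names for illustration.
files <- c("a/clinical.TCGA-78-8660.xml",
           "b/clinical_patient_luad.txt",
           "c/report.XML")
keep_xml(files)
```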

BioinformaticsFMRP / TCGAbiolinks

Fail to download WGS bam files #162