BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

GDCdownload fails with "tar: unrecognized archive format" #221

Open benostendorf opened 6 years ago

benostendorf commented 6 years ago

Hi Tiago, my previously working script to download RNAseq expression data from GDC fails with the error message "tar: unrecognized archive format". I tried downloading with method set to client, but in this case the download stalls at a random point without giving an error. Can you reproduce this error or is it something on my end? Thank you! Benjamin


library(TCGAbiolinks)

project_name <- "TCGA-BRCA"
query <- GDCquery(
  project = project_name,
  data.category = "Gene expression",
  data.type = "Gene expression quantification",
  platform = "Illumina HiSeq",
  experimental.strategy = "RNA-Seq",
  sample.type = c("Primary solid Tumor"),
  file.type = "results",
  legacy = TRUE
)

GDCdownload(query, directory = "data/expression_data")

data <- GDCprepare(
  query,
  save = TRUE,
  save.filename = NGS_raw,
  remove.files.prepared = FALSE,
  directory = "data/expression_data"
)

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] TCGAbiolinks_2.8.0

loaded via a namespace (and not attached):
[1] colorspace_1.3-2 selectr_0.4-1 rjson_0.2.19
[4] hwriter_1.3.2 circlize_0.4.3 XVector_0.20.0
[7] GenomicRanges_1.32.3 GlobalOptions_0.0.13 ggpubr_0.1.6
[10] matlab_1.0.2 ggrepel_0.8.0 bit64_0.9-7
[13] AnnotationDbi_1.42.1 xml2_1.2.0 codetools_0.2-15
[16] splines_3.5.0 R.methodsS3_1.7.1 mnormt_1.5-5
[19] doParallel_1.0.11 DESeq_1.32.0 geneplotter_1.58.0
[22] knitr_1.20 jsonlite_1.5 Rsamtools_1.32.0
[25] km.ci_0.5-2 broom_0.4.4 annotate_1.58.0
[28] cluster_2.0.7-1 R.oo_1.22.0 readr_1.1.1
[31] compiler_3.5.0 httr_1.3.1 assertthat_0.2.0
[34] Matrix_1.2-14 lazyeval_0.2.1 limma_3.36.1
[37] prettyunits_1.0.2 tools_3.5.0 bindrcpp_0.2.2
[40] gtable_0.2.0 glue_1.2.0 GenomeInfoDbData_1.1.0
[43] reshape2_1.4.3 dplyr_0.7.5 ggthemes_3.5.0
[46] ShortRead_1.38.0 Rcpp_0.12.17 Biobase_2.40.0
[49] Biostrings_2.48.0 nlme_3.1-137 rtracklayer_1.40.2
[52] iterators_1.0.9 psych_1.8.4 stringr_1.3.1
[55] rvest_0.3.2 XML_3.98-1.11 edgeR_3.22.1
[58] zoo_1.8-1 zlibbioc_1.26.0 scales_0.5.0
[61] aroma.light_3.10.0 hms_0.4.2 parallel_3.5.0
[64] SummarizedExperiment_1.10.1 RColorBrewer_1.1-2 curl_3.2
[67] ComplexHeatmap_1.18.0 yaml_2.1.19 memoise_1.1.0
[70] gridExtra_2.3 KMsurv_0.1-5 ggplot2_2.2.1
[73] downloader_0.4 biomaRt_2.36.0 latticeExtra_0.6-28
[76] stringi_1.2.2 RSQLite_2.1.1 genefilter_1.62.0
[79] S4Vectors_0.18.2 foreach_1.4.4 GenomicFeatures_1.32.0
[82] BiocGenerics_0.26.0 BiocParallel_1.14.1 shape_1.4.4
[85] GenomeInfoDb_1.16.0 rlang_0.2.0 pkgconfig_2.0.1
[88] matrixStats_0.53.1 bitops_1.0-6 lattice_0.20-35
[91] purrr_0.2.4 bindr_0.1.1 GenomicAlignments_1.16.0
[94] cmprsk_2.2-7 bit_1.1-13 tidyselect_0.2.4
[97] plyr_1.8.4 magrittr_1.5 R6_2.2.2
[100] IRanges_2.14.10 DelayedArray_0.6.0 DBI_1.0.0
[103] mgcv_1.8-23 pillar_1.2.2 foreign_0.8-70
[106] survival_2.42-3 RCurl_1.95-4.10 tibble_1.4.2
[109] EDASeq_2.14.0 survMisc_0.5.4 GetoptLong_0.1.6
[112] progress_1.1.2 locfit_1.5-9.1 grid_3.5.0
[115] sva_3.28.0 data.table_1.11.2 blob_1.1.1
[118] ConsensusClusterPlus_1.44.0 digest_0.6.15 xtable_1.8-2
[121] tidyr_0.8.1 R.utils_2.6.0 stats4_3.5.0
[124] munsell_0.4.3 survminer_0.4.2

tiagochst commented 6 years ago

I'm trying to debug it, but it might be a bug in the GDC legacy API. For the moment I'll add a workaround that I believe might make it work.

Could you install the development version from GitHub? devtools::install_github("BioinformaticsFMRP/TCGAbiolinks")

Also, please try the files.per.chunk argument. The code is trying to download 1 GB from the GDC API at once, so there is a high chance it will fail. It is better to make smaller requests.

GDCdownload(query, directory = "data/expression_data", files.per.chunk = 50)
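The chunked download above can also be wrapped in an explicit retry loop. The following is a minimal sketch in base R, not something TCGAbiolinks provides: with_retry, max_attempts and wait_seconds are hypothetical names, and the sketch assumes the download failure surfaces as an R error rather than only as a printed message.

```r
# Hypothetical helper (not part of TCGAbiolinks): call `fn` until it succeeds
# or `max_attempts` is reached, pausing `wait_seconds` between attempts.
with_retry <- function(fn, max_attempts = 3, wait_seconds = 5) {
  for (attempt in seq_len(max_attempts)) {
    # Capture either the return value or the error of this attempt.
    result <- tryCatch(
      list(ok = TRUE, value = fn()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) {
      return(invisible(result$value))
    }
    message(sprintf("Attempt %d/%d failed: %s",
                    attempt, max_attempts, conditionMessage(result$error)))
    if (attempt < max_attempts) {
      Sys.sleep(wait_seconds)
    }
  }
  # Every attempt failed: re-raise with the last error message.
  stop("All ", max_attempts, " attempts failed: ",
       conditionMessage(result$error))
}

# Usage with the query from this thread (needs TCGAbiolinks and network access):
# with_retry(function() {
#   GDCdownload(query, directory = "data/expression_data", files.per.chunk = 50)
# })
```

Combining small chunks with a short pause between attempts keeps each request small and gives a transient server-side failure time to clear.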

benostendorf commented 6 years ago

Thanks a lot for the immediate response Tiago - with the dev version it seems to work fine now (each chunk download fails at first with the message

ERROR accessing GDC. Trying again...

but then works upon retry).

tiagochst commented 6 years ago

Great! That's really weird. I'll send an email to GDC. There might be a problem in the legacy API.

tiagochst commented 6 years ago

It might somehow be an R problem: the Python code is working quite well with the legacy API.

BirongZhang commented 3 years ago

Hi all,

I have the same problem. I am trying to use GDCquery and GDCdownload to download some pathology images from the TCGA-BRCA project. I am using TCGAbiolinks_2.19.0 (the latest version from GitHub) and R_4.0.3. Yesterday I was able to download some .tar.gz files, but I couldn't open them. Today I cannot download any files at all.

Below is my code and corresponding output:

[Three screenshots: "Screenshot 2021-06-25 at 19 22 08", "Screenshot 2021-06-25 at 19 22 23", "Screenshot 2021-06-25 at 19 43 06"]

Any advice would be greatly appreciated! Thanks!
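When a downloaded .tar.gz file cannot be opened, one quick check is whether it is a gzip file at all. The following is a minimal sketch in base R under an assumption this thread does not confirm: that the failing download saved something other than an archive (for example an error response) under the archive's name. is_gzip_file is a hypothetical helper, not a TCGAbiolinks function.

```r
# Hypothetical helper: TRUE if the file starts with the gzip magic bytes
# 0x1f 0x8b, FALSE otherwise (including for an empty file).
is_gzip_file <- function(path) {
  # raw = TRUE makes sure R reads the bytes as stored on disk.
  con <- file(path, open = "rb", raw = TRUE)
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 2)
  identical(magic, as.raw(c(0x1f, 0x8b)))
}

# Self-contained demonstration with two temporary files.
good <- tempfile(fileext = ".gz")
gz <- gzfile(good, "wb")
writeBin(charToRaw("hello"), gz)
close(gz)

bad <- tempfile(fileext = ".tar.gz")
writeLines('{"error": "example response body"}', bad)

is_gzip_file(good)
is_gzip_file(bad)
```

If the check fails for a downloaded file, opening it in a text editor may show what was returned instead of the archive.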

bridget2617 commented 1 year ago

> It might somehow be an R problem: the Python code is working quite well with the legacy API.

I'm having this issue as well and using "devtools::install_github("BioinformaticsFMRP/TCGAbiolinks")" did not fix the issue. Is there any other possible fix?