BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

Download stopping due to tar-file issue #534

Open ChrisAi89 opened 2 years ago

ChrisAi89 commented 2 years ago

Dear all,

I am trying to download the tissue slides for the TCGA-READ and TCGA-COAD projects, but the download stops with the error /bin/tar: This does not look like a tar archive. To me it looks like a problem with tar and gzip, yet the same code works fine for other projects. I am using api as the download method; with client, no data is shown for the TCGA-READ project. Is there a reason for that?

Thanks and all the best!

The output and the session info follow.

Output:
Downloading data for project TCGA-READ
Of the 530 files for download 72 already exist.
We will download only those that are missing ones.
GDCdownload will download 458 files. A total of 121.391636198 GB
The total size of files is big. We will download files in chunks
Downloading chunk 1 of 153 (3 files, size = 890.805743 MB) as Sat_Aug_27_09_12_54_2022_0.tar.gz
|======================================================================| 100%
/bin/tar: This does not look like a tar archive

gzip: stdin: not in gzip format
/bin/tar: Child returned status 1
/bin/tar: Error is not recoverable: exiting now
Download completed
At least one of the chunks download was not correct. We will retry
Downloading chunk 1 of 153 (3 files, size = 890.805743 MB) as Sat_Aug_27_09_12_54_2022_0.tar.gz
|======================================================================| 100%
/bin/tar: This does not look like a tar archive

gzip: stdin: not in gzip format
/bin/tar: Child returned status 1
/bin/tar: Error is not recoverable: exiting now
Download completed
Error in if (ret == 1) break : argument is of length zero
Calls: GDCdownload ... tryCatchList -> tryCatchOne -> -> GDCdownload.by.chunk
Execution halted
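
The message "gzip: stdin: not in gzip format" suggests that the downloaded chunk is not a gzip archive at all. One quick way to check a file on disk is to look at its first two bytes, which are 0x1f 0x8b for gzip data. The following is only a minimal sketch: is_gzip_file is a hypothetical helper (not part of TCGAbiolinks), and the two temporary files are stand-ins for a real downloaded chunk.

```r
# Hypothetical helper: TRUE if the file starts with the gzip magic bytes 0x1f 0x8b.
is_gzip_file <- function(path) {
  header <- readBin(path, what = "raw", n = 2L)
  length(header) == 2L && identical(header, as.raw(c(0x1f, 0x8b)))
}

# A real gzip file, written through a gzfile() connection.
good <- tempfile(fileext = ".gz")
gz <- gzfile(good, "wb")
writeLines("example", gz)
close(gz)

# A plain-text file that merely carries a .tar.gz extension.
bad <- tempfile(fileext = ".tar.gz")
writeLines("this is not an archive", bad)

is_gzip_file(good)  # TRUE
is_gzip_file(bad)   # FALSE
```

If such a check returns FALSE for the .tar.gz chunk, opening the file in a text editor may show what was actually received.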

Session Info:
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
[1] C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] TCGAbiolinks_2.24.3

loaded via a namespace (and not attached):
[1] Rcpp_1.0.9 lattice_0.20-45
[3] tidyr_1.2.0 prettyunits_1.1.1
[5] png_0.1-7 Biostrings_2.64.1
[7] assertthat_0.2.1 digest_0.6.29
[9] utf8_1.2.2 BiocFileCache_2.4.0
[11] plyr_1.8.7 R6_2.5.1
[13] GenomeInfoDb_1.32.3 stats4_4.2.1
[15] RSQLite_2.2.16 httr_1.4.4
[17] ggplot2_3.3.6 pillar_1.8.1
[19] zlibbioc_1.42.0 rlang_1.0.4
[21] progress_1.2.2 curl_4.3.2
[23] data.table_1.14.2 blob_1.2.3
[25] S4Vectors_0.34.0 Matrix_1.4-1
[27] downloader_0.4 readr_2.1.2
[29] stringr_1.4.1 RCurl_1.98-1.8
[31] bit_4.0.4 biomaRt_2.52.0
[33] munsell_0.5.0 DelayedArray_0.22.0
[35] xfun_0.32 compiler_4.2.1
[37] pkgconfig_2.0.3 BiocGenerics_0.42.0
[39] tidyselect_1.1.2 KEGGREST_1.36.3
[41] SummarizedExperiment_1.26.1 tibble_3.1.8
[43] GenomeInfoDbData_1.2.8 IRanges_2.30.1
[45] matrixStats_0.62.0 XML_3.99-0.10
[47] fansi_1.0.3 crayon_1.5.1
[49] dplyr_1.0.9 tzdb_0.3.0
[51] dbplyr_2.2.1 rappdirs_0.3.3
[53] bitops_1.0-7 grid_4.2.1
[55] jsonlite_1.8.0 gtable_0.3.0
[57] lifecycle_1.0.1 DBI_1.1.3
[59] magrittr_2.0.3 scales_1.2.1
[61] cli_3.3.0 TCGAbiolinksGUI.data_1.16.0
[63] stringi_1.7.8 cachem_1.0.6
[65] XVector_0.36.0 xml2_1.3.3
[67] filelock_1.0.2 ellipsis_0.3.2
[69] generics_0.1.3 vctrs_0.4.1
[71] tools_4.2.1 bit64_4.0.5
[73] Biobase_2.56.0 glue_1.6.2
[75] purrr_0.3.4 hms_1.1.2
[77] MatrixGenerics_1.8.1 fastmap_1.1.0
[79] AnnotationDbi_1.58.0 colorspace_2.0-3
[81] GenomicRanges_1.48.0 rvest_1.0.3
[83] memoise_2.0.1 knitr_1.40

tiagochst commented 2 years ago

@ChrisAi89 Could you please post the query code?

ChrisAi89 commented 2 years ago

@tiagochst Thanks for the quick reply. Here is my code:

rm(list = ls())
gc()

library("TCGAbiolinks")

################################################################################
# Get Histoslides for COAD for HG38 harmonized data
################################################################################
setwd("/mnt/project-data-0/")

projects <- getGDCprojects()
projects <- projects[,1, drop = TRUE]

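for (i in projects) {
  cat(i, "\n")
  # Wrap the query in try() so a failing query does not stop the loop
  query_histo_slides <- try(
    GDCquery(
      project = i,
      data.category = "Biospecimen",
      data.type = "Slide Image"
    ),
    silent = TRUE
  )
  # inherits() is safer than class(x) == "try-error", because class() can return several values
  if (inherits(query_histo_slides, "try-error")) {
    cat(i, " - No Slide Images\n")
    next
  }
  GDCdownload(query = query_histo_slides, method = "api")
  rm(query_histo_slides)
}
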
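
In the loop above only the query is protected by try(), so an error raised during the download itself (such as the "argument is of length zero" failure in the output) halts the whole script. A minimal sketch of one way to keep going and record which projects failed is shown below. Here download_all and stub_download are hypothetical names used for illustration only; the stub stands in for the real query and download calls.

```r
# Hypothetical helper: call download_fun for each project and record the
# outcome, instead of letting the first error stop the whole loop.
download_all <- function(projects, download_fun) {
  status <- character(length(projects))
  names(status) <- projects
  for (p in projects) {
    status[[p]] <- tryCatch(
      {
        download_fun(p)
        "ok"
      },
      error = function(e) paste("failed:", conditionMessage(e))
    )
  }
  status
}

# Stub downloader for illustration only; it fails for one project.
stub_download <- function(p) {
  if (p == "TCGA-READ") stop("This does not look like a tar archive")
  invisible(TRUE)
}

status <- download_all(c("TCGA-BRCA", "TCGA-READ", "TCGA-LUAD"), stub_download)
print(status)
```

With TCGAbiolinks, download_fun could be a function that runs the GDCquery and GDCdownload calls from the script for a single project. This does not repair the broken chunk; it only lets the remaining projects proceed.
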
ChrisAi89 commented 2 years ago

Hi @tiagochst,

did you manage to reproduce the behavior I described? I would just like to know whether I need to find a different approach.

All the best, Chris

tiagochst commented 2 years ago

Yes, I was able to reproduce it, but I will need to contact GDC. Here is the problematic file: https://portal.gdc.cancer.gov/files/9412af8e-3c00-44de-a29b-b4801b65ca42
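
Until the file is fixed upstream, one possible workaround is to leave that single file out of the list of files to download. The sketch below only shows the filtering step on a toy table. drop_files is a hypothetical helper, and the file_id column name is an assumption about how the results table is laid out; whether a filtered table can be passed on to GDCdownload is not confirmed in this thread.

```r
# UUID of the file reported as problematic in this thread.
problem_id <- "9412af8e-3c00-44de-a29b-b4801b65ca42"

# Hypothetical helper: drop rows whose file_id matches any excluded UUID.
# The file_id column name is an assumption, not a confirmed part of the API.
drop_files <- function(results, exclude_ids) {
  results[!results$file_id %in% exclude_ids, , drop = FALSE]
}

# Toy table standing in for the results of a query.
toy_results <- data.frame(
  file_id = c("00000000-0000-0000-0000-000000000001", problem_id),
  file_name = c("slide_a.svs", "slide_b.svs"),
  stringsAsFactors = FALSE
)

filtered <- drop_files(toy_results, problem_id)
nrow(filtered)  # 1
```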
