Bioconductor / Rhtslib

HTSlib high-throughput sequencing library as an R package
https://bioconductor.org/packages/Rhtslib

Unexpected EOF in archive, when installing via renv #37

Open fdekievit opened 2 weeks ago

fdekievit commented 2 weeks ago

Dear Authors,

I'm encountering an error when trying to install Rhtslib (and other packages) via Docker on a Mac (M3). I raised this issue on the renv repo, but others have reported that it also occurs for them without renv, so it might be specific to Apple machines (this still needs to be confirmed). Issue link on renv: https://github.com/rstudio/renv/issues/1957

In short, when trying to install via Docker, I get the following error:

/usr/bin/tar: Unexpected EOF in archive
/usr/bin/tar: Error is not recoverable: exiting now
/usr/bin/tar xf '/root/.cache/R/renv/source/repository/Rhtslib/Rhtslib_3.0.0.tar.gz' -C '/tmp/RtmpWMgcFb/renv-description-77e9f8ada' 'Rhtslib/DESCRIPTION'

To avoid duplication, please see the issue linked above.

Can you tell me whether this is an Rhtslib issue or an renv issue?

Edit: It was suggested that this might be Mac related, so I've tried forcing the Dockerfile to use the linux/amd64 platform instead. However, both I and another user who replied use Macs.

kevinushey commented 2 weeks ago

For me, the following was a reproducible example:

options(repos = BiocManager::repositories())
#> 'getOption("repos")' replaces Bioconductor standard repositories, see
#> 'help("repositories", package = "BiocManager")' for details.
#> Replacement repositories:
#>    CRAN: https://p3m.dev/cran/__linux__/jammy/latest
dl <- download.packages("Rhtslib", destdir = getwd())
#> trying URL 'https://bioconductor.org/packages/3.19/container-binaries/bioconductor_docker/src/contrib/Rhtslib_3.0.0_R_x86_64-pc-linux-gnu.tar.gz'
#> Content type 'application/gzip' length 7166712 bytes (6.8 MB)
#> ==================================================
#> downloaded 6.8 MB

system2("/usr/bin/tar", c("xf", dl[1, 2], "-C", getwd(), "Rhtslib/DESCRIPTION"))
#> /usr/bin/tar: Unexpected EOF in archive
#> /usr/bin/tar: Error is not recoverable: exiting now

Or, perhaps more concretely:

url <- "https://bioconductor.org/packages/3.19/container-binaries/bioconductor_docker/src/contrib/Rhtslib_3.0.0_R_x86_64-pc-linux-gnu.tar.gz"
download.file(url, destfile = basename(url))
system2("/usr/bin/tar", c("xf", basename(url), "-C", getwd(), "Rhtslib/DESCRIPTION"))

For what it's worth, it seems only the binary package is affected; the source package doesn't run into this same issue.

hpages commented 2 weeks ago

@almahmoud Any thoughts on this?

almahmoud commented 2 weeks ago

Sorry for the delay in responding; this is a weird one. It's not an issue with the binaries or with platform incompatibility. (I originally thought it was due to using the binaries in an arm container, but saw in the other issue that the platform was specified, so an emulator would be used even on an M3 chip.) It then seemed to be due to the fact that the URLs, when they point to precompiled binaries for the container, go through a redirect from the Bioc site's Apache server to the buckets where the binaries are hosted. Following that redirect by adding extra="-L" fixes the issue with @kevinushey's example.

E.g., not working:

> download.file(url, destfile = basename(url), method="curl")
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   424  100   424    0     0    638      0 --:--:-- --:--:-- --:--:--   642
> system2("/usr/bin/tar", c("-xvf", basename(url)))
/usr/bin/tar: This does not look like a tar archive

gzip: stdin: not in gzip format
/usr/bin/tar: Child returned status 1
/usr/bin/tar: Error is not recoverable: exiting now

E.g., working:

> download.file(url, destfile = basename(url), method="curl", extra="-L")
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   424  100   424    0     0    931      0 --:--:-- --:--:-- --:--:--   950
100 6998k  100 6998k    0     0  4204k      0  0:00:01  0:00:01 --:--:-- 5992k
> system2("/usr/bin/tar", c("-xvf", basename(url)))
Rhtslib/DESCRIPTION
Rhtslib/INDEX
Rhtslib/Meta/
Rhtslib/Meta/Rd.rds
Rhtslib/Meta/features.rds
Rhtslib/Meta/hsearch.rds
Rhtslib/Meta/links.rds
[...]
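
One quick way to tell whether a download stopped at the redirect is to check for the gzip magic bytes (1f 8b) at the start of the file; the 424-byte file in the failing example is presumably the redirect response rather than the archive. A minimal shell sketch, where the helper name is_gzip is made up for illustration:

```shell
# is_gzip FILE: succeed if FILE starts with the gzip magic bytes 1f 8b.
# (Hypothetical helper, not part of R, renv, or Bioconductor.)
is_gzip() {
    # od prints the first two bytes in hex; tr strips spaces and newlines
    magic=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}
```

For example, `is_gzip Rhtslib_3.0.0_R_x86_64-pc-linux-gnu.tar.gz || echo "not a gzip file"`. This only catches a non-gzip payload; a truncated but otherwise valid gzip stream still passes.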

However, there seems to be another issue with how renv is doing it: even after downloading the correct binary by following the redirect, the tar command still errors out when a specific file is requested for extraction. Yet I can manually untar the whole archive from the cache, so the issue seems to be something else here. I have no idea why untarring the whole thing works but extracting a single file does not.

> renv::install('bioc::Rhtslib')

[...]

https://rstudio.github.io/renv/.
Do you want to proceed? [y/N]: y

- "~/.cache/R/renv" has been created.
# Downloading packages -------------------------------------------------------
- Downloading Rhtslib from BioCcontainers ...   OK [6.8 Mb in 1.8s]
/usr/bin/tar: Unexpected EOF in archive
/usr/bin/tar: Error is not recoverable: exiting now
/usr/bin/tar xf '/root/.cache/R/renv/source/repository/Rhtslib/Rhtslib_3.0.0.tar.gz' -C '/tmp/RtmpjrkuRI/renv-description-f72ac72e2' 'Rhtslib/DESCRIPTION'
================================================================================

/usr/bin/tar: Unexpected EOF in archive
/usr/bin/tar: Error is not recoverable: exiting now

Error: error decompressing archive [error code 2]
Traceback (most recent calls last):
18: renv::install("bioc::Rhtslib")
17: retrieve(packages)
16: handler(package, renv_retrieve_impl(package))
15: renv_retrieve_impl(package)
14: renv_retrieve_bioconductor(record)
13: renv_retrieve_repos(record)
12: renv_retrieve_repos_impl(record)
11: renv_retrieve_package(record, url, path)
10: renv_retrieve_successful(record, path)
 9: renv_description_read(path, subdir = subdir)
 8: filebacked(context = "renv_description_read", path = path, callback = renv_description_read_impl,
        subdir = subdir, ...)
 7: callback(path, ...)
 6: renv_archive_decompress(path, files = file, exdir = exdir)
 5: renv_archive_decompress_tar(archive, files = files, exdir = exdir,
        ...)
 4: renv_tar_decompress(tar, archive = archive, files = files, exdir = exdir,
        ...)
 3: renv_system_exec(tar, args, action = "decompressing archive")
 2: abort(sprintf("error %s [error code %i]", action, status), body = renv_system_exec_details(command,
        args, output))
 1: stop(fallback)

> system2("/usr/bin/tar", c("-xvf", "/root/.cache/R/renv/source/repository/Rhtslib/Rhtslib_3.0.0.tar.gz"))
Rhtslib/DESCRIPTION
Rhtslib/INDEX
Rhtslib/Meta/
Rhtslib/Meta/Rd.rds
Rhtslib/Meta/features.rds
Rhtslib/Meta/hsearch.rds
Rhtslib/Meta/links.rds
Rhtslib/Meta/nsInfo.rds
[...]

Will look more into it tomorrow...

hpages commented 1 week ago

The binary tarball is corrupted, as shown by @kevinushey's last "more concretely" example. To be even more concrete, I can reproduce this from the Unix shell with:

hpages@XPS15:~$ wget https://bioconductor.org/packages/3.19/container-binaries/bioconductor_docker/src/contrib/Rhtslib_3.0.0_R_x86_64-pc-linux-gnu.tar.gz

hpages@XPS15:~$ tar ztf Rhtslib_3.0.0_R_x86_64-pc-linux-gnu.tar.gz 
Rhtslib/DESCRIPTION
Rhtslib/INDEX
...
Rhtslib/testdata/xx.fa.fai
Rhtslib/usrlib/
Rhtslib/usrlib/libhts.a
Rhtslib/usrlib/libhts.so
Rhtslib/usrlib/libhts.so.2
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

The interesting part here is that the listing produced by tar ztf is truncated right after Rhtslib/usrlib/libhts.so.2, which is a symlink to Rhtslib/usrlib/libhts.so.

At the root of the problem is a bug in utils::tar(), which seems to produce a corrupted tarball when the directory contains symlinks, at least on Linux. For example, on my Ubuntu 23.10 laptop, assuming Rhtslib is already installed:

dir_with_symlinks <- system.file(package="Rhtslib", "usrlib")

system2("ls", c("-l", dir_with_symlinks))
# total 11660
# -rw-rw-r-- 1 hpages hpages 7627652 Sep  2 10:33 libhts.a
# -rwxrwxr-x 1 hpages hpages 4305168 Sep  2 10:33 libhts.so
# lrwxrwxrwx 1 hpages hpages       9 Sep  2 10:33 libhts.so.2 -> libhts.so

utils::tar("test.tar.gz", dir_with_symlinks, compression="gzip", tar="")

system2("/usr/bin/tar", c("ztf", "test.tar.gz"))
# /usr/bin/tar: Removing leading `/' from member names
# /home/hpages/R/R-4.4.0/site-library/Rhtslib/usrlib/libhts.a
# /home/hpages/R/R-4.4.0/site-library/Rhtslib/usrlib/libhts.so
# /home/hpages/R/R-4.4.0/site-library/Rhtslib/usrlib/libhts.so.2
# /usr/bin/tar: Unexpected EOF in archive
# /usr/bin/tar: Error is not recoverable: exiting now

The problem is that utils::tar() is used by the R CMD INSTALL --build path/to/pkg/source/tarball command on Unix at the end of the installation sequence to tar up whatever ended up in the installation folder. See https://github.com/wch/r-source/blob/e6285ef6acfdbc7b4cebbbdf4727e7196133e3c3/src/library/tools/R/install.R#L436-L438

This would need to be reported to the R core team. In the meantime, an easy workaround is to set the environment variable R_INSTALL_TAR to /usr/bin/tar, or to whatever the path to the tar command is on the machine where R CMD INSTALL --build path/to/pkg/source/tarball is run. This will force utils::tar() to use that instead of its broken internal implementation.
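
A sketch of that workaround from the shell, assuming a POSIX shell with tar on the PATH. PKG_SRC is a hypothetical variable holding the path to the package source tarball; it is not something R or the build system defines:

```shell
# Point R's install machinery at the system tar instead of the internal
# implementation in utils::tar().
R_INSTALL_TAR=$(command -v tar)
export R_INSTALL_TAR

# PKG_SRC is a placeholder for the source tarball path. The build only
# runs when R is available and the file actually exists.
if command -v R >/dev/null 2>&1 && [ -f "${PKG_SRC:-}" ]; then
    R CMD INSTALL --build "$PKG_SRC"
fi
```

Exporting the variable rather than setting it inline keeps it in effect for any later R CMD INSTALL --build calls in the same session.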

@almahmoud Can we set R_INSTALL_TAR on the machines where those package binaries are produced? Then regenerate the binaries. Thanks

hpages commented 1 week ago

@almahmoud I don't think many R package source tarballs contain symlinks. I'm not sure it's even a good idea to produce R package source tarballs with symlinks, but that's kind of an orthogonal story. All this to say that maybe Rhtslib is the only binary that needs to be regenerated, in which case I could just bump its version in release and devel after you've set R_INSTALL_TAR on the binary builders.
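
To get a rough idea of which installed packages could be affected, one could look for symlinks under the library directory. A minimal sketch; list_symlinks is a hypothetical helper name:

```shell
# list_symlinks DIR: print every symbolic link found under DIR.
list_symlinks() {
    find "$1" -type l
}
```

Run against the site library from my earlier example (/home/hpages/R/R-4.4.0/site-library), the output should include Rhtslib/usrlib/libhts.so.2.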

Also note that a simple way to programmatically detect corrupted binaries is with:

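# Placeholder name; substitute the actual binary tarball
bin_tarball=pkgname_X.Y.Z_R_x86_64-pc-linux-gnu.tar.gz
# Listing the archive exits with a non-zero status if it is truncated
if ! tar ztf "$bin_tarball" >/dev/null; then
    echo "ERROR: $bin_tarball is corrupted"
fi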

Could this check be added to the script that generates these binaries, if it's not too hard to do? It would give us an idea of how many packages are currently affected and help us avoid generating corrupted binaries in the future.
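
Applied to a whole directory of binaries, the check could look roughly like this. The function name and the layout (all *.tar.gz files sitting directly inside one directory) are assumptions, not a description of how the actual build script is organized:

```shell
# find_corrupted_tarballs DIR: print each *.tar.gz in DIR that tar cannot
# list to the end; return 1 if at least one was found, 0 otherwise.
find_corrupted_tarballs() {
    found=0
    for f in "$1"/*.tar.gz; do
        # Skip the unexpanded pattern when DIR has no matching files
        [ -e "$f" ] || continue
        if ! tar ztf "$f" >/dev/null 2>&1; then
            echo "$f"
            found=1
        fi
    done
    return "$found"
}
```

Piping the output through wc -l gives the number of affected binaries, and the return status makes it easy to fail a build step when anything is corrupted.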

kevinushey commented 1 week ago

Thanks for the investigation @hpages -- I was able to put together a reproducible example using the discussion here as inspiration, and filed an issue for R Core at https://bugs.r-project.org/show_bug.cgi?id=18790.

hpages commented 1 week ago

Awesome! Thanks for doing that @kevinushey