Bioconductor / Rsamtools

Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import
https://bioconductor.org/packages/Rsamtools

Bug in scanTabix #8

Open timoast opened 5 years ago

timoast commented 5 years ago

Hi,

I came across a bug in scanTabix where no data is returned when requesting regions on double-digit chromosomes (i.e., chr10 and above). This only appears to be an issue on Windows, and only when the tabix file is above a certain size.

Here is a tabix file and index that will reproduce the issue. Apologies for the huge file; I tried downsampling, but the bug only seems to occur with the larger file.

Reproducible example:

library(Rsamtools)
library(GenomicRanges)
library(IRanges)

tbx.file <- "fragments.tsv.gz"
range.chr14 <- GRanges(seqnames = 'chr14', ranges = IRanges(start = 99635624, end = 99737861))
tbx <- TabixFile(file = tbx.file)
scanTabix(file = tbx, param = range.chr14)

This code will return data on macOS or Linux but an empty vector on Windows (I tested on Windows 7 with R 3.6.1 and the current version of Rsamtools).

bschilder commented 2 years ago

Was this resolved? I'm wondering if perhaps some of the other errors I'm experiencing are related to this (will post those soon).

mtmorgan commented 2 years ago

I think this is likely an integer overflow on Windows; I wonder whether this occurs under the 64-bit build, especially under R-devel? It seems to be a regression introduced when we moved to using Rhtslib. That transition is now quite old, so the right thing to do is to update Rhtslib, and then Rsamtools. Unfortunately, that is likely to be a moderate-to-big project. In the short to intermediate term, the solution is likely to be to use Linux or macOS, e.g., via the Windows Subsystem for Linux, your local compute cluster, or a cloud provider.
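
To make the overflow hypothesis concrete, here is a minimal base-R sketch. It assumes, and the thread does not confirm this, that the value overflowing is a byte offset into the tabix file held in a signed 32-bit integer. The helpers `fits_int32` and `wrap_int32` and the example offsets are hypothetical and only illustrate the arithmetic; they are not part of Rsamtools or Rhtslib.

```r
# R's integer type is a signed 32-bit value, so its maximum marks the
# largest offset a signed 32-bit C integer could hold.
int32_max <- .Machine$integer.max   # 2147483647, i.e. 2^31 - 1

# Hypothetical helper: does a byte offset fit in a signed 32-bit integer?
fits_int32 <- function(offset) offset <= int32_max

# Hypothetical helper: the value that results if an offset is truncated to
# 32 bits with two's-complement wrap-around (what typically happens in
# practice when a larger value is squeezed into a 32-bit signed integer).
wrap_int32 <- function(offset) ((offset + 2^31) %% 2^32) - 2^31

# Made-up offsets, stored as doubles so that R itself does not overflow:
# one below and one above the 2 GiB boundary.
offsets <- c(below_2GiB = 1.5e9, above_2GiB = 3.2e9)

print(fits_int32(offsets))
print(wrap_int32(offsets))
```

Under this assumption, an offset past 2^31 - 1 bytes would wrap to a negative number, so a seek to it could fail or land in the wrong place and return no records. That would be consistent with the bug only showing up in larger files, but it remains a guess until someone checks the Rhtslib code on Windows.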

bschilder commented 2 years ago

Thanks for the reply @mtmorgan, that's quite understandable.

Along those lines, an intermediate solution might be to use the Bioconductor Docker container, which is Linux-based and includes an RStudio interface. We use this as a base for most of our Docker containers.

hpages commented 2 years ago

Reminds me of this Rhtslib Windows-specific bug from 2.5 years ago: https://support.bioconductor.org/p/124568/

Yes, Rhtslib still contains HTSlib 1.7, which lags 4 years behind the latest HTSlib (version 1.15). The right thing to do at this point would be to update Rhtslib. Hopefully that Windows-specific Tabix bug is gone in HTSlib 1.15. However, as Martin said, this is a major endeavor. Not before BioC 3.16.

H.