Bioconductor / IRanges

Foundation of integer range manipulation in Bioconductor
https://bioconductor.org/packages/IRanges
21 stars 10 forks source link

Some methods expect no more than 2^31 elements #43

Closed ssayols closed 2 years ago

ssayols commented 2 years ago

Hi Michael, Herve, I'm trying to fetch the [human] genome-wide coverage of a CompressedRleList using a combination of coverage() from the GenomicRanges package and [ from IRanges (well probably [ is from S4Vectors but extended in IRanges?). I realized this is not possible, as it seems that some functions in IRanges expect the number of elements to fit in a 32-bit integer.

I'm most likely abusing the original use-case which these functions were designed for. Still, I wanted to ask whether this limitation with 32-bit integers is something known and intentionally meant to be there.

Here's a minimal example (get the average coverage of 1M bp bins):

library(GenomicRanges)

# get the average coverage of 1M bp bins
bins <- tileGenome(seqinfo(BSgenome.Hsapiens.UCSC.hg38::Hsapiens), tilewidth=1e6, cut.last.tile=TRUE)
bins <- keepStandardChromosomes(bins, pruning.mode="coarse")
x <- GRanges("chr1:1")   # a phony signal track
seqinfo(x) <- seqinfo(bins)

bincov <- coverage(x)[bins]    # <-- this will raise an error
bin.avg.cov <- lapply(bincov, function(x) mean(decode(x)))

This seems to work fine with organisms with shorter genomes (< 2^31 bp, like Scerevisiase).

The methods in question which expect 32 bit integers are bindROWS() from the CompressedList class (produce an integer overflow in a cumsum() call), and PartitioningByEnd() which internally calls .numeric2end() and does implicit conversions to integer type.

I'm currently using Bioconductor 3.14 through Bioconductor's official Docker image:

R> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C              LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  utils     datasets  grDevices methods   base     

other attached packages:
[1] GenomicRanges_1.46.0 GenomeInfoDb_1.30.0  IRanges_2.28.0       S4Vectors_0.32.0     BiocGenerics_0.40.0 

loaded via a namespace (and not attached):
 [1] BSgenome.Hsapiens.UCSC.hg38_1.4.4 rstudioapi_0.13                   XVector_0.34.0                   
 [4] GenomicAlignments_1.30.0          zlibbioc_1.40.0                   BiocParallel_1.28.0              
 [7] lattice_0.20-45                   BSgenome_1.62.0                   rjson_0.2.20                     
[10] tools_4.1.1                       grid_4.1.1                        SummarizedExperiment_1.24.0      
[13] parallel_4.1.1                    Biobase_2.54.0                    matrixStats_0.61.0               
[16] yaml_2.2.1                        crayon_1.4.1                      BiocIO_1.4.0                     
[19] Matrix_1.3-4                      GenomeInfoDbData_1.2.7            rtracklayer_1.54.0               
[22] restfulr_0.0.13                   bitops_1.0-7                      RCurl_1.98-1.5                   
[25] DelayedArray_0.20.0               compiler_4.1.1                    MatrixGenerics_1.6.0             
[28] Biostrings_2.62.0                 Rsamtools_2.10.0                  XML_3.99-0.8            

Many thanks for your feedback.

hpages commented 2 years ago

Hi,

It's not just IRanges or GRanges objects. Long Vector derivatives are not supported at the moment. More precisely: the length of any Vector derivative must be <= .Machine$int.max (i.e. <= 2^31 - 1). This includes CharacterList, IntegerList, IRanges, GRanges, GPos, GRangesList, DNAString, DNAStringSet, SummarizedExperiment, and many more Vector derivatives.

BTW the error you get with your minimal example above has nothing to do with this:

library(BSgenome.Hsapiens.UCSC.hg38)
bins <- tileGenome(seqinfo(BSgenome.Hsapiens.UCSC.hg38::Hsapiens), tilewidth=1e6, cut.last.tile=TRUE)
bins <- keepStandardChromosomes(bins, pruning.mode="coarse")
x <- GRanges("chr1:1")
seqinfo(x) <- seqinfo(bins)
x[bins]
# Error in .subset_by_GenomicRanges(x, i) : 
#   'x' must have names when subsetting by a GenomicRanges subscript

I'm not sure what you're trying to achieve here, that code doesn't make much sense. But I would hope that the error message is quite clear about what the problem is.

FWIW here is a simple example that reveals the .Machine$int.max limitation for Vector derivatives:

hg38chromranges <- GRanges(seqinfo(BSgenome.Hsapiens.UCSC.hg38))
hg38chromranges
# GRanges object with 640 ranges and 0 metadata columns:
#                                    seqnames      ranges strand
#                                      <Rle>   <IRanges>  <Rle>
#                   chr1                 chr1 1-248956422      *
#                   chr2                 chr2 1-242193529      *
#                   chr3                 chr3 1-198295559      *
#                   chr4                 chr4 1-190214555      *
#                   chr5                 chr5 1-181538259      *
#                    ...                  ...         ...    ...
#   chr22_KN196486v1_alt chr22_KN196486v1_alt    1-153027      *
#   chr22_KQ458387v1_alt chr22_KQ458387v1_alt    1-155930      *
#   chr22_KQ458388v1_alt chr22_KQ458388v1_alt    1-174749      *
#   chr22_KQ759761v1_alt chr22_KQ759761v1_alt    1-145162      *
#    chrX_KV766199v1_alt  chrX_KV766199v1_alt    1-188004      *
#   -------
#   seqinfo: 640 sequences (1 circular) from hg38 genome

GPos(hg38chromranges[1:3])
# StitchedGPos object with 689445510 positions and 0 metadata columns:
#               seqnames       pos strand
#                  <Rle> <integer>  <Rle>
#           [1]     chr1         1      *
#           [2]     chr1         2      *
#           [3]     chr1         3      *
#           [4]     chr1         4      *
#           [5]     chr1         5      *
#           ...      ...       ...    ...
#   [689445506]     chr3 198295555      *
#   [689445507]     chr3 198295556      *
#   [689445508]     chr3 198295557      *
#   [689445509]     chr3 198295558      *
#   [689445510]     chr3 198295559      *
#   -------
#   seqinfo: 640 sequences (1 circular) from hg38 genome

GPos(hg38chromranges)
# Error in new_StitchedIPos(pos) : too many positions in 'pos'

A typical way to work around this limitation is to work at the chromosome level instead of whole genome level, and to combine the results. In my experience this is usually not too hard to do and it generally works quite well. It can also improve performance which is not surprising since this is what happens in general when breaking the data into small chunks.

Hope this helps, H.

ssayols commented 2 years ago

Hi Herve, Sorry there was a mistake in the minimal example, exactly in the key line throwing the error. I edited it accordingly, hope now it's clear what I meant. But in any case your answer is still valid and means currently there's no way to have Vectors larger than 2^31.

Many thanks for the feedback! S.

hpages commented 2 years ago

The error we get with your edited minimal example is a lot more interesting:

coverage(x)[bins]
# Error in .numeric2end(x, NG) : 
#   when 'x' is an integer vector, it cannot contain NAs or negative values
# In addition: Warning message:
# In bindROWS(x, list(...), ignore.mcols = ignore.mcols) :
#   integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'

This should actually work and return an RleList object parallel to bins i.e. with one coverage vector per bin in bins. Since your number of bins is only 3103, coverage(x)[bins] will be an RleList of length 3103. Far from being a long Vector derivative! The problem comes from the way this subsetting operation is implemented. I'll work on this.

BTW, if you're only interested in getting the overall coverage for each bin rather than the full coverage vector, you can use binnedAverage():

x <- GRanges(c("chr1:1-1000", "chr1:11-1010", "chr1:1000001-1500000", "chrM:1-16569"), seqinfo=seqinfo(bins))
x
# GRanges object with 4 ranges and 0 metadata columns:
#       seqnames          ranges strand
#          <Rle>       <IRanges>  <Rle>
#   [1]     chr1          1-1000      *
#   [2]     chr1         11-1010      *
#   [3]     chr1 1000001-1500000      *
#   [4]     chrM         1-16569      *
#   -------
#   seqinfo: 25 sequences (1 circular) from hg38 genome

binnedAverage(bins, coverage(x), "average_coverage")
# GRanges object with 3103 ranges and 1 metadata column:
#          seqnames            ranges strand | average_coverage
#             <Rle>         <IRanges>  <Rle> |        <numeric>
#      [1]     chr1         1-1000000      * |            0.002
#      [2]     chr1   1000001-2000000      * |            0.500
#      [3]     chr1   2000001-3000000      * |            0.000
#      [4]     chr1   3000001-4000000      * |            0.000
#      [5]     chr1   4000001-5000000      * |            0.000
#      ...      ...               ...    ... .              ...
#   [3099]     chrY 54000001-55000000      * |                0
#   [3100]     chrY 55000001-56000000      * |                0
#   [3101]     chrY 56000001-57000000      * |                0
#   [3102]     chrY 57000001-57227415      * |                0
#   [3103]     chrM           1-16569      * |                1
#   -------
#   seqinfo: 25 sequences (1 circular) from hg38 genome

See ?binnedAverage for more information.

H.

ssayols commented 2 years ago

sweet! I didn't know of this binnedAverage() function. Many thanks! Actually I just made up this minimal example with the average, but it's actually not what I'm trying to solve. But it's very useful in any case: looking at the code of this function gives me an idea on how I could implement more elaborated grouping operations (other than the mean).

The problem comes from the way this subsetting operation is implemented.

Regarding the initial issue, it also happens when, for instance, dividing the regions into 2 set and concatenating the results later on:

bincov_1 <- coverage(x)[bins[1:2000]]               # <-- this works
bincov_2 <- coverage(x)[bins[2001:length(bins)]]    # <-- this works
c(bincov_1, bincov_2)     # again, this will raise the same error

Both subsetting and concatenating break at the same point: IRanges::bindROWS() (the overflow in cumsum()) and a bit later in .numeric2end() (implicit conversions to integer type).

Many thanks for looking into this. S.

hpages commented 2 years ago

c(bincov_1, bincov_2) # again, this will raise the same error

Yes, at the root of the problem is that coverage(x)[bins[1:2000]] and coverage(x)[bins[2001:length(bins)]] both returned an object of class CompressedRleList instead of SimpleRleList:

x_cov <- coverage(x)
class(x_cov)  # SimpleRleList

bincov_1 <- x_cov[bins[1:2000]]
bincov_2 <- x_cov[bins[2001:length(bins)]]
class(bincov_1)  # CompressedRleList
class(bincov_2)  # CompressedRleList

RleList is a virtual class with 2 concrete subclasses CompressedRleList and SimpleRleList, sometimes called flavors. Each flavor uses a different internal representation, and each representation has its own pros and cons. For example, some operations like unlist() are a lot faster on the CompressedRleList representation than on the SimpleRleList representation. In fact, in the case of unlist(), it's almost instantaneous. The cost is virtually zero whatever the size of the object! However, the CompressedRleList representation has a big drawback: it can't be used if sum(lengths(x)) is >= 2^31.

Typically an operation like subsetting should preserve the class (endomorphism). However, in the case of subsetting by a GRanges object, the exact class is not preserved. This was a conscious choice because it allows this subsetting operation to be implemented in a very efficient way, much faster than if it had returned a SimpleRleList object. But of course this doesn't work if the object to return is a CompressedRleList object such that sum(lengths(x)) is >= 2^31. This needs to be addressed and that's why I created issue #44.

Note that if you turn bincov_1 and bincov_2 into SimpleRleList objects again, then the problem goes away:

bincov_1 <- as(bincov_1, "SimpleRleList")
bincov_2 <- as(bincov_2, "SimpleRleList")
c(bincov_1, bincov_2)  # works!

H.

ssayols commented 2 years ago

Awesome! Thanks for the explanation!

RNieuwenhuis commented 1 year ago

Dear @hpages ,

Have there been any updates in the support of long vectors for the packages you mentioned in your first comment of this thread? I was always under the impression this was now supported and was surprised when I ran into this limitation.