Closed ssayols closed 2 years ago
Hi,
It's not just IRanges or GRanges objects. Long Vector derivatives are not supported at the moment. More precisely: the length of any Vector derivative must be <= .Machine$int.max
(i.e. <= 2^31 - 1
). This includes CharacterList, IntegerList, IRanges, GRanges, GPos, GRangesList, DNAString, DNAStringSet, SummarizedExperiment, and many more Vector derivatives.
BTW the error you get with your minimal example above has nothing to do with this:
library(BSgenome.Hsapiens.UCSC.hg38)
bins <- tileGenome(seqinfo(BSgenome.Hsapiens.UCSC.hg38::Hsapiens), tilewidth=1e6, cut.last.tile=TRUE)
bins <- keepStandardChromosomes(bins, pruning.mode="coarse")
x <- GRanges("chr1:1")
seqinfo(x) <- seqinfo(bins)
x[bins]
# Error in .subset_by_GenomicRanges(x, i) :
# 'x' must have names when subsetting by a GenomicRanges subscript
I'm not sure what you're trying to achieve here, that code doesn't make much sense. But I would hope that the error message is quite clear about what the problem is.
FWIW here is a simple example that reveals the .Machine$int.max
limitation for Vector derivatives:
hg38chromranges <- GRanges(seqinfo(BSgenome.Hsapiens.UCSC.hg38))
hg38chromranges
# GRanges object with 640 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# chr1 chr1 1-248956422 *
# chr2 chr2 1-242193529 *
# chr3 chr3 1-198295559 *
# chr4 chr4 1-190214555 *
# chr5 chr5 1-181538259 *
# ... ... ... ...
# chr22_KN196486v1_alt chr22_KN196486v1_alt 1-153027 *
# chr22_KQ458387v1_alt chr22_KQ458387v1_alt 1-155930 *
# chr22_KQ458388v1_alt chr22_KQ458388v1_alt 1-174749 *
# chr22_KQ759761v1_alt chr22_KQ759761v1_alt 1-145162 *
# chrX_KV766199v1_alt chrX_KV766199v1_alt 1-188004 *
# -------
# seqinfo: 640 sequences (1 circular) from hg38 genome
GPos(hg38chromranges[1:3])
# StitchedGPos object with 689445510 positions and 0 metadata columns:
# seqnames pos strand
# <Rle> <integer> <Rle>
# [1] chr1 1 *
# [2] chr1 2 *
# [3] chr1 3 *
# [4] chr1 4 *
# [5] chr1 5 *
# ... ... ... ...
# [689445506] chr3 198295555 *
# [689445507] chr3 198295556 *
# [689445508] chr3 198295557 *
# [689445509] chr3 198295558 *
# [689445510] chr3 198295559 *
# -------
# seqinfo: 640 sequences (1 circular) from hg38 genome
GPos(hg38chromranges)
# Error in new_StitchedIPos(pos) : too many positions in 'pos'
A typical way to work around this limitation is to work at the chromosome level instead of whole genome level, and to combine the results. In my experience this is usually not too hard to do and it generally works quite well. It can also improve performance which is not surprising since this is what happens in general when breaking the data into small chunks.
Hope this helps, H.
Hi Herve, Sorry there was a mistake in the minimal example, exactly in the key line throwing the error. I edited it accordingly, hope now it's clear what I meant. But in any case your answer is still valid and means currently there's no way to have Vectors larger than 2^31.
Many thanks for the feedback! S.
The error we get with your edited minimal example is a lot more interesting:
coverage(x)[bins]
# Error in .numeric2end(x, NG) :
# when 'x' is an integer vector, it cannot contain NAs or negative values
# In addition: Warning message:
# In bindROWS(x, list(...), ignore.mcols = ignore.mcols) :
# integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'
This should actually work and return an RleList object parallel to bins
i.e. with one coverage vector per bin in bins
. Since your number of bins is only 3103, coverage(x)[bins]
will be an RleList of length 3103. Far from being a long Vector derivative! The problem comes from the way this subsetting operation is implemented. I'll work on this.
BTW, if you're only interested in getting the overall coverage for each bin rather than the full coverage vector, you can use binnedAverage()
:
x <- GRanges(c("chr1:1-1000", "chr1:11-1010", "chr1:1000001-1500000", "chrM:1-16569"), seqinfo=seqinfo(bins))
x
# GRanges object with 4 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 1-1000 *
# [2] chr1 11-1010 *
# [3] chr1 1000001-1500000 *
# [4] chrM 1-16569 *
# -------
# seqinfo: 25 sequences (1 circular) from hg38 genome
binnedAverage(bins, coverage(x), "average_coverage")
# GRanges object with 3103 ranges and 1 metadata column:
# seqnames ranges strand | average_coverage
# <Rle> <IRanges> <Rle> | <numeric>
# [1] chr1 1-1000000 * | 0.002
# [2] chr1 1000001-2000000 * | 0.500
# [3] chr1 2000001-3000000 * | 0.000
# [4] chr1 3000001-4000000 * | 0.000
# [5] chr1 4000001-5000000 * | 0.000
# ... ... ... ... . ...
# [3099] chrY 54000001-55000000 * | 0
# [3100] chrY 55000001-56000000 * | 0
# [3101] chrY 56000001-57000000 * | 0
# [3102] chrY 57000001-57227415 * | 0
# [3103] chrM 1-16569 * | 1
# -------
# seqinfo: 25 sequences (1 circular) from hg38 genome
See ?binnedAverage
for more information.
H.
sweet! I didn't know of this binnedAverage()
function. Many thanks! Actually I just made up this minimal example with the average, but it's actually not what I'm trying to solve. But it's very useful in any case: looking at the code of this function gives me an idea on how I could implement more elaborated grouping operations (other than the mean).
The problem comes from the way this subsetting operation is implemented.
Regarding the initial issue, it also happens when, for instance, dividing the regions into 2 set and concatenating the results later on:
bincov_1 <- coverage(x)[bins[1:2000]] # <-- this works
bincov_2 <- coverage(x)[bins[2001:length(bins)]] # <-- this works
c(bincov_1, bincov_2) # again, this will raise the same error
Both subsetting and concatenating break at the same point: IRanges::bindROWS()
(the overflow in cumsum()
) and a bit later in .numeric2end()
(implicit conversions to integer type).
Many thanks for looking into this. S.
c(bincov_1, bincov_2) # again, this will raise the same error
Yes, at the root of the problem is that coverage(x)[bins[1:2000]]
and coverage(x)[bins[2001:length(bins)]]
both returned an object of class CompressedRleList instead of SimpleRleList:
x_cov <- coverage(x)
class(x_cov) # SimpleRleList
bincov_1 <- x_cov[bins[1:2000]]
bincov_2 <- x_cov[bins[2001:length(bins)]]
class(bincov_1) # CompressedRleList
class(bincov_2) # CompressedRleList
RleList is a virtual class with 2 concrete subclasses CompressedRleList and SimpleRleList, sometimes called flavors. Each flavor uses a different internal representation, and each representation has its own pros and cons. For example, some operations like unlist()
are a lot faster on the CompressedRleList representation than on the SimpleRleList representation. In fact, in the case of unlist()
, it's almost instantaneous. The cost is virtually zero whatever the size of the object! However, the CompressedRleList representation has a big drawback: it can't be used if sum(lengths(x))
is >= 2^31.
Typically an operation like subsetting should preserve the class (endomorphism). However, in the case of subsetting by a GRanges object, the exact class is not preserved. This was a conscious choice because it allows this subsetting operation to be implemented in a very efficient way, much faster than if it had returned a SimpleRleList object. But of course this doesn't work if the object to return is a CompressedRleList object such that sum(lengths(x))
is >= 2^31. This needs to be addressed and that's why I created issue #44.
Note that if you turn bincov_1
and bincov_2
into SimpleRleList objects again, then the problem goes away:
bincov_1 <- as(bincov_1, "SimpleRleList")
bincov_2 <- as(bincov_2, "SimpleRleList")
c(bincov_1, bincov_2) # works!
H.
Awesome! Thanks for the explanation!
Dear @hpages ,
Have there been any updates in the support of long vectors for the packages you mentioned in your first comment of this thread? I was always under the impression this was now supported and was surprised when I ran into this limitation.
Hi Michael, Herve, I'm trying to fetch the [human] genome-wide coverage of a
CompressedRleList
using a combination ofcoverage()
from theGenomicRanges
package and[
fromIRanges
(well probably[
is from S4Vectors but extended in IRanges?). I realized this is not possible, as it seems that some functions inIRanges
expect the number of elements to fit in a 32-bit integer.I'm most likely abusing the original use-case which these functions were designed for. Still, I wanted to ask whether this limitation with 32-bit integers is something known and intentionally meant to be there.
Here's a minimal example (get the average coverage of 1M bp bins):
This seems to work fine with organisms with shorter genomes (< 2^31 bp, like Scerevisiase).
The methods in question which expect 32 bit integers are
bindROWS()
from theCompressedList
class (produce an integer overflow in acumsum()
call), andPartitioningByEnd()
which internally calls.numeric2end()
and does implicit conversions to integer type.I'm currently using Bioconductor 3.14 through Bioconductor's official Docker image:
Many thanks for your feedback.