Bioconductor / GenomicRanges

Representation and manipulation of genomic intervals
https://bioconductor.org/packages/GenomicRanges
41 stars 17 forks source link

RleList subsetting has integer overflow on mammal genomes #79

Open alexfriman opened 11 months ago

alexfriman commented 11 months ago

Problem: Subsetting an RleList by GRanges object results in integer overflow error, but there is a workaround.

Expected behavior: Code produces an RleList in cases A, B, C.

Observed behavior: Code throws error in cases A and B. Code produces an RleList in case C.

Produced error:

In bindROWS(x, list(...), ignore.mcols = ignore.mcols) : integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'

Required libraries: GenomicRanges , BSgenome, BSgenome.Mmusculus.UCSC.mm10

The test case code:

#minimal set to reproduce integer overflow: 
#We generate randomly distributed GRanges across the mouse genome. 
#Take coverage and subset the coverage by the genome itself

library(GenomicRanges)
library(BSgenome)
#loding the mouse genome
Mouse10BSgenome <- keepStandardChromosomes(GenomeInfoDb::seqinfo(getBSgenome("BSgenome.Mmusculus.UCSC.mm10")))
#removing the circular chromosomes
Mouse10BSgenome <- GenomeInfoDb::dropSeqlevels(x = Mouse10BSgenome, 
                              value = names(which(isCircular(Mouse10BSgenome))))
#convering to GRanges 
Mouse10GR<-GRanges(Mouse10BSgenome)
numberOfChromosomes<-length(seqnames(Mouse10GR))

#generating small Genomic Ranges randomly distrubuted across the mouse genome
numberOfSmallRanges=200
lengthOfSmallRanges=200

#number of smallRanges at each chromosome (randomly distributed across the genome)
rangesPerChr <- as.list(rmultinom(n = 1, size = numberOfSmallRanges, 
                              prob = rep(1/numberOfChromosomes, numberOfChromosomes)))
names(rangesPerChr)<-as.vector(seqnames(Mouse10GR))

#populating the chromosomes with small GRanges one by one
smallGR<-GRanges()
for (i in 1:numberOfChromosomes){
  chrName<-names(Mouse10BSgenome)[i]
  chrLen<-seqlengths(Mouse10BSgenome)[i]
  numRanges<-rangesPerChr[[i]]
  smallGR<-c(smallGR, GRanges(seqnames = rep(chrName, numRanges), 
                              ranges = IRanges(start = sample(1:chrLen, numRanges), width = lengthOfSmallRanges)))
}
#generating coverage
mycov<-coverage(smallGR)
#adding zeros to the coverage to reach the chromosome length plus a little bit more
for (i in 1:length(mycov)){
  chrname<-names(mycov)[i]
  targetLen<-seqlengths(Mouse10BSgenome)[i]
  currentLen<-length(mycov[[chrname]])
  deltaLen<-targetLen  - currentLen+2*lengthOfSmallRanges #add extra 2 lengthOfSmallRanges just to be sure
  mycov[[chrname]]<-c(mycov[[chrname]],rep(0,deltaLen))
}

#Case A: does not work
mycov[Mouse10GR]

#Case B: does not work
#splitting the genome into 2 subsets
Mouse10GRfirst<-Mouse10GR[1:13]
Mouse10GRsecond<-Mouse10GR[14:21]
covFirst<-mycov[Mouse10GRfirst]
covSecond<-mycov[Mouse10GRsecond]
c(covFirst, covSecond) #does not work

#Case C: works
c(coverage(GRanges()),covFirst, covSecond)  

I'm okay with using the workaround (Case C), but I need to understand why it works and why A,B do not work. Otherwise it would be wrong to use the workaround for manuscript preparation.

sessionInfo() R version 4.2.1 (2022-06-23) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.6.7

Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] BSgenome.Mmusculus.UCSC.mm10_1.4.3 BSgenome_1.64.0 rtracklayer_1.56.1 Biostrings_2.64.1
[5] XVector_0.36.0 GenomicRanges_1.48.0 GenomeInfoDb_1.32.4 IRanges_2.30.1
[9] S4Vectors_0.34.0 BiocGenerics_0.42.0

loaded via a namespace (and not attached): [1] rstudioapi_0.14 zlibbioc_1.42.0 GenomicAlignments_1.32.1 BiocParallel_1.30.4 lattice_0.20-45 rjson_0.2.21
[7] tools_4.2.1 grid_4.2.1 SummarizedExperiment_1.26.1 parallel_4.2.1 Biobase_2.56.0 matrixStats_0.63.0
[13] yaml_2.3.7 crayon_1.5.2 BiocIO_1.6.0 Matrix_1.5-3 GenomeInfoDbData_1.2.8 restfulr_0.0.15
[19] bitops_1.0-7 codetools_0.2-18 RCurl_1.98-1.12 DelayedArray_0.22.0 compiler_4.2.1 MatrixGenerics_1.8.1
[25] Rsamtools_2.12.0 XML_3.99-0.14

log2(.Machine$integer.max) [1] 31