Bioconductor / GenomicRanges

Representation and manipulation of genomic intervals
https://bioconductor.org/packages/GenomicRanges
41 stars 17 forks source link

Could `GenomicRanges::binnedAverage()` be reused for other purposes? #77

Open ssayols opened 1 year ago

ssayols commented 1 year ago

Hi Herve, this is more of a suggestion rather than a bug. Would it make sense to make the binnedAverage() function more general, in a way that it could compute more than just the mean? If I understand the code, it's relatively straightforward to call any function in IRanges::view*(). Something like this (please notice the new parameter at the end of the header. Defaults to IRanges::viewMeans to mimic the behavior of binnedAveage()):

binnedView <- function(bins, numvar, varname, na.rm=FALSE, fun=IRanges::viewMeans) {
    if (!is(bins, "GRanges")) 
        stop("'x' must be a GRanges object")
    if (!is(numvar, "RleList")) 
        stop("'numvar' must be an RleList object")
    if (!identical(seqlevels(bins), names(numvar))) 
        stop("'seqlevels(bin)' and 'names(numvar)' must be identical")
    fun2 <- function(v, na.rm = FALSE) {
        if (!isTRUEorFALSE(na.rm)) 
            stop("'na.rm' must be TRUE or FALSE")
        result <- fun(v, na.rm = na.rm)
        w0 <- width(v)
        v1 <- trim(v)
        w1 <- width(v1)
        if (na.rm) {
            na_count <- sum(is.na(v1))
            w0 <- w0 - na_count
            w1 <- w1 - na_count
        }
        result <- result * w1/w0
        result[w0 != 0L & w1 == 0L] <- 0
        result
    }
    bins_per_chrom <- split(ranges(bins), seqnames(bins))
    result_list <- lapply(names(numvar), function(seqname) {
        v <- Views(numvar[[seqname]], bins_per_chrom[[seqname]])
        fun2(v, na.rm = na.rm)
    })
    new_mcol <- unsplit(result_list, as.factor(seqnames(bins)))
    mcols(bins)[[varname]] <- new_mcol
    bins
}

This would enable other ways of aggregating signal in bins (eg. by setting fun=IRanges::viewSums).

cheers, Sergi

cgroza commented 5 months ago

I just tried to use this function to compute averages for 15 million bins. It was taking hours to complete. Meanwhile, my own solution with findOverlaps and dplyr did the same in 5 minutes. I don't know why it's inefficient but I wouldn't rely on it too much.

ssayols commented 5 months ago

I posted this function here without any warranty that works efficiently for all uses cases. Please keep in mind this is not part of GenomicRanges, nor am I a contributor of the package. Just a layman.

Nevertheless, GenomicRanges::binnedAverage() takes barely 2 seconds to compute the average signal for 15 million bins in my 6 years old laptop:

library(GenomicRanges)
bins <- unlist(tileGenome(seqinfo(BSgenome.Hsapiens.UCSC.hg38::Hsapiens), ntile=15e6))
signal <- GRanges("chr1:1")   # a phony signal track
seqinfo(signal) <- seqinfo(bins)
system.time({
  avg <- binnedAverage(bins, coverage(signal), "average_coverage")
})
   user  system elapsed 
  2.217   0.251   2.471 
ssayols commented 5 months ago

Btw I just tried the function above (binnedView()) to compute the average signal of 15M bins, and it also takes ~2 seconds to run:

system.time({
  avg <- binnedView(bins, coverage(signal), "average_coverage", fun=IRanges::viewMeans)
})

   user  system elapsed 
  2.213   0.124   2.338

Perhaps your data (or data structures) are innappropriate?

cgroza commented 5 months ago

I must be doing something wrong then. I am using an Rle object. Do I have to sort the data beforehand to achieve this performance.