Open ssayols opened 1 year ago
I just tried to use this function to compute averages for 15 million bins. It was taking hours to complete. Meanwhile, my own solution with findOverlaps and dplyr did the same in 5 minutes. I don't know why it's inefficient but I wouldn't rely on it too much.
I posted this function here without any warranty that it works efficiently for all use cases. Please keep in mind that it is not part of GenomicRanges, nor am I a contributor to the package. Just a layman.
Nevertheless, GenomicRanges::binnedAverage() takes barely 2 seconds to compute the average signal for 15 million bins on my 6-year-old laptop:
library(GenomicRanges)
bins <- unlist(tileGenome(seqinfo(BSgenome.Hsapiens.UCSC.hg38::Hsapiens), ntile=15e6))
signal <- GRanges("chr1:1") # a phony signal track
seqinfo(signal) <- seqinfo(bins)
system.time({
    avg <- binnedAverage(bins, coverage(signal), "average_coverage")
})
   user  system elapsed
  2.217   0.251   2.471
Btw I just tried the function above (binnedView()) to compute the average signal of 15M bins, and it also takes ~2 seconds to run:
system.time({
    avg <- binnedView(bins, coverage(signal), "average_coverage", fun=IRanges::viewMeans)
})
   user  system elapsed
  2.213   0.124   2.338
Perhaps your data (or data structures) are inappropriate?
I must be doing something wrong then. I am using an Rle object. Do I have to sort the data beforehand to achieve this performance?
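On the Rle point: as documented, binnedAverage() takes its signal as a named RleList whose names match seqlevels(bins), not as a bare Rle. Below is a minimal sketch of wrapping a single-chromosome Rle before binning. The object names (x, numvar, toy_bins) and the toy values are made up for illustration, and it says nothing about whether sorting matters for speed.

```r
library(GenomicRanges)

# Hypothetical signal for one chromosome, stored as a plain Rle:
# ten 0s, then five 1s, then five 3s (20 positions in total).
x <- Rle(c(0, 1, 3), c(10, 5, 5))

# Wrap it in a named RleList; the name must match the bins' seqlevels.
numvar <- RleList(chr1 = x)

# Two toy bins of width 10 covering positions 1-10 and 11-20.
toy_bins <- GRanges("chr1", IRanges(start = c(1, 11), width = 10))

# The same sanity checks binnedAverage() relies on.
stopifnot(is(numvar, "RleList"),
          identical(seqlevels(toy_bins), names(numvar)))

binned <- binnedAverage(toy_bins, numvar, "avg")
binned$avg
```

If either check in stopifnot() fails on the real data, that is probably the first thing to fix before looking at timing.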
Hi Herve, this is more of a suggestion than a bug. Would it make sense to make the binnedAverage() function more general, so that it could compute more than just the mean? If I understand the code, it's relatively straightforward to call any of the IRanges::view*() functions. Something like this (please notice the new parameter at the end of the header, which defaults to IRanges::viewMeans to mimic the behavior of binnedAverage()):
This would enable other ways of aggregating signal in bins (e.g. by setting fun=IRanges::viewSums).
cheers, Sergi
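The function body posted with this suggestion is not reproduced in the thread. For readers who want something concrete, here is a rough sketch of what such a generalization could look like, assuming the same argument layout as binnedAverage() plus the fun parameter described above. It is a reconstruction under those assumptions, not the original code, and it skips the special handling of bins that extend past the end of a chromosome; the toy objects are invented for illustration.

```r
library(GenomicRanges)

# Hypothetical sketch: like binnedAverage(), but the per-bin summary
# function is supplied by the caller through `fun`.
binnedView <- function(bins, numvar, varname, fun = IRanges::viewMeans, ...) {
    stopifnot(is(bins, "GRanges"),
              is(numvar, "RleList"),
              identical(seqlevels(bins), names(numvar)))
    # Group the bin ranges by chromosome, in seqlevels order.
    bins_per_chrom <- split(ranges(bins), seqnames(bins))
    # Build views on each chromosome's signal and summarize them.
    per_chrom <- lapply(names(numvar), function(seqname) {
        v <- Views(numvar[[seqname]], bins_per_chrom[[seqname]])
        fun(v, ...)
    })
    # Restore the original bin order and attach the result as a metadata column.
    mcols(bins)[[varname]] <- unsplit(per_chrom, as.factor(seqnames(bins)))
    bins
}

# Toy data: 20 positions on chr1, split into two bins of width 10.
toy_bins <- GRanges("chr1", IRanges(start = c(1, 11), width = 10))
toy_signal <- RleList(chr1 = Rle(c(0, 1, 3), c(10, 5, 5)))

sums  <- binnedView(toy_bins, toy_signal, "sum", fun = IRanges::viewSums)$sum
means <- binnedView(toy_bins, toy_signal, "mean")$mean
```

With the default fun, the result on in-bounds bins should agree with binnedAverage(); passing IRanges::viewSums, IRanges::viewMins or IRanges::viewMaxs switches the aggregation without touching the rest of the function.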