Open LTLA opened 5 years ago
Hmm, the findOverlaps(..., select="all")
+ selectHits()
strategy is certainly going to slow down things a lot in some situations. This is because the findOverlaps#GenomicRanges#GenomicRanges
method is highly optimized when select
is "first"
, "last"
, or "arbitrary"
. In these cases the method collects at most 1 hit per query (and stores it directly in the integer vector to return) rather than collect all the hits in a Hits object to later drop most of them. In addition the length of the integer vector is known in advance so the vector can be pre-allocated whereas in the select="all"
case the final size of the Hits object is not known in advance, which means that the object cannot be pre-allocated so has to be grown via re-allocations and copies:
library(GenomicRanges)
query <- GRanges("chr1", IRanges(1, 1:9500))
subject <- GRanges("chr1", IRanges(1:9500, 9500))
system.time(q2s <- findOverlaps(query, subject, select="arbitrary"))
# user system elapsed
# 0.029 0.000 0.031
system.time(hits <- findOverlaps(query, subject, select="all"))
# user system elapsed
# 2.948 0.407 3.355
The more number of hits per query (in average), the worse select="all"
will perform with respect to select="first"
, "last"
, or "arbitrary"
.
The select="arbitrary"
case is the workhorse behind overlapsAny()
, %over%
, and %within%
.
Well, I can't say it was easy, but select!="all"
optimizations are done. Note that the lack of special behaviour for a GRF subject
when select="arbitrary"
is deliberate; I'd have to unique
the indices anyway to ensure that the query
doesn't select a range that isn't used.
Nudge.
This should improve efficiency of the overlaps... which was the whole aim of this class in the first place.