Specialize findOverlaps for GRangesFactor objects

LTLA commented 5 years ago

This should improve efficiency of the overlaps... which was the whole aim of this class in the first place.

hpages commented 5 years ago

Hmm, the findOverlaps(..., select="all") + selectHits() strategy is certainly going to slow down things a lot in some situations. This is because the findOverlaps#GenomicRanges#GenomicRanges method is highly optimized when select is "first", "last", or "arbitrary". In these cases the method collects at most 1 hit per query (and stores it directly in the integer vector to return) rather than collect all the hits in a Hits object to later drop most of them. In addition the length of the integer vector is known in advance so the vector can be pre-allocated whereas in the select="all" case the final size of the Hits object is not known in advance, which means that the object cannot be pre-allocated so has to be grown via re-allocations and copies:

library(GenomicRanges)
query <- GRanges("chr1", IRanges(1, 1:9500))
subject <- GRanges("chr1", IRanges(1:9500, 9500))
system.time(q2s <- findOverlaps(query, subject, select="arbitrary"))
#    user  system elapsed 
#   0.029   0.000   0.031 
system.time(hits <- findOverlaps(query, subject, select="all"))
#    user  system elapsed 
#   2.948   0.407   3.355

The more number of hits per query (in average), the worse select="all" will perform with respect to select="first", "last", or "arbitrary".

The select="arbitrary" case is the workhorse behind overlapsAny(), %over%, and %within%.

LTLA commented 5 years ago

Well, I can't say it was easy, but select!="all" optimizations are done. Note that the lack of special behaviour for a GRF subject when select="arbitrary" is deliberate; I'd have to unique the indices anyway to ensure that the query doesn't select a range that isn't used.

LTLA commented 4 years ago

Nudge.

Bioconductor / GenomicRanges

Specialize findOverlaps for GRangesFactor objects #28