set_cover.approx_multiuniverse() constructs the universes by taking
the union over all input sets. When use_intervalsets is True,
this was previously slow. (use_intervalsets is True for most runs because
set_cover_filter sets it to True.) It worked by creating an empty
IntervalSet, and then repeatedly taking the union of the universe's
IntervalSet with another input set (also represented as an IntervalSet).
In the worst-case, taking the union takes O(n) time where n is the
current size of the universe. As a result, constructing the universe
took, in the worst-case, O(n^2) time. For particularly long genomes,
where n is large, this could take hours in practice.
This fix instead initializes each universe by simply constructing
a list of the input intervals for it, and then initializing an IntervalSet
from the list. Initializing the IntervalSet already merges overlapping
intervals (effectively taking the union of all of them) in O(N log N)
time, where N is the number of input intervals. Since the input sets
are constructed from the universe, N is O(n), where n is defined above.
So, now, constructing the universe takes, in the worst-case, O(n log n)
time. In practice, this seems to provide a noticeable speed up as well.
Coverage increased (+0.008%) to 95.006% when pulling 8c98893155f213adbf2d5c6755eaccda3c774fb9 on speed-sc-setup into 745499f9e2c3d2c88fc6da45b0e10c6c58a6de22 on master.
set_cover.approx_multiuniverse()
constructs the universes by taking the union over all input sets. Whenuse_intervalsets
is True, this was previously slow. (use_intervalsets
is True for most runs becauseset_cover_filter
sets it to True.) It worked by creating an empty IntervalSet, and then repeatedly taking the union of the universe's IntervalSet with another input set (also represented as an IntervalSet). In the worst-case, taking the union takes O(n) time where n is the current size of the universe. As a result, constructing the universe took, in the worst-case, O(n^2) time. For particularly long genomes, where n is large, this could take hours in practice.This fix instead initializes each universe by simply constructing a list of the input intervals for it, and then initializing an IntervalSet from the list. Initializing the IntervalSet already merges overlapping intervals (effectively taking the union of all of them) in O(N log N) time, where N is the number of input intervals. Since the input sets are constructed from the universe, N is O(n), where n is defined above. So, now, constructing the universe takes, in the worst-case, O(n log n) time. In practice, this seems to provide a noticeable speed up as well.