broadinstitute / catch

A package for designing compact and comprehensive capture probe sets.
MIT License
76 stars 16 forks source link

Speed up construction of universe from IntervalSets #21

Closed haydenm closed 5 years ago

haydenm commented 5 years ago

set_cover.approx_multiuniverse() constructs the universes by taking the union over all input sets. When use_intervalsets is True, this was previously slow. (use_intervalsets is True for most runs because set_cover_filter sets it to True.) It worked by creating an empty IntervalSet, and then repeatedly taking the union of the universe's IntervalSet with another input set (also represented as an IntervalSet). In the worst-case, taking the union takes O(n) time where n is the current size of the universe. As a result, constructing the universe took, in the worst-case, O(n^2) time. For particularly long genomes, where n is large, this could take hours in practice.

This fix instead initializes each universe by simply constructing a list of the input intervals for it, and then initializing an IntervalSet from the list. Initializing the IntervalSet already merges overlapping intervals (effectively taking the union of all of them) in O(N log N) time, where N is the number of input intervals. Since the input sets are constructed from the universe, N is O(n), where n is defined above. So, now, constructing the universe takes, in the worst-case, O(n log n) time. In practice, this seems to provide a noticeable speed up as well.

coveralls commented 5 years ago

Coverage Status

Coverage increased (+0.008%) to 95.006% when pulling 8c98893155f213adbf2d5c6755eaccda3c774fb9 on speed-sc-setup into 745499f9e2c3d2c88fc6da45b0e10c6c58a6de22 on master.