deepcharles / ruptures

ruptures: change point detection in Python
BSD 2-Clause "Simplified" License

Suggestion - refactor(Binseg): use priority queue #226

Closed Lucas-Prates closed 2 years ago

Lucas-Prates commented 2 years ago

Hello again!

I was looking at the implementation of the `Binseg` class, more specifically the fitting part of the `_seg` method:

bkps = [self.n_samples]
stop = False
while not stop:
    stop = True
    new_bkps = [
        self.single_bkp(start, end) for start, end in pairwise([0] + bkps)
    ]
    bkp, gain = max(new_bkps, key=lambda x: x[1])

    if bkp is None:  # all possible configurations have been explored
        break

    if n_bkps is not None:
        if len(bkps) - 1 < n_bkps:
            stop = False
    elif pen is not None:
        if gain > pen:
            stop = False
    elif epsilon is not None:
        error = self.cost.sum_of_costs(bkps)
        if error > epsilon:
            stop = False

    if not stop:
        bkps.append(bkp)
        bkps.sort()

At first glance, it looked like it could be improved a bit: for instance, by avoiding recreating `new_bkps` on every iteration, avoiding `pairwise`, avoiding re-sorting `bkps`, and avoiding recomputing `self.single_bkp` on already-computed indices. For the last point, however, the `@lru_cache` decorator already seems to do a great job of avoiding recomputation.
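To illustrate that last point, here is a minimal sketch of how `@lru_cache` memoizes repeated `(start, end)` queries. The `single_bkp` below is a hypothetical stand-in that only logs its real evaluations, not the actual ruptures method:

```python
from functools import lru_cache

calls = []  # records every evaluation that actually runs the body

@lru_cache(maxsize=None)
def single_bkp(start, end):
    # hypothetical stand-in for Binseg.single_bkp: log each real evaluation
    # so that repeated (start, end) pairs can be seen hitting the cache
    calls.append((start, end))
    return (start + end) // 2, float(end - start)

single_bkp(0, 100)
single_bkp(0, 100)  # cache hit: the body does not run again
single_bkp(0, 50)
print(len(calls))   # 2 real evaluations for 3 calls
```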

I tried to address those points using a priority queue (via a min-heap from the `heapq` module), as follows:

import heapq

bkps = [self.n_samples]
start, end = 0, self.n_samples
bkp, gain = self.single_bkp(start, end)

# store the negated gain so the min-heap pops the largest gain first
node = (-gain, start, bkp, end)
bkp_queue = [node]
heapq.heapify(bkp_queue)  # min-heap
stop = False
while not stop:
    stop = True
    gain, start, bkp, end = heapq.heappop(bkp_queue) 
    gain = -gain

    if bkp is None:  # all possible configurations have been explored
        break

    if n_bkps is not None:
        if len(bkps) - 1 < n_bkps:
            stop = False
    elif pen is not None:
        if gain > pen:
            stop = False
    elif epsilon is not None:
        error = self.cost.sum_of_costs(bkps)
        if error > epsilon:
            stop = False

    if not stop:
        bkps.append(bkp)
        # searches (start, bkp) and pushes to queue
        node_bkp, node_gain = self.single_bkp(start, bkp)
        node = (-node_gain, start, node_bkp, bkp)
        heapq.heappush(bkp_queue, node)

        # searches (bkp, end) and pushes to queue
        node_bkp, node_gain = self.single_bkp(bkp, end)
        node = (-node_gain, bkp, node_bkp, end)
        heapq.heappush(bkp_queue, node)
bkps.sort()

This implementation also avoids recomputation, without relying on `@lru_cache`. I tested it, and the output is exactly the same as with the previous implementation. However, there was almost no performance gain. I ran a simulation study with small sample sizes, varying the number of breakpoints (20%, 40%, 60%, 80% of the sample size), and the figure below shows the results.

[Figure: runtime comparison of the two implementations]

Hence, I did not open a PR, since there is no clear advantage to this implementation and it concerns a very important part of the package. For large datasets there may be an advantage, since cache misses might become more common; however, I know little about the subject and my PC does not allow me to run simulations with huge datasets.
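For anyone who wants to experiment with the idea, the proposed structure can be exercised end-to-end with a self-contained toy version. The cost function and `best_split` below are simplified stand-ins for `self.cost` and `self.single_bkp`, not the ruptures API; only the heap-driven control flow mirrors the snippet above:

```python
import heapq

def sum_sq(signal, start, end):
    # cost of segment signal[start:end]: sum of squared deviations from its mean
    seg = signal[start:end]
    mean = sum(seg) / len(seg)
    return sum((x - mean) ** 2 for x in seg)

def best_split(signal, start, end):
    # stand-in for single_bkp: best breakpoint in (start, end) and its gain
    best_bkp, best_gain = None, 0.0
    base = sum_sq(signal, start, end)
    for bkp in range(start + 1, end):
        gain = base - sum_sq(signal, start, bkp) - sum_sq(signal, bkp, end)
        if gain > best_gain:
            best_bkp, best_gain = bkp, gain
    return best_bkp, best_gain

def binseg_heap(signal, n_bkps):
    # priority-queue binary segmentation: always split the segment
    # whose best split yields the largest gain
    n = len(signal)
    bkps = [n]
    bkp, gain = best_split(signal, 0, n)
    queue = [(-gain, 0, bkp, n)]  # min-heap on -gain == max-heap on gain
    while len(bkps) - 1 < n_bkps:
        neg_gain, start, bkp, end = heapq.heappop(queue)
        if bkp is None:  # no segment can be split further
            break
        bkps.append(bkp)
        for s, e in ((start, bkp), (bkp, end)):
            nb, ng = best_split(signal, s, e)
            heapq.heappush(queue, (-ng, s, nb, e))
    return sorted(bkps)

signal = [0.0] * 10 + [5.0] * 10
print(binseg_heap(signal, 1))  # [10, 20]
```

Each segment is examined once when it is created, so no `lru_cache` is needed; the heap simply replaces the per-iteration `max` over all segments.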

Thank you!

deepcharles commented 2 years ago

Thanks again! This seems like a good idea. We'll look into it shortly

deepcharles commented 2 years ago

Closing for now.