deepcharles / ruptures

ruptures: change point detection in Python
BSD 2-Clause "Simplified" License
1.56k stars 161 forks source link

Binseg incorrect for small data, l2 loss and min_size=1 #242

Closed tdhock closed 2 years ago

tdhock commented 2 years ago

Hi @deepcharles thanks for putting your implementation of binary segmentation online. I am trying to use it to compute the full path of binary segmentation models, but the predict method is giving me an exception for large numbers of breakpoints, and incorrect changepoints for small numbers of breakpoints. Here is a small data example:

import ruptures as rpt
import numpy as np
rpt.version.version
data_list = [0.0, 0.1, 1.2, 1.0]
N_data = len(data_list)
data_mat = np.array(data_list).reshape(N_data,1)
algo = rpt.Binseg(model="l2",min_size=1,jump=1).fit(data_mat)
computed_break_dict={}
for n_bkps in range(N_data):
    try:
        result = algo.predict(n_bkps=n_bkps)
    except Exception as e:
        result = e
    computed_break_dict[n_bkps] = result
computed_break_dict
expected_break_dict = {
    0:[4],
    1:[2,4],
    2:[2,3,4],
    3:[1,2,3,4]
    }

In this case there are four data points. I expected the first change to be after the second data point, etc (as shown in expected_break_dict above). Instead I got the results in computed_break_dict, which are not expected:

deepcharles commented 2 years ago

Hi,

Thanks for raising this issue. Here, the min_size (minimum segment length) is not handled correctly, resulting in bad segmentations without raising a BadSegmentation error. We will correct it.

oboulant commented 2 years ago

Hi @tdhock ,

Indeed, and we have to address it !

But, just so you know, if you use a min_size >= 2, then you have the expected results for the given min_size.

Here, the misbehaviour is due to the fact that, from the user perspective you think you have a min_size to 1, but in really it is overwritten by the minimum value for a given cost (here L2). See here for the minimum value for L2 cost. See here for the place where the set value (you set value to 1) is overwritten to 2.

Hope it helps !

tdhock commented 2 years ago

ok, but for the l2 cost (square loss change in normal mean, constant variance) the min_size can be 1, so I still expect a valid segmentation should be returned (not an exception).

oboulant commented 2 years ago

Closing the issue since it has been addressed in https://github.com/deepcharles/ruptures/pull/255