Binseg incorrect for small data, l2 loss and min_size=1

tdhock commented 2 years ago

Hi @deepcharles thanks for putting your implementation of binary segmentation online. I am trying to use it to compute the full path of binary segmentation models, but the predict method is giving me an exception for large numbers of breakpoints, and incorrect changepoints for small numbers of breakpoints. Here is a small data example:

import ruptures as rpt
import numpy as np
rpt.version.version
data_list = [0.0, 0.1, 1.2, 1.0]
N_data = len(data_list)
data_mat = np.array(data_list).reshape(N_data,1)
algo = rpt.Binseg(model="l2",min_size=1,jump=1).fit(data_mat)
computed_break_dict={}
for n_bkps in range(N_data):
    try:
        result = algo.predict(n_bkps=n_bkps)
    except Exception as e:
        result = e
    computed_break_dict[n_bkps] = result
computed_break_dict
expected_break_dict = {
    0:[4],
    1:[2,4],
    2:[2,3,4],
    3:[1,2,3,4]
    }

In this case there are four data points. I expected the first change to be after the second data point, etc (as shown in expected_break_dict above). Instead I got the results in computed_break_dict, which are not expected:

I expected the result for n_bkps=1 to be a list of two segment ends, but I only got one.

I expected the results for n_bkps=2 or 3 to be a list of segment ends, but I got BadSegmentationParameters exceptions. Am I using Binseg correctly, and is this a bug? If I am not using it correctly, can you please explain the correct way? Thanks in advance! Here is the output I got from running the code above on my system:

Python 3.7.6 | packaged by conda-forge | (default, Jun  1 2020, 18:11:50) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import ruptures as rpt
>>> import numpy as np
>>> rpt.version.version
'1.1.6'
>>> data_list = [0.0, 0.1, 1.2, 1.0]
>>> N_data = len(data_list)
>>> data_mat = np.array(data_list).reshape(N_data,1)
>>> algo = rpt.Binseg(model="l2",min_size=1,jump=1).fit(data_mat)
>>> computed_break_dict={}
>>> for n_bkps in range(N_data):
...     try:
...         result = algo.predict(n_bkps=n_bkps)
...     except Exception as e:
...         result = e
...     computed_break_dict[n_bkps] = result
>>> computed_break_dict
{0: [4], 1: [4], 2: BadSegmentationParameters(), 3: BadSegmentationParameters()}
>>> expected_break_dict = {
...     0:[4],
...     1:[2,4],
...     2:[2,3,4],
...     3:[1,2,3,4]
...     }

deepcharles commented 2 years ago

Hi,

Thanks for raising this issue. Here, the min_size (minimum segment length) is not handled correctly, resulting in bad segmentations without raising a BadSegmentation error. We will correct it.

oboulant commented 2 years ago

Hi @tdhock ,

Indeed, and we have to address it !

But, just so you know, if you use a min_size >= 2, then you have the expected results for the given min_size.

Here, the misbehaviour is due to the fact that, from the user perspective you think you have a min_size to 1, but in really it is overwritten by the minimum value for a given cost (here L2). See here for the minimum value for L2 cost. See here for the place where the set value (you set value to 1) is overwritten to 2.

Hope it helps !

tdhock commented 2 years ago

ok, but for the l2 cost (square loss change in normal mean, constant variance) the min_size can be 1, so I still expect a valid segmentation should be returned (not an exception).

oboulant commented 2 years ago

Closing the issue since it has been addressed in https://github.com/deepcharles/ruptures/pull/255

deepcharles / ruptures

Binseg incorrect for small data, l2 loss and min_size=1 #242