deepcharles / ruptures

ruptures: change point detection in Python

Use `KernSeg` with model selection as described in JMLR paper #223

Closed · mlondschien closed this issue 2 years ago

mlondschien commented 2 years ago

Sylvain Arlot, Alain Celisse, and Zaid Harchaoui provide theory and a heuristic for model selection with KernSeg in their paper "A Kernel Multiple Change-point Algorithm via Model Selection". See Section 3.3.2, Theorem 2, and Appendix B.3.

The penalty they propose does not scale linearly with the number of change-points, so unfortunately it is incompatible with the current implementation. Furthermore, the heuristic they propose requires knowledge of the respective losses for a range of possible numbers of change-points, which currently (to the best of my understanding) cannot be recovered without expensive refits.
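For reference, my reading of Section 3.3.2 and Appendix B.3 (paraphrased, so the exact constants may be off) is that the selected number of change-points $\hat{D}$ minimizes a penalized criterion of the shape

$$
\hat{D} \in \operatorname*{arg\,min}_{0 \le D \le D_{\max}} \left\{ \hat{R}(D) + s_1 \log \binom{n-1}{D} + s_2 D \right\},
$$

where $\hat{R}(D)$ is the minimal empirical risk over segmentations with $D$ change-points (there are $\binom{n-1}{D}$ ways to place $D$ change-points among $n-1$ interior positions), and $s_1, s_2$ are calibrated by a linear fit of $\hat{R}(D)$ against the two penalty terms for large $D$, with the fitted slopes then multiplied by $-2$. The $\log \binom{n-1}{D}$ term is what breaks linearity in $D$.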

It would be great if this could be added.

deepcharles commented 2 years ago

Hello,

Indeed, this would be a great addition. It is actually already possible to run the heuristic with a single expensive fit: because of the recursive structure of dynamic programming, after computing the best segmentation with K change-points, all segmentations with 1, 2, ..., K-1 change-points are also available "for free".

For instance,

```python
import matplotlib.pyplot as plt
import ruptures as rpt

# generate a piecewise constant signal (2D, 2000 samples, 10 change-points)
signal, bkps = rpt.pw_constant(n_samples=2000, n_features=2, n_bkps=10, noise_std=10)
rpt.display(signal, bkps)
```

[image: the simulated signal with its 10 true change-points]

```python
# choose the segmentation algorithm
algo = rpt.KernelCPD(kernel="linear")
# set the maximum number of considered change-points
n_bkps_max = 50

# do the segmentation once; smaller segmentations are then available as "free" by-products
algo.fit(signal).predict(n_bkps=n_bkps_max)

# plot the empirical risk ("sum of costs") for each number of change-points
list_of_costs = list()
for n_bkps in range(1, n_bkps_max + 1):
    predicted_bkps = algo.predict(n_bkps=n_bkps)  # no computation is actually done here
    empirical_risk_value = algo.cost.sum_of_costs(predicted_bkps)
    list_of_costs.append(empirical_risk_value)

plt.plot(range(1, n_bkps_max + 1), list_of_costs, "*-")
```

[image: empirical risk ("sum of costs") versus number of change-points]

The linear tail of the curve is clearly visible on this synthetic example. All that remains is to implement the final part of the heuristic to find the "elbow".

mlondschien commented 2 years ago

Thank you! How can I recover the losses corresponding to the sequence of change points?

deepcharles commented 2 years ago

I have updated my answer.

deepcharles commented 2 years ago

Also, note that `algo.cost.sum_of_costs(predicted_bkps)` will work for any algorithm in ruptures.
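
For instance (a minimal sketch, reusing the kind of simulated signal from above and swapping in binary segmentation with an l2 cost):

```python
import ruptures as rpt

# simulated piecewise constant signal, as above
signal, bkps = rpt.pw_constant(n_samples=2000, n_features=2, n_bkps=10, noise_std=10)

# binary segmentation instead of KernelCPD
algo = rpt.Binseg(model="l2").fit(signal)
predicted_bkps = algo.predict(n_bkps=10)

# the fitted cost instance exposes the same sum_of_costs method
print(algo.cost.sum_of_costs(predicted_bkps))
```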

oboulant commented 2 years ago

Hi @mlondschien ,

Thanks for your interest in ruptures!

We actually have an "advanced example" in our documentation that uses this nice property of `ruptures.KernelCPD`: Music segmentation.

Moreover, in case you need to access them all at once: once you have performed the expensive fit with K change-points, all segmentations with 1, 2, ..., K-1 change-points are available in the `segmentations_dict` attribute:

```python
import ruptures as rpt

# generate a piecewise constant signal (2D, 2000 samples, 10 change-points)
signal, bkps = rpt.pw_constant(n_samples=2000, n_features=2, n_bkps=10, noise_std=10)
rpt.display(signal, bkps)

# choose the segmentation algorithm
algo = rpt.KernelCPD(kernel="linear")
# set the maximum number of considered change-points
n_bkps_max = 50

# do the segmentation once; smaller segmentations are then available as "free" by-products
algo.fit(signal).predict(n_bkps=n_bkps_max)

# print the dictionary with all intermediate segmentations
print(algo.segmentations_dict)
```
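
If it helps, the empirical risks for every number of change-points can then be collected in one pass (a small sketch building on the snippet above; it assumes, as the printed dictionary suggests, that `segmentations_dict` is keyed by the number of change-points):

```python
# map each number of change-points to its empirical risk ("sum of costs")
costs_by_n_bkps = {
    n_bkps: algo.cost.sum_of_costs(est_bkps)
    for n_bkps, est_bkps in algo.segmentations_dict.items()
}
print(costs_by_n_bkps)
```
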
mlondschien commented 2 years ago

Thanks @deepcharles and @oboulant for the input and thank you for working on ruptures.

I believe the following code implements the heuristic defined in the JMLR paper:

```python
import numpy as np
import ruptures as rpt
from scipy.special import betaln
from sklearn.linear_model import LinearRegression


def kernseg(X, n_bkps_max):
    algo = rpt.KernelCPD(kernel="rbf")

    # single expensive fit; all smaller segmentations are free by-products
    algo.fit(X).predict(n_bkps=n_bkps_max)

    # prepend the 0-change-point segmentation, then collect the empirical risks
    segmentations_values = [[len(X)]] + list(algo.segmentations_dict.values())
    costs = [algo.cost.sum_of_costs(est) for est in segmentations_values]

    # log of the binomial coefficient, via betaln for numerical stability
    # https://stackoverflow.com/questions/21767690/python-log-n-choose-k
    log_nchoosek = [
        -betaln(1 + n_bkps_max - k, 1 + k) - np.log(n_bkps_max + 1)
        for k in range(0, n_bkps_max + 1)
    ]

    # slope heuristic: regress the empirical risk on the two penalty terms over
    # the linear tail (the largest 40% of candidate numbers of change-points),
    # then penalize with -2 times the fitted slopes
    X_lm = np.array([log_nchoosek, range(0, n_bkps_max + 1)]).T
    lm = LinearRegression().fit(
        X_lm[int(0.6 * n_bkps_max) :, :], costs[int(0.6 * n_bkps_max) :]
    )
    adjusted_costs = costs - 2 * X_lm.dot(lm.coef_)

    # return the segmentation minimizing the penalized criterion (with leading 0)
    return [0] + segmentations_values[np.argmin(adjusted_costs)]
```
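
A quick sanity check on simulated data (a sketch, just to illustrate the intended call signature):

```python
import ruptures as rpt

# simulated piecewise constant signal with 10 true change-points
signal, true_bkps = rpt.pw_constant(n_samples=2000, n_features=2, n_bkps=10, noise_std=4)

estimated_bkps = kernseg(signal, n_bkps_max=50)
print(true_bkps)       # true change-points, ending with 2000
print(estimated_bkps)  # starts with 0; should ideally be close to the true change-points
```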

Let me know if you would like to add this to the existing package. I could then open a PR.

deepcharles commented 2 years ago

If you are up for it, that would be great. The idea would be to write a didactic example of this heuristic (in the spirit of this example, but with simulated data).

You only need to create a Jupyter notebook in docs/examples and we'll take care of integrating it into the docs.