deepcharles / ruptures

ruptures: change point detection in Python
BSD 2-Clause "Simplified" License

Normalization and Cost #310

Closed horsto closed 8 months ago

horsto commented 9 months ago

Hi, this is a question, not an issue. I have a bunch of features that I track over time. I am feeding them into

algo = rpt.Pelt(model=model, min_size=1, jump=1)
algo.fit(signal)
result = algo.predict(pen=p) # RESULT OF CHANGE POINT DETECTION

`signal` here is, for example, a 500x16 array (timepoints x features). The features themselves live on quite different scales, so I thought that some kind of scaling / normalization (for example via https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html#sklearn.preprocessing.scale) could make sense. I wonder, though, how different costs would be affected by that. In the example attached below you can see the normalized signal for L1 and L2 norms; change points are depicted with dashed lines. There are some obvious misses (calibrating the penalty sometimes helps, but it is a finicky process). Should normalization be skipped altogether / is there a better alternative cost for these kind of signals?

[Image: Screenshot 2023-10-11 at 15 36 46]

tg12 commented 8 months ago

As an unrelated question: what are you using to draw these graphs?

> Should normalization be skipped altogether / is there a better alternative cost for these kind of signals?

I agree that in some instances there might be a need to skip any preprocessing of the data. That can be handled upstream if needed, unless it's an inherent part of the PELT algorithm.

horsto commented 8 months ago

I don't think it's inherent to the PELT algorithm, unless there is some hidden preprocessing going on (?).

I would like to know whether I should do my own normalization up front, and how it might affect certain cost functions in the PELT algorithm (L1, L2, ...).

The plotting is just matplotlib + seaborn!

deepcharles commented 8 months ago

Hi,

Sorry for the late reply.

Whether to normalize is task-dependent, and there is no definitive answer. For multivariate signals, PELT will detect the largest shifts, i.e., those with a large norm ||m_before - m_after||, where m_before and m_after are the multivariate averages just before and after the change. As an example, consider the following 2D signal.

[Image: raw]

One dimension has large shifts and the other has small shifts. Without normalization, only changes in the large dimension are detected.
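A short derivation may make this concrete. It assumes the L2 cost (sum of squared deviations from the segment mean); the symbols n_1 and n_2 (lengths of the segments before and after a candidate change) are notation introduced here, and m_before and m_after are taken as the means over those two segments. Under that assumption, the drop in total cost from splitting one segment in two is:

```latex
\[
  c(\text{whole}) - c(\text{before}) - c(\text{after})
  = \frac{n_1 n_2}{n_1 + n_2}\,
    \lVert m_{\text{before}} - m_{\text{after}} \rVert^2
  = \frac{n_1 n_2}{n_1 + n_2}
    \sum_{d} \bigl( m_{\text{before},d} - m_{\text{after},d} \bigr)^2
\]
```

Roughly speaking, a change point is worth keeping when this drop exceeds the penalty. Because the squared norm is a sum over dimensions, a dimension whose shifts are 100 times larger contributes 10^4 times more, so a penalty tuned to the large dimension can hide changes in the small one.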

rpt.display(s, [], rpt.Pelt().fit(s).predict(pen=50))

[Image: bkps1]

After normalization, all changes are detected.

# s now holds the normalized signal
rpt.display(s, [], rpt.Pelt().fit(s).predict(pen=50))

[Image: bkps2]
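To illustrate the effect of scale on an L2-type cost, here is a minimal standard-library sketch. It does not use ruptures and is not its implementation; the helper names (`l2_cost`, `split_gain`, `zscore`, `gains`) and the toy signal are made up for this illustration. It measures how much the cost drops when a segment is split at a true change, before and after scaling each column to zero mean and unit variance.

```python
import random
from statistics import fmean, pstdev


def l2_cost(segment):
    """Sum of squared deviations from the per-column mean (L2-type cost)."""
    total = 0.0
    for column in zip(*segment):
        mean = fmean(column)
        total += sum((x - mean) ** 2 for x in column)
    return total


def split_gain(segment, t):
    """How much the L2 cost drops when `segment` is split at index t."""
    return l2_cost(segment) - l2_cost(segment[:t]) - l2_cost(segment[t:])


def zscore(signal):
    """Centre each column and scale it to unit variance."""
    stats = [(fmean(c), pstdev(c)) for c in zip(*signal)]
    return [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in signal]


# Made-up toy signal: column 0 jumps by 100 at t=100,
# column 1 jumps by 1 at t=200 (noise is 1% of each jump).
random.seed(0)
signal = [
    [100.0 * (t >= 100) + random.gauss(0, 1.0),
     1.0 * (t >= 200) + random.gauss(0, 0.01)]
    for t in range(300)
]


def gains(sig):
    """Cost drop at the change in column 0 and at the change in column 1."""
    big = split_gain(sig[0:200], 100)      # only column 0 changes here
    small = split_gain(sig[100:300], 100)  # only column 1 changes here
    return big, small


raw_big, raw_small = gains(signal)
norm_big, norm_small = gains(zscore(signal))
print(f"raw:        {raw_big:.1f} vs {raw_small:.1f}")
print(f"normalized: {norm_big:.1f} vs {norm_small:.1f}")
```

On the raw signal the two gains differ by several orders of magnitude, so a single penalty cannot treat both changes alike; after scaling they are of similar size. Whether that is desirable depends on whether small-scale features should count as much as large-scale ones for the task at hand.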

Hope this helps