New IndicatorChangesConfig: Continuous Linear Segments #53

Open Datseris opened 9 months ago

This issue is a continuation of the discussion originally in https://github.com/STOR-i/Changepoints.jl/issues/36 .

There we had the quote:

From my point of view, is the very important part of "change analysis" the continuous piece-wise linear regression. See for info the following paper. The method presented in this paper is similar to the standard PELT method for linear segments estimation, but it is moreover forced to be continuous (without step changes between linear segments!!!). See the code on GitLab, too.

The paper is "A new and general approach to signal denoising and eye movement classification based on segmented linear regression". It appears to be a blend between standard PELT and estimating changes in slope in indicator timeseries which @JanJereczek is interested in.

Following the outline I gave in #52 , one could define yet another analysis pipeline type that does the "PELT-like-continuus-linear-segments".

Holger Kantz gave a presentation where something similar is done:

A bi-linear function is fitted to the data (i.e., one line, then change to a line of differnt slope. Continuous fnction but discontinuous derivative). This type of function is fitted to the data (with e.g., a simple LsqFit.jl) and the change point is where the two the linear segments change slope.

So this is an extremely basic form of the algorithms we are discussing in this issue.

@stelmo I was considering whther I could use LinearSegmentation.jl for this. in principle, if I could force LinearSegmentation.jl to ensure that all linear segments are continuous (end points and start points coincide sequentially), then this would solve my problem. Do you know if this is possible? So far all figures in the README are discontinuous.

LinearSegmentation (LS) does give the overlap option, but that just enforces that neighboring segments share the same boundary points. The actual line of best fit does not need to be continuous, hence all the disjoint fits, even with overlap=true. It would be interesting to try to implement an option forcing a continuous fit. What time line are you talking about (I am going on holiday soon-ish for 3 weeks without internet access)? Also, I have found that LS is pretty sensitive to data noise and one needs to carefully tune the fitting thresholds to get nice fits, so that also needs a little work by me...

I don't have any timeline, this is a project without any deadlines! Therefore I'll wait for you because I don't know any internals of LS to add an option for enforcing continuity...

Okay, I extended my sliding_window function to have a continuous option (on master). Now you can create fits like this:

However, the sliding window method is very suboptimal (but simple and fast), and the papers you linked to look like they have much better algorithms. I plan on reading them + perhaps implementing their algos eventually. I hope to take a stab at this before I leave (3 weeks), but if not then, definitely before the end of the year.

For reference, the code that generated the plot:

using LinearSegmentation, CairoMakie

N = 100
xs = collect(range(0, 3 * pi, length = N)) .+ 0.1 .* randn(N)
ys = sin.(xs) .+ 0.1 .* randn(N)

segments = sliding_window(
    xs, 
    ys;
    min_segment_length = 0.5,
    fit_threshold = 0.1,
    fit_function = :rmse,
    overlap = true,
    continuous = true, # only on master
)

f, a = scatter(xs, ys)
for seg in segments
    b0, b1 = LinearSegmentation.endpoints(xs[seg], ys[seg])
    lines!(a, xs[seg], b0 .+ b1 .* xs[seg], linewidth=5)
end
f

This is very interesting! Thank you so much for trying this out. In your code you have enforced a minimum segment length, but you haven't enforced a maximum number of segments to use. How does the algorithm optimize this choice?

In my application scenarios I have timeseries. So the x axis is equi-spaced and sequential. Hence, the "optimal" line segment fit would be one that fits length(x)-1 segments, by connecting each sequential point in the timeseries. Here "optimal" means with least RMSE to the data, as this fit would have exactly 0 RMSE.

Ah, so the sliding window algorithm is very basic. There is no way to control the maximum number of returned segments, it merely makes a new segment when the error of a segment becomes "too large". In essence, it slides along the x-axis, appending data to a segment. When a user specified threshold is reached, it starts a new segment and keeps on sliding and adding data to this new segment. You can see the sub-optimality of this approach by looking at the attached figure at the trough: the yellow segment is a little wonky. The major advantage of this method is that it is fast (or can be made fast easily), as there is very little processing involved.

JuliaDynamics / TransitionsInTimeseries.jl

New IndicatorChangesConfig: Continuous Linear Segments #53