deepcharles / ruptures

ruptures: change point detection in Python
BSD 2-Clause "Simplified" License
1.62k stars 163 forks source link

Estimating confidence that a breakpoint occurs in a 1D array #135

Closed spencerahill closed 3 years ago

spencerahill commented 3 years ago

I have a few hundred 1-D timeseries, each ~120 points. Each timeseries may or may not have one or more breakpoints caused by changes in instrumentation, with the timing and nature of the instrumentation change differing from timeseries to timeseries and not known in advance for any of them. I am interested in estimating some measure of confidence for each timeseries that at least one breakpoint exists for that timeseries (in addition to when that breakpoint occurs).

What would you recommend using for this? This sounds to me like a statistical test based on BinSeg with n=1 breakpoints, but I'm new to breakpoints overall and to the ruptures package and so it's not obvious to me if that's correct conceptually nor how to do that with ruptures. Apologies if I'm moving too quickly and thus missing something clear in the docs.

spencerahill commented 3 years ago

Edit: the fact that there are multiple timeseries is irrelevant; ultimately the question boils down to computing this for a single timeseries. (Since, c.f. #136, I can efficiently broadcast the calculation from one to all the rest of my timeseries). Thanks!

oboulant commented 3 years ago

Hi @spencerahill ,

Thanks again for your interest in ruptures !

You can indeed use the BinSeg search method. You can also use the Kernel change point detection with the Pelt algorithm search method (see here for an example).

Pelt allows you to compute an unknown number of change points.

As for the measure of confidence, not knowing anything about the underlying data, what I would do for a start is :

  1. For a several values of the Pelt penalty, compute change points using ruptures Kernel change point detection with the Pelt algorithm
  2. For each Pelt penalty, compute the returned number of change points
  3. Plot the graph (Pelt penalty value, number of returned change points)
  4. Study if there is a "discontinuity" in the plot and/or how hard it is to "constraint" the algorithm the return 0 change points with a higher penalty value.

Hope it is clear ?

Finally, since this is more of a discussion than an issue with the package, could you please move this to the discussion section ?

Olivier

deepcharles commented 3 years ago

I am interested in estimating some measure of confidence for each timeseries that at least one breakpoint exists for that timeseries (in addition to when that breakpoint occurs).

This is clearly something that ruptures should propose (or at least document). We'll work on it in the coming weeks.

This setting is known as AMOC (at most one change). If you have iid Gaussian data, you can have an exact p-value for the presence (or absence) of a change. See this recent article. For a simpler approach, you can use multiple hypothesis testing procedures (that control the false discovery rate or the familywise error rate). Thoses approaches can handle more complex data and provide p-values, but are approximate. You can also use AIC or BIC (or any other model selection methods) but those will not give you p-values, just a binary decision (change or no change).

I can dig up a few references for the multiple hypothesis testing procedures (which also happen to be a research interest of mine), but this will have to wait until next week.

Cheers

deepcharles commented 3 years ago

Note that all those procedures (except for the first one, the exact computation of a p-value) do not need much additional code.

spencerahill commented 3 years ago

Thanks both for the detailed responses.

This is clearly something that ruptures should propose (or at least document). We'll work on it in the coming weeks.

In light of this, perhaps it's OK to leave this as an Issue rather than move to Discussion? Either way fine just let me know.

This setting is known as AMOC (at most one change)

Perhaps this is just a nomenclature confusion on my part, but I'm interested in at least one change (is "ALOC" a thing?)

As for the measure of confidence, not knowing anything about the underlying data, what I would do for a start is :

  1. For a several values of the Pelt penalty, compute change points using ruptures Kernel change point detection with the Pelt algorithm
  2. For each Pelt penalty, compute the returned number of change points
  3. Plot the graph (Pelt penalty value, number of returned change points)
  4. Study if there is a "discontinuity" in the plot and/or how hard it is to "constraint" the algorithm the return 0 change points with a higher penalty value.

That makes sense. I'll let you know once I've given a try if I have any problems with it. In terms of the data, it's a mixed bag. The ones that don't have an obvious break by eye are pretty fairly Gaussian. But others by eye you can see a fundamental shift in both mean and variance leading to weird, sort-of bimodal distributions. And still others are intermediate.

deepcharles commented 3 years ago

Perhaps this is just a nomenclature confusion on my part, but I'm interested in at least one change (is "ALOC" a thing?)

If the task consists in labelling certain signals as "contain one change or more" vs "do not contain any change", I guess both setting are equivalent.

In any case, I will close this issue for now. Feel free to reopen.