Drifts 26/27: best ARIMA models for drift detection

DavorJ commented 4 years ago

This analysis is similar to #63, but here we use a more complex ARIMA model.

Some explanation:

The middle left is the AR(1) model and the middle right is AR(1) with a seasonality component.
The bottom left is AR(1) with a drift component, and the bottom right is AR(1) with both drift and seasonality components.
The top left is the best ARMA model and the top right is the best ARIMA model, both with the drift (from the bottom right model) and seasonality components. Their main purpose here is to indicate what the best ARMA/ARIMA model is. They also clearly show that a random walk model (i.e. one that uses the "I" term) is bad for modeling drift. Currently I prefer AR(1) -- it is relatively simple and has the most explanatory power.

The advantage of AR(1) vs. #63 is that the serial correlation is taken into account up to a certain point, making the significance calculation of the drift and seasonality components more correct. So one straightforward approach for the _detectdrifts() function is to just check the significance and flag drifts based on that.
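To illustrate why serial correlation matters for the significance of drift, here is a minimal sketch in plain Python. It is not the ARIMA fit used in this analysis: it fits a simple linear trend and then inflates the slope's standard error with the common large-sample AR(1) factor sqrt((1 + phi) / (1 - phi)). The function names and the simulated series are hypothetical and only serve the illustration.

```python
import math
import random


def simulate_ar1(n, phi, rng):
    """Simulate n points of zero-mean AR(1) noise: u[t] = phi*u[t-1] + e[t]."""
    u, out = 0.0, []
    for _ in range(n):
        u = phi * u + rng.gauss(0.0, 1.0)
        out.append(u)
    return out


def fit_trend(y):
    """Fit y[t] = a + b*t by least squares; return (b, residuals, sxx)."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    b = sxy / sxx
    a = y_mean - b * t_mean
    resid = [v - (a + b * t) for t, v in enumerate(y)]
    return b, resid, sxx


def drift_z_scores(y):
    """Return (naive_z, adjusted_z, phi_hat) for the drift slope of y."""
    n = len(y)
    b, resid, sxx = fit_trend(y)
    sigma2 = sum(r * r for r in resid) / (n - 2)
    se_naive = math.sqrt(sigma2 / sxx)
    # Lag-1 autocorrelation of the residuals (zero mean by construction).
    phi_hat = sum(r0 * r1 for r0, r1 in zip(resid[:-1], resid[1:])) / sum(
        r * r for r in resid
    )
    # Clamp so the factor stays finite and never shrinks the standard error.
    phi_c = min(max(phi_hat, 0.0), 0.99)
    se_adjusted = se_naive * math.sqrt((1 + phi_c) / (1 - phi_c))
    return b / se_naive, b / se_adjusted, phi_hat


rng = random.Random(42)
series = simulate_ar1(500, 0.8, rng)  # correlated noise, no real drift
naive_z, adjusted_z, phi_hat = drift_z_scores(series)
print(f"phi_hat={phi_hat:.2f} naive z={naive_z:.2f} adjusted z={adjusted_z:.2f}")
```

With positively correlated residuals the adjusted z-score is always smaller in magnitude than the naive one, which is the same direction as the more conservative flagging seen with AR(1) compared to the simple linear model.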

The weird thing is that in some cases, the effect of the AR(1) term is significantly less (down to non-existent) when drift/seasonality is taken into account. (I would expect the AR(1) effect to remain the same.) This also raises a question for further modeling: do we take the AR(1) parameter as fixed (i.e. compute it on good barometers) and then calculate seasonality/drift based on that, or do we allow it to change to the best fit, as in this analysis? The same goes for the variance.
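The "fixed AR(1) parameter" option can be sketched as follows, assuming a model of the form y[t] = a + b*t + u[t] with u[t] = phi*u[t-1] + e[t] and phi supplied from elsewhere (for example from good barometers). With phi known, quasi-differencing the series removes the correlation and the drift b follows from an ordinary regression. This is an illustration with hypothetical names, not the implementation used here, and it leaves out seasonality for brevity.

```python
import math
import random


def drift_with_fixed_phi(y, phi):
    """Estimate drift b and its standard error with the AR(1) parameter fixed."""
    # Quasi-difference both sides:
    #   y[t] - phi*y[t-1] = a*(1 - phi) + b*(t - phi*(t - 1)) + e[t]
    ys = [y[t] - phi * y[t - 1] for t in range(1, len(y))]
    xs = [t - phi * (t - 1) for t in range(1, len(y))]
    n = len(ys)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (v - y_mean) for x, v in zip(xs, ys))
    b = sxy / sxx
    intercept = y_mean - b * x_mean
    resid = [v - (intercept + b * x) for x, v in zip(xs, ys)]
    sigma2 = sum(r * r for r in resid) / (n - 2)
    return b, math.sqrt(sigma2 / sxx)


# Hypothetical series: drift 0.05 per step plus AR(1) noise with phi = 0.7.
rng = random.Random(7)
true_b, true_phi = 0.05, 0.7
u, simulated = 0.0, []
for t in range(400):
    u = true_phi * u + rng.gauss(0.0, 1.0)
    simulated.append(2.0 + true_b * t + u)

b_hat, b_se = drift_with_fixed_phi(simulated, true_phi)
print(f"drift estimate {b_hat:.4f} +/- {b_se:.4f}")
```

Letting phi vary instead would mean estimating it jointly with the drift, which is what the best-fit models in this analysis do.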

Analysis 27 pdf is a comparison of both.

Concentrate on the bottom right plots of both analyses. One can see that AR(1) is much more conservative in setting the blue vertical line than the simple linear model (#63). And in my opinion it seems to work better.

Some things that would need a more complex model:

Difference in measurement intervals: most barometers measure in 12h intervals, but some also in 24h intervals. Mixing them requires a more complex custom model. Missing measurements (and thus possibly large intervals), as long as there are only a few, shouldn't be a problem.
Hysteresis: increasing variance detection would also require a more complex custom model.
Fixing a certain parameter, such as AR(1), can only be done with a custom model.
Increasing-in-time seasonality component would require a custom model.
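The point about mixed measurement intervals can be made concrete with a small simulation. Assuming the 12h series behaves like an AR(1) process with parameter phi, the same process observed every 24h has a lag-1 autocorrelation of phi squared, so an AR(1) parameter estimated at one interval cannot be reused directly at the other. The sketch below uses hypothetical names and simulated data only.

```python
import random


def simulate_ar1(n, phi, rng):
    """Simulate n points of zero-mean AR(1) noise: u[t] = phi*u[t-1] + e[t]."""
    u, out = 0.0, []
    for _ in range(n):
        u = phi * u + rng.gauss(0.0, 1.0)
        out.append(u)
    return out


def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a series."""
    mean = sum(x) / len(x)
    d = [v - mean for v in x]
    return sum(a * b for a, b in zip(d[:-1], d[1:])) / sum(v * v for v in d)


rng = random.Random(1)
phi_12h = 0.8
series_12h = simulate_ar1(20000, phi_12h, rng)
series_24h = series_12h[::2]  # keep every second measurement

r_12h = lag1_autocorr(series_12h)
r_24h = lag1_autocorr(series_24h)
print(f"12h lag-1: {r_12h:.2f}, 24h lag-1: {r_24h:.2f}, phi^2: {phi_12h ** 2:.2f}")
```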

fredericpiesschaert commented 4 years ago

I'm afraid I can't make any meaningful contribution to the methodology that is used, but I see in the results that the conservative approach of the ARIMA model seems better at detecting real drift and will result in fewer false positives. For example, there seems to be no drift in these cases, but the simple model draws a breakpoint anyway:

fredericpiesschaert commented 4 years ago

can't figure out why it detects drift in some cases though:

fredericpiesschaert commented 4 years ago

you can't ignore seasonality if you want to detect drift properly, that is obvious in many examples:

DavorJ commented 4 years ago

> you can't ignore seasonality if you want to detect drift properly, that is obvious in many examples.

Indeed.

The reason why drift is detected in some "not-so-straightforward" cases (as you point out) is that the linear model is too simple. In the case of the optimal AR(1) fit, it is because the model "overestimates" its correlation parameter (i.e. the AR, or autoregressive, component), which subsequently results in "overestimation" of drift. It is "overestimation", and thus not necessarily wrong. Even your not-so-straightforward case is drifting:

See the seasonality peaks at my blue markers? The black curve is a best fit -- assuming Gaussian noise -- and according to that the series is drifting down....

That is why I think/hope that fixing the AR component as in #65 will counterbalance the "overestimation" and produce better results. (Theory is usually nicer than practice... :))

DOV-Vlaanderen / groundwater-logger-validation

Drifts 26/27: best ARIMA models for drift detection #64