Open DavorJ opened 4 years ago
I'm afraid I can't make any meaningful contribution to the methodology that is used, but I see in the results that the conservative approach of the ARIMA model seems better at detecting real drift and will result in fewer false positives. For example, there seems to be no drift in these cases, but the simple model draws a breakpoint anyway:
I can't figure out why it detects drift in some cases, though:
You can't ignore seasonality if you want to detect drift properly; that is obvious in many examples:
> You can't ignore seasonality if you want to detect drift properly; that is obvious in many examples.
Indeed.
The reason drift is detected in some "not-so-straightforward" cases (as you point out) is that the linear model is too simple and, in the case of the optimal AR(1), that it "overestimates" its correlation parameter (i.e. the AR, or autoregressive, component), which in turn results in an "overestimation" of drift. It is an "overestimation", and thus not necessarily wrong. Even your not-so-straightforward case is drifting:
See the seasonality peaks at my blue markers? The black curve is a best fit (assuming Gaussian noise), and according to it the series is drifting down...
That is why I think/hope that fixing the AR component, as in #65, will counterbalance the "overestimation" and produce better results. (Theory is usually nicer than practice... :))
This analysis is similar to #63, but here we use a more complex ARIMA model.
Some explanation:
The advantage of AR(1) over #63 is that the serial correlation is taken into account up to a certain point, which makes the significance calculation for the drift and seasonality components more correct. So one straightforward approach for the _detectdrifts() function is to just check the significance and flag drifts based on that.
The weird thing is that in some cases the effect of the AR(1) term is significantly smaller (down to non-existent) when drift/seasonality is taken into account. (I would expect the AR(1) effect to remain the same.) This is also a question for further modeling: do we take the AR(1) parameter as fixed (i.e. compute it on good barometers) and then calculate seasonality/drift based on that, or do we let it vary as a best fit, as in this analysis? The same question applies to the variance.
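To make the "fixed AR(1) parameter" option concrete, here is a minimal sketch. It is not the code used in this analysis. It assumes a plain linear drift without seasonality, and that the AR(1) coefficient `phi` was already estimated elsewhere (e.g. on good barometers). The names `drift_with_fixed_ar1` and `simulate`, and the simulated data, are hypothetical.

```python
import math
import random


def drift_with_fixed_ar1(y, phi):
    """Estimate a linear drift slope with the AR(1) coefficient held fixed.

    Assumed model (illustration only):
        y_t = a + b*t + u_t,   u_t = phi*u_{t-1} + e_t
    Subtracting phi times the previous observation gives
        y_t - phi*y_{t-1} = a*(1 - phi) + b*(t - phi*(t - 1)) + e_t
    whose errors are uncorrelated, so ordinary least squares applies.
    Requires |phi| < 1. Returns (slope, t_statistic).
    """
    n = len(y)
    # Quasi-differenced response and regressors for t = 1..n-1
    z = [y[t] - phi * y[t - 1] for t in range(1, n)]
    x0 = [1.0 - phi] * (n - 1)                     # intercept column
    x1 = [t - phi * (t - 1) for t in range(1, n)]  # drift column

    # Normal equations for a two-regressor least-squares fit
    s00 = sum(u * u for u in x0)
    s01 = sum(u * v for u, v in zip(x0, x1))
    s11 = sum(v * v for v in x1)
    r0 = sum(u * w for u, w in zip(x0, z))
    r1 = sum(v * w for v, w in zip(x1, z))
    det = s00 * s11 - s01 * s01

    a_hat = (s11 * r0 - s01 * r1) / det
    b_hat = (s00 * r1 - s01 * r0) / det

    # Residual variance and standard error of the slope
    resid = [w - a_hat * u - b_hat * v for u, v, w in zip(x0, x1, z)]
    sigma2 = sum(e * e for e in resid) / (len(z) - 2)
    se_b = math.sqrt(sigma2 * s00 / det)
    return b_hat, b_hat / se_b


def simulate(n, slope, phi, seed):
    """Hypothetical test data: linear drift plus AR(1) noise with unit innovations."""
    rng = random.Random(seed)
    u, out = 0.0, []
    for t in range(n):
        u = phi * u + rng.gauss(0.0, 1.0)
        out.append(slope * t + u)
    return out


slope, t_stat = drift_with_fixed_ar1(simulate(500, 0.05, 0.6, seed=1), phi=0.6)
print(f"slope={slope:.4f}, t={t_stat:.1f}")
```

With `phi` fixed, the drift estimate no longer competes with the autocorrelation term for the same variation, which is the motivation for the fixed variant. Whether that actually works better on real barometer data is the open question.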
Analysis 27 pdf is a comparison of both.
Concentrate on the bottom-right plots of both analyses. One can see that AR(1) is much more conservative in setting the blue vertical line than the simple linear model (#63), and in my opinion it seems to work better.
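A rough way to see why accounting for autocorrelation is more conservative is sketched below. It is not the ARIMA fit from this analysis. It swaps in a simpler, well-known approximation: fit a plain linear trend, estimate the lag-1 autocorrelation of the residuals, and inflate the slope's standard error by sqrt((1 + phi) / (1 - phi)). The names `ols_trend`, `lag1_autocorr`, `flag_drift` and `simulate_ar1_series`, the 1.96 threshold, and the simulated data are all assumptions for illustration.

```python
import math
import random


def ols_trend(y):
    """Fit y_t = a + b*t by OLS; return slope, naive standard error, residuals."""
    n = len(y)
    t_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    b = sxy / sxx
    a = y_mean - b * t_mean
    resid = [v - (a + b * t) for t, v in enumerate(y)]
    sigma2 = sum(e * e for e in resid) / (n - 2)
    return b, math.sqrt(sigma2 / sxx), resid


def lag1_autocorr(resid):
    """Lag-1 autocorrelation of the (zero-mean) OLS residuals."""
    num = sum(resid[i] * resid[i - 1] for i in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den


def flag_drift(y, z_crit=1.96):
    """Flag drift if the AR(1)-adjusted slope t-statistic exceeds z_crit."""
    b, se, resid = ols_trend(y)
    # Clamp to [0, 0.99]: never shrink the naive error, never divide by zero
    phi = min(max(lag1_autocorr(resid), 0.0), 0.99)
    se_adj = se * math.sqrt((1.0 + phi) / (1.0 - phi))
    return {
        "slope": b,
        "phi": phi,
        "t_naive": b / se,     # ignores autocorrelation
        "t_ar1": b / se_adj,   # more conservative
        "drift": abs(b / se_adj) > z_crit,
    }


def simulate_ar1_series(n, slope, phi, seed):
    """Hypothetical test data: linear drift plus AR(1) noise with unit innovations."""
    rng = random.Random(seed)
    u, out = 0.0, []
    for t in range(n):
        u = phi * u + rng.gauss(0.0, 1.0)
        out.append(slope * t + u)
    return out


result = flag_drift(simulate_ar1_series(500, 0.0, 0.8, seed=4))
print(result)
```

Because the adjusted standard error is never smaller than the naive one, `t_ar1` is never larger in magnitude than `t_naive`. The adjusted test therefore flags drift less often than the plain linear fit on the same series.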
Some things that would need a more complex model: