Drifts 23: simple linear model for drift detection

DavorJ commented 4 years ago

The simplest model for drift detection that I can make is the following (e.g. BAOL031X_A7250):

Let us decompose it in the following graphs:

The top graph is just the difference plot with KNMI data. I'll use that in the future for convenience. (Selection of reference barometers + altitude compensation + theory that it is actually better than 1 reference barometer is for later.)
The middle left plot shows the best fit horizontal line model.
The middle right plot shows the best fit horizontal line + yearly seasonality model.
The bottom left plot is the best fit horizontal line model, but in this case we also allow the line to drift after a certain point in time. The vertical blue dotted line emphasizes this "most likely" moment.
The bottom right plot is what we are interested in. It is a best fit horizontal line + yearly seasonality model which is allowed to drift after a certain "most likely" moment.
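
For anyone who wants to experiment with the idea, here is a minimal Python sketch of the bottom right model: a constant level plus yearly seasonality, with a linear drift that starts at one change point. It is not the code behind the plots above. The function names (`design_matrix`, `fit_drift`), the 365.25-day period, the `min_points` margin and the synthetic data are all assumptions made for illustration. The change point is picked by scanning candidates and keeping the one with the smallest residual sum of squares, which matches the "most likely" moment only if the errors are independent and Gaussian with constant variance.

```python
import numpy as np

def design_matrix(t_days, t_change=None):
    """Columns: intercept, yearly sine, yearly cosine, optional drift ramp."""
    omega = 2.0 * np.pi / 365.25  # assumed yearly period, in days
    cols = [np.ones_like(t_days), np.sin(omega * t_days), np.cos(omega * t_days)]
    if t_change is not None:
        # Ramp is zero before the change point and grows linearly after it.
        cols.append(np.maximum(t_days - t_change, 0.0))
    return np.column_stack(cols)

def fit_drift(t_days, diff, min_points=30):
    """Scan candidate change points and keep the least-squares best fit."""
    best = None
    # Skip the first and last min_points samples so the ramp is identifiable.
    for t_change in t_days[min_points:-min_points]:
        X = design_matrix(t_days, t_change)
        beta, *_ = np.linalg.lstsq(X, diff, rcond=None)
        rss = float(np.sum((diff - X @ beta) ** 2))
        if best is None or rss < best["rss"]:
            best = {"t_change": float(t_change), "beta": beta, "rss": rss}
    return best

# Synthetic difference series (made up, NOT real barometer data):
# level 2.0, seasonal amplitude 1.5, drift of 0.01 per day after day 900.
rng = np.random.default_rng(0)
t = np.arange(0.0, 1500.0)
y = (2.0
     + 1.5 * np.sin(2.0 * np.pi * t / 365.25)
     + 0.01 * np.maximum(t - 900.0, 0.0)
     + rng.normal(0.0, 0.5, t.size))

fit = fit_drift(t, y)
print("estimated change point (day):", fit["t_change"])
print("estimated drift per day:", fit["beta"][3])
```

Dropping the ramp column (`t_change=None`) gives the middle right model, so the two fits can be compared on the same series.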

There are a couple of limitations with this "simple" approach.

Changing variance over time (which would detect hysteresis) cannot be modelled.
Increasing effect of seasonality also cannot be modelled (as far as I can tell now). E.g. BAOL004X has non-constant seasonality:
The statistical significance calculation is potentially very biased and thus untrustworthy in this model because of high correlation in the timeseries.
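
To give a feel for the last point, the sketch below estimates the lag-1 autocorrelation of a residual series and converts it into a rough effective sample size. This is an illustration only, not something used in the model above. It assumes the residuals behave like an AR(1) process, for which the common approximation is n_eff = n * (1 - rho) / (1 + rho). The helper names and the simulated series are hypothetical.

```python
import numpy as np

def lag1_autocorr(resid):
    """Sample lag-1 autocorrelation of a residual series."""
    r = resid - resid.mean()
    return float(np.sum(r[1:] * r[:-1]) / np.sum(r * r))

def effective_sample_size(resid):
    """Rough effective sample size under an assumed AR(1) structure."""
    rho = lag1_autocorr(resid)
    return resid.size * (1.0 - rho) / (1.0 + rho)

# Simulated AR(1) residuals with coefficient 0.9 (made up, not real data).
rng = np.random.default_rng(1)
n, phi = 2000, 0.9
noise = rng.normal(size=n)
resid = np.empty(n)
resid[0] = noise[0]
for i in range(1, n):
    resid[i] = phi * resid[i - 1] + noise[i]

rho = lag1_autocorr(resid)
n_eff = effective_sample_size(resid)
print("lag-1 autocorrelation:", rho)
print("nominal n:", n, "effective n:", n_eff)
```

With strongly correlated residuals the effective sample size is a small fraction of the nominal one, so a significance test that treats every observation as independent will be far too optimistic.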

Can these issues be taken into consideration with a more complex model? Yes, but at this point I am not sure in which way to continue in terms of "time invested vs. value generated". There are a couple of modeling possibilities...

But the biggest problem is determining when the drift is significant. The blue vertical dotted line is only drawn in case of significance. (Currently very quick and dirty, but it at least gives an idea.) And as you can see in this overview, many of the barometers have some drift according to this model.

I wonder what you think @fredericpiesschaert, @mathiaswackenier and Piet from a user/business perspective? Any value in this?

fredericpiesschaert commented 4 years ago

Taking into account seasonal variance, it doesn't seem very useful to determine drift when the timeseries is shorter than one or even two years. That would eliminate these 'drift cases':

And, as I've said before, timeseries should be validated before presenting them to the drift function. We have to get rid of outliers and other anomalies, they really blur the picture:

I have to take a closer look at the examples, but it looks promising to me.

DavorJ commented 4 years ago

@fredericpiesschaert, concentrating on only series of more than 2 years is an option, but seems arbitrary to me. See #64: seems to work much better for short series due to a more complex model.

And yes, only validated data should be supplied to the function. Another option would be to use the _detectoutliers() filter beforehand, but that one would potentially remove drift information, so it isn't an option.

fredericpiesschaert commented 4 years ago

@DavorJ 2 years is arbitrary indeed. I only meant that there are fewer drift cases than the model suggests and that it takes some common sense from the user to evaluate the model's suggestions.

fredericpiesschaert commented 4 years ago

I find these graphs very interesting. Take a look at this one. There is no way you would suspect drift when looking at the original series, yet there seems to be something going on from the beginning. Does that mean you have to throw away the entire series? I would think not, but from what point on does drift become a problem? Not an easy one. It will probably be a user decision?

DOV-Vlaanderen / groundwater-logger-validation

Drifts 23: simple linear model for drift detection #63