DOV-Vlaanderen / groundwater-logger-validation

Analysis on validation methods for groundwater logger data
MIT License

Drifts 32/33: detect_drift() v01 #67

Open DavorJ opened 3 years ago

DavorJ commented 3 years ago

Here is an overview of diagnostic plots (PDF) of the first (v01) detect_drift() implementation. The KNMI data from #56 is used as reference.

There are 3 colors for drift based on the p-value: green, orange and red.

The diagnostic plots are the following:

[image: example diagnostic plots (BAOL001X_D0939)]

Here is a csv file that contains all the p-values for reference. About 20% of the cases are colored red.
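
For illustration only, here is a minimal sketch of how such a p-value-to-color mapping could look; the cut-offs below are assumptions, not the thresholds actually used by detect_drift():

```r
# Hypothetical mapping of drift p-values to the three report colors.
# The cut-offs below are assumptions for illustration only.
color_for_p <- function(p) {
  cut(p,
      breaks = c(0, 0.01, 0.05, 1),
      labels = c("red", "orange", "green"),
      include.lowest = TRUE)
}

color_for_p(c(0.0005, 0.03, 0.40))
#> [1] red    orange green
#> Levels: red orange green
```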

Comparison with older analysis #65 can be found here: part 1 and part 2.

One of my current questions is what to do when there is no reference data to assess drift against. For example, in the case of BAOL001X_D0939 above, data prior to 2010 is not taken into consideration because this data is missing in the KNMI series. Missing old data is usually not a problem, but missing recent data is (e.g. BAOL_524X, where KNMI is missing the 2020 data). Should this result in an error from detect_drift() or a warning? I tend to prefer the former.
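
A minimal sketch of what such a guard could look like (function name, logic and messages are assumptions, not the actual detect_drift() internals):

```r
# Hypothetical guard illustrating the error-vs-warning choice: fail when the
# reference series does not cover the evaluated period at all, warn when it
# covers it only partially. Names and messages are assumptions.
check_reference_coverage <- function(timestamps, ref_timestamps) {
  covered <- timestamps >= min(ref_timestamps) & timestamps <= max(ref_timestamps)
  if (!any(covered)) {
    stop("No reference data available for the evaluated period.")
  }
  if (!all(covered)) {
    warning(sprintf("Only %.0f%% of the evaluated period is covered by the reference series.",
                    100 * mean(covered)))
  }
  invisible(covered)
}
```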

DavorJ commented 3 years ago

Here is also a comparison between AR(1) = 0.9 (left) and AR(1) = 0.85 (right). You can see that the AR(1) = 0.85 model is more sensitive to drift and seasonality patterns, which allows for some "tuning". For example:

[image: AR(1) = 0.9 (left) vs AR(1) = 0.85 (right) comparison]

Which one do you prefer? Note that green cases will never be reported as drifting by default, while orange and red will be reported.
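
As a toy illustration of why the lower coefficient is more sensitive (this is not the model code, just the long-run variability implied by the two AR(1) settings): with phi = 0.9 the null model allows considerably more slow wandering, so drift and seasonality patterns have to be larger before they stand out.

```r
# Toy calculation, not the repo's model: square root of the long-run variance
# sigma^2 / (1 - phi)^2 of an AR(1) process with unit innovation sd.
long_run_sd <- function(phi, sigma = 1) sigma / (1 - phi)
c(phi_0.90 = long_run_sd(0.90), phi_0.85 = long_run_sd(0.85))
#>  phi_0.90  phi_0.85
#> 10.000000  6.666667
```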

fredericpiesschaert commented 3 years ago

I prefer the 'safe' way and would want to see this reported (hence my preference for the 0.85 model).

fredericpiesschaert commented 3 years ago

Compare BAOL822 and BAOL823. These baros are very close to each other in the field and have very similar time series, but seasonality is only captured in 823. What could be the cause? Will the 0.85 model detect seasonality in 822?

[image: BAOL822 and BAOL823 diagnostic plots]

fredericpiesschaert commented 3 years ago

Adding the date of the breakpoint to the output would be very helpful.

fredericpiesschaert commented 3 years ago

Interesting: modeled breakpoints seem to be more conservative than 'visual' breakpoints, e.g. BAOL004X with visual detection at 30/01/2015 and model detection somewhere at the end of 2014. The visual validation is clearly based on the temperature peak. This could be OK. On the other hand, the membrane might already become unstable in the period before it finally 'crashes', which would be what the model detects.

[images: BAOL004X model detection and visual validation]

fredericpiesschaert commented 3 years ago

'What to do in case there is no reference data for assessment of drift' --> an error from detect_drift() is probably the safest thing to do. We also still need to decide between the reference-based and multi-comparison modelling. In the latter case the problem shifts from 'no reference data' to 'there are only data from the series that needs to be evaluated'.

fredericpiesschaert commented 3 years ago

The difference with a set of reference data is really crucial for validating the baro series. Apart from shift detection, it allows early problem detection, e.g. in the example below. We should consider consolidating these differences in the database and using them in the validation screens.

[image: early problem detection in the difference with the reference series]
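
A minimal sketch, under assumed column names, of computing such a difference series by aligning a baro series with the reference on timestamp:

```r
# Hypothetical sketch: 'baro' and 'reference' are assumed to be data frames
# with columns 'timestamp' and 'pressure'; the difference is what would be
# consolidated and shown in the validation screens.
diff_with_reference <- function(baro, reference) {
  m <- merge(baro, reference, by = "timestamp", suffixes = c("", ".ref"))
  m$diff <- m$pressure - m$pressure.ref
  m[, c("timestamp", "diff")]
}

# d <- diff_with_reference(baro, knmi)
# plot(d$timestamp, d$diff, type = "l", ylab = "baro - reference")
```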

fredericpiesschaert commented 3 years ago

These are really nice examples of early drift detection. I don't think they would be noticed by just looking at the time series.

[images: two examples of early drift detection]

fredericpiesschaert commented 3 years ago

The axis of the right plots is now in days; perhaps months would be better here? I catch myself converting to months each time.

fredericpiesschaert commented 3 years ago

Is it possible to place the labels of the y-axis horizontally in the frequency plot? There is a lot of overlap now in some cases (mainly the BAOL5***-series).

[image: frequency plot with overlapping vertical y-axis labels]

DavorJ commented 3 years ago

Compare BAOL822 and BAOL823. These baros are very close to each other in the field and have very similar time series, but seasonality is only captured in 823. What could be the cause? Will the 0.85 model detect seasonality in 822?

The seasonality effect in BAOL822 isn't as strong as in BAOL823. You can see this in the DTFT plot, where the peak intensity is 0.1 vs 0.45. Here it is with seasonality:

[image: BAOL822 with the seasonal component included]

So why was it not detected? It didn't pass the significance test of 1/100 (i.e. 1 out of 100 is expected to be wrongly detected). Its significance is 0.019 (for the 0.85 model), probably due to the discrepancy at the start. Another option would be to skip the significance testing and always assume a seasonal component; sometimes the effect will be very small and hardly visible anyway.
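
For what it's worth, something in the spirit of that DTFT panel can be eyeballed with a raw periodogram; a sketch under assumptions ('diff_series' is a placeholder for the daily differences, and this is not the code behind the diagnostic plot):

```r
# Raw periodogram of a daily series, with frequency expressed in cycles per
# year so the yearly component sits at frequency 1.
sp <- spec.pgram(ts(diff_series, frequency = 365), taper = 0, plot = FALSE)
plot(sp$freq, sp$spec, type = "h",
     xlab = "frequency (cycles per year)", ylab = "spectral power")
abline(v = 1, lty = 2)  # the yearly frequency where BAOL823 shows a clear peak
```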

DavorJ commented 3 years ago

Adding the date of the breakpoint to the output would be very helpful.

I'll add it to the diagnostic plot too.

The date of the breakpoint is always added as an attribute to the output, together with other information, when detect_drift(verbose = TRUE) is used. See here for the full output, for example BAOL823X_P2_15607.attribs.
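
A hypothetical illustration of reading it back (the argument and attribute names below are assumptions; inspect attributes() on the result for the real names):

```r
# Run with verbose output and look at what is attached to the result.
res <- detect_drift(x, timestamps, reference = reference_series, verbose = TRUE)

str(attributes(res))      # lists everything attached to the result
attr(res, "breakpoint")   # e.g. the estimated breakpoint date (name assumed)
```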

DavorJ commented 3 years ago

'What to do in case there is no reference data for assessment of drift' --> an error from detect_drift() is probably the safest thing to do. We also still need to decide between the reference-based and multi-comparison modelling. In the latter case the problem shifts from 'no reference data' to 'there are only data from the series that needs to be evaluated'.

Noted. A single reference is currently the simplest approach, which is why it is used. A third option could be to auto-fetch some series from an online source; I think @mathiaswackenier shared a good source for this during our last discussion.

DavorJ commented 3 years ago

Is it possible to place the labels of the y-axis horizontally in the frequency plot? There is a lot of overlap now in some cases (mainly the BAOL5***-series).

I'll look into that. The reason the labels are vertical is quick-and-dirty (Q&D): that way the x-axes of the 3 stacked plots match without any extra adjustment. If the labels are horizontal, the left margin will increase depending on the width of the numbers (e.g. 1, 1000, 1000000), compressing the x-axis of the plot.
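
A small base-graphics sketch of the trade-off (assumed setup, not the repo's plotting code): horizontal labels (las = 1) need a wider, fixed left margin so the x-axes of the stacked panels still line up.

```r
# Three stacked panels with horizontal y-axis labels and a fixed, generous
# left margin, so the x-axes stay aligned regardless of label width.
old <- par(mfrow = c(3, 1), mar = c(4, 7, 1, 1), las = 1)

x <- 1:100
plot(x, x * 1,   type = "l", xlab = "",     ylab = "")
plot(x, x * 1e3, type = "l", xlab = "",     ylab = "")
plot(x, x * 1e6, type = "l", xlab = "days", ylab = "")

par(old)
```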

DavorJ commented 3 years ago

To this comparison I have added AR(1) = 0.85 with a fixed seasonal component as the third plot. Here are the reports: part 1 and part 2. These reports are an extension of this answer.

[image: AR(1) = 0.9, AR(1) = 0.85, and AR(1) = 0.85 with fixed seasonality]

There is hardly any difference compared with AR(1) = 0.85 where seasonality is based on a significance test: drifts that were significant before still are.

So as far as I am concerned, we can keep it simple and always assume yearly seasonality, and let the model choose the best-fit parameters. I think this is also more consistent for the user: he/she will always see the best-fit seasonality pattern. What do you think @fredericpiesschaert?
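
To make the idea concrete, here is a minimal sketch of a fixed yearly harmonic whose amplitude and phase are left to the fit (plain OLS for brevity, whereas the actual model uses AR(1) errors; 'y' and 'dates' are placeholders):

```r
# Fixed 1-year harmonic: the sin/cos coefficients determine amplitude and
# phase and can simply come out near zero when there is no seasonal signal.
t   <- as.numeric(dates - min(dates))    # days since the start of the series
ang <- 2 * pi * t / 365.25               # one cycle per year
fit <- lm(y ~ t + sin(ang) + cos(ang))   # drift slope + yearly harmonic
summary(fit)
```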

fredericpiesschaert commented 3 years ago

I agree

DavorJ commented 3 years ago

Latest version with:

Here is the full output. It is based on the new (f6a2173) Westdorpe KNMI reference series.