DOV-Vlaanderen / groundwater-logger-validation

Analysis on validation methods for groundwater logger data
MIT License

Drifts 32/33: detect_drift() v01 #67

Open DavorJ opened 3 years ago

DavorJ commented 3 years ago

Here is an overview of diagnostic plots (PDF) of the first (v01) detect_drift() implementation. The KNMI data from #56 is used as reference.

There are 3 colors for drift based on the p-value: green, orange and red.

The diagnostic plots are the following:

[image: example diagnostic plots (BAOL001X_D0939)]

Here is a csv file that contains all the p-values for reference. About 20% of the cases are colored red.
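
For illustration only, here is a minimal sketch of how such a p-value-to-color mapping could look; the cut-offs below are assumptions, not the thresholds actually used by detect_drift():

```r
# Hypothetical mapping of drift p-values to the three report colors.
# The cut-offs below are assumptions for illustration only.
color_for_p <- function(p) {
  cut(p,
      breaks = c(0, 0.01, 0.05, 1),
      labels = c("red", "orange", "green"),
      include.lowest = TRUE)
}

color_for_p(c(0.0005, 0.03, 0.40))
#> [1] red    orange green
#> Levels: red orange green
```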

Comparison with older analysis #65 can be found here: part 1 and part 2.

One of my current questions is what to do when there is no reference data to assess drift against. For example, in the case of BAOL001X_D0939 above, data prior to 2010 is not taken into consideration because this data is missing in the KNMI series. Missing old data is usually not a problem, but missing recent data is (e.g. BAOL_524X, where KNMI is missing the 2020 data). Should this result in an error from detect_drift() or a warning? I tend to prefer the former.
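
A minimal sketch of what such a guard could look like (function name, logic and messages are assumptions, not the actual detect_drift() internals):

```r
# Hypothetical guard illustrating the error-vs-warning choice: fail when the
# reference series does not cover the evaluated period at all, warn when it
# covers it only partially. Names and messages are assumptions.
check_reference_coverage <- function(timestamps, ref_timestamps) {
  covered <- timestamps >= min(ref_timestamps) & timestamps <= max(ref_timestamps)
  if (!any(covered)) {
    stop("No reference data available for the evaluated period.")
  }
  if (!all(covered)) {
    warning(sprintf("Only %.0f%% of the evaluated period is covered by the reference series.",
                    100 * mean(covered)))
  }
  invisible(covered)
}
```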

DavorJ commented 3 years ago

Here is also a comparison between AR(1) = 0.9 (left) and AR(1) = 0.85 (right). You can see that the AR(1) = 0.85 model is more sensitive to drift and seasonality patterns, which allows for some "tuning". For example:

[image: AR(1) = 0.9 (left) vs AR(1) = 0.85 (right) comparison]

Which one do you prefer? Note that green cases will never be reported as drifting by default, while orange and red will be reported.
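
As a toy illustration of why the lower coefficient is more sensitive (this is not the model code, just the long-run variability implied by the two AR(1) settings): with phi = 0.9 the null model allows considerably more slow wandering, so drift and seasonality patterns have to be larger before they stand out.

```r
# Toy calculation, not the repo's model: square root of the long-run variance
# sigma^2 / (1 - phi)^2 of an AR(1) process with unit innovation sd.
long_run_sd <- function(phi, sigma = 1) sigma / (1 - phi)
c(phi_0.90 = long_run_sd(0.90), phi_0.85 = long_run_sd(0.85))
#>  phi_0.90  phi_0.85
#> 10.000000  6.666667
```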

fredericpiesschaert commented 3 years ago

I prefer the 'safe' way and would want to see this reported (hence my preference for the 0.85 model).

fredericpiesschaert commented 3 years ago

Compare BAOL822 and BAOL823. These baros are very close to each other in the field and have very similar time series, but seasonality is only captured in 823. What could be the cause? Will the 0.85 model detect seasonality in 822?

[image: BAOL822 and BAOL823 diagnostic plots]

fredericpiesschaert commented 3 years ago

Adding the date of the breakpoint to the output would be very helpful.

fredericpiesschaert commented 3 years ago

Interesting: modeled breakpoints seem to be more conservative than 'visual' breakpoints, e.g. BAOL004X with visual detection at 30/01/2015 and model detection somewhere at the end of 2014. The visual validation is clearly based on the temperature peak. This could be OK. On the other hand, the membrane might already become unstable in the period before it finally 'crashes', which would be what the model detects.

[images: BAOL004X model detection and visual validation]

fredericpiesschaert commented 3 years ago

'What to do in case there is no reference data for assessment of drift' --> an error from detect_drift() is probably the safest thing to do. We also still need to decide between the reference-based and multi-comparison modelling. In the latter case the problem shifts from 'no reference data' to 'there are only data from the series that needs to be evaluated'.

fredericpiesschaert commented 3 years ago

The difference with a set of reference data is really crucial for validating the baro series. Apart from shift detection, it allows early problem detection, e.g. in the example below. We should consider consolidating these differences in the database and using them in the validation screens.

[image: early problem detection in the difference with the reference series]
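
A minimal sketch, under assumed column names, of computing such a difference series by aligning a baro series with the reference on timestamp:

```r
# Hypothetical sketch: 'baro' and 'reference' are assumed to be data frames
# with columns 'timestamp' and 'pressure'; the difference is what would be
# consolidated and shown in the validation screens.
diff_with_reference <- function(baro, reference) {
  m <- merge(baro, reference, by = "timestamp", suffixes = c("", ".ref"))
  m$diff <- m$pressure - m$pressure.ref
  m[, c("timestamp", "diff")]
}

# d <- diff_with_reference(baro, knmi)
# plot(d$timestamp, d$diff, type = "l", ylab = "baro - reference")
```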

fredericpiesschaert commented 3 years ago

These are really nice examples of early drift detection. I don't think they would be noticed by just looking at the time series.

[images: two examples of early drift detection]

fredericpiesschaert commented 3 years ago

The axis of the right plots is now in days; perhaps months would be better here? I catch myself converting to months each time.

fredericpiesschaert commented 3 years ago

Is it possible to place the labels of the y-axis horizontally in the frequency plot? There is a lot of overlap now in some cases (mainly the BAOL5***-series).

[image: frequency plot with overlapping vertical y-axis labels]

DavorJ commented 3 years ago

Compare BAOL822 and BAOL823. These baros are very close to each other in the field and have very similar time series, but seasonality is only captured in 823. What could be the cause? Will the 0.85 model detect seasonality in 822?

The seasonality effect in BAOL822 isn't as strong as in BAOL823. You can see this in the DTFT plot, where the peak intensity is 0.1 vs 0.45. Here it is with seasonality:

[image: BAOL822 with the seasonal component included]

So why was it not detected? It didn't pass the significance test of 1/100 (i.e. 1 out of 100 is expected to be wrongly detected). Its significance is 0.019 (for the 0.85 model), probably due to the discrepancy at the start. Another option would be to skip the significance testing and always assume a seasonal component; sometimes the effect will be very small and hardly visible anyway.
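
For what it's worth, something in the spirit of that DTFT panel can be eyeballed with a raw periodogram; a sketch under assumptions ('diff_series' is a placeholder for the daily differences, and this is not the code behind the diagnostic plot):

```r
# Raw periodogram of a daily series, with frequency expressed in cycles per
# year so the yearly component sits at frequency 1.
sp <- spec.pgram(ts(diff_series, frequency = 365), taper = 0, plot = FALSE)
plot(sp$freq, sp$spec, type = "h",
     xlab = "frequency (cycles per year)", ylab = "spectral power")
abline(v = 1, lty = 2)  # the yearly frequency where BAOL823 shows a clear peak
```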

DavorJ commented 3 years ago

Adding the date of the breakpoint to the output would be very helpful.

I'll add it to the diagnostic plot too.

The date of the breakpoint is always added as an attribute to the output, together with other information, when detect_drift(verbose = TRUE) is used. See here for the full output, for example BAOL823X_P2_15607.attribs.
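
A hypothetical illustration of reading it back (the argument and attribute names below are assumptions; inspect attributes() on the result for the real names):

```r
# Run with verbose output and look at what is attached to the result.
res <- detect_drift(x, timestamps, reference = reference_series, verbose = TRUE)

str(attributes(res))      # lists everything attached to the result
attr(res, "breakpoint")   # e.g. the estimated breakpoint date (name assumed)
```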

DavorJ commented 3 years ago

'What to do in case there is no reference data for assessment of drift' --> an error from detect_drift() is probably the safest thing to do. We also still need to decide between the reference-based and multi-comparison modelling. In the latter case the problem shifts from 'no reference data' to 'there are only data from the series that needs to be evaluated'.

Noted. A single reference is currently the simplest approach, which is why it is used. A third option could be to auto-fetch some series from an online source; I think @mathiaswackenier shared a good source for this during our last discussion.

DavorJ commented 3 years ago

Is it possible to place the labels of the y-axis horizontally in the frequency plot? There is a lot of overlap now in some cases (mainly the BAOL5***-series).

I'll look into that. The reason the labels are vertical is quick-and-dirty (Q&D): that way the x-axes of the 3 stacked plots match without any extra adjustment. If the labels are horizontal, the left margin will increase depending on the width of the numbers (e.g. 1, 1000, 1000000), compressing the x-axis of the plot.
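
A small base-graphics sketch of the trade-off (assumed setup, not the repo's plotting code): horizontal labels (las = 1) need a wider, fixed left margin so the x-axes of the stacked panels still line up.

```r
# Three stacked panels with horizontal y-axis labels and a fixed, generous
# left margin, so the x-axes stay aligned regardless of label width.
old <- par(mfrow = c(3, 1), mar = c(4, 7, 1, 1), las = 1)

x <- 1:100
plot(x, x * 1,   type = "l", xlab = "",     ylab = "")
plot(x, x * 1e3, type = "l", xlab = "",     ylab = "")
plot(x, x * 1e6, type = "l", xlab = "days", ylab = "")

par(old)
```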

DavorJ commented 3 years ago

To this comparison I have added AR(1) = 0.85 with a fixed seasonal component as the third plot. Here are the reports: part 1 and part 2. These reports are an extension of this answer.

[image: AR(1) = 0.9, AR(1) = 0.85, and AR(1) = 0.85 with fixed seasonality]

There is hardly any difference compared with AR(1) = 0.85 where seasonality is based on a significance test: drifts that were significant before still are.

So as far as I am concerned, we can keep it simple and always assume yearly seasonality, and let the model choose the best-fit parameters. I think this is also more consistent for the user: he/she will always see the best-fit seasonality pattern. What do you think @fredericpiesschaert?
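
To make the idea concrete, here is a minimal sketch of a fixed yearly harmonic whose amplitude and phase are left to the fit (plain OLS for brevity, whereas the actual model uses AR(1) errors; 'y' and 'dates' are placeholders):

```r
# Fixed 1-year harmonic: the sin/cos coefficients determine amplitude and
# phase and can simply come out near zero when there is no seasonal signal.
t   <- as.numeric(dates - min(dates))    # days since the start of the series
ang <- 2 * pi * t / 365.25               # one cycle per year
fit <- lm(y ~ t + sin(ang) + cos(ang))   # drift slope + yearly harmonic
summary(fit)
```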

fredericpiesschaert commented 3 years ago

I agree

DavorJ commented 3 years ago

Latest version with:

Here is the full output. It is based on the new (f6a2173) Westdorpe KNMI reference series.