DOV-Vlaanderen / groundwater-logger-validation

Analysis on validation methods for groundwater logger data
MIT License

outliers_dmst_cde validation #7

Closed DavorJ closed 5 years ago

DavorJ commented 5 years ago

We wanted to evaluate the DRME_DMST_CDE field taken from the database.

Here are the results.

The DEL and VLD categories seem to mark the duplicates in the data:

image image

The "DUPES=..." annotation is calculated from duplicate timestamps. It matches the DEL and VLD flags perfectly.
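A minimal sketch of how such a duplicate count could be derived and compared per status code. The record layout (`(timestamp, status_code)` pairs), the function name `duplicate_counts_by_status`, the use of `None` for "no status code" and the sample values are all assumptions for illustration; the actual DUPES calculation may define duplicates differently.

```python
from collections import Counter


def duplicate_counts_by_status(records):
    """Count records that share their timestamp with another record,
    grouped by status code.

    `records` is a list of (timestamp, status_code) pairs. This layout is
    hypothetical and not taken from the database schema.
    """
    # Number of occurrences of every non-missing timestamp.
    occurrences = Counter(ts for ts, _ in records if ts is not None)

    dupes = Counter()
    for ts, status in records:
        if ts is not None and occurrences[ts] > 1:
            dupes[status] += 1
    return dupes


# Made-up sample data: two duplicated timestamps, flagged VLD/DEL.
records = [
    ("2015-01-01 00:00", "VLD"),
    ("2015-01-01 00:00", "DEL"),
    ("2015-01-01 01:00", None),
    ("2015-01-01 02:00", "VLD"),
    ("2015-01-01 02:00", "DEL"),
    ("2015-01-01 03:00", "INV"),
]

dupes = duplicate_counts_by_status(records)
print(dict(dupes))
```

If the per-code counts equal the total number of DEL and VLD records in the series, the flags and the duplicate timestamps agree for that series.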

The INV category seems to mark wrong data, but it is applied inconsistently. Here are some examples:

image Why is the black point between the red points not flagged? And the first two red points do not seem to be outliers, given the others.

image Clearly outliers, but not flagged.

image What is wrong with these?

image Inconsistent.

image The second spike = OK, but the first?

image Inconsistent.

image Inconsistent.

image Not flagged...

To be discussed: the value of this field for validation.
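To put a number on the inconsistencies above, the INV flags could be cross-tabulated against the output of whatever outlier detector is being evaluated. The sketch below only does the counting: `compare_flags`, the boolean `detected` list and the sample values are hypothetical, and no particular detection method is implied.

```python
def compare_flags(status_codes, detected, invalid_code="INV"):
    """Cross-tabulate database flags against a detector's outlier flags.

    `status_codes` holds one status code (or None) per record and
    `detected` holds one boolean per record. Both inputs are assumptions
    made for this example.
    """
    table = {"both": 0, "only_database": 0, "only_detector": 0, "neither": 0}
    for status, is_outlier in zip(status_codes, detected):
        flagged = status == invalid_code
        if flagged and is_outlier:
            table["both"] += 1
        elif flagged:
            table["only_database"] += 1
        elif is_outlier:
            table["only_detector"] += 1
        else:
            table["neither"] += 1
    return table


# Made-up sample: one agreement, one point flagged only in the database,
# one point found only by the detector.
status_codes = ["INV", None, "INV", None, None]
detected = [True, True, False, False, False]

table = compare_flags(status_codes, detected)
print(table)
```

Large `only_database` or `only_detector` counts would point to the kind of disagreement shown in the plots, without saying which side is wrong.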

DavorJ commented 5 years ago

I also just checked BAOL086X_179793:

image INV=896, but these points are not shown, because the timestamps of these records are missing. Plotting the records sequentially shows them:

image Here also:

image image All these red dots have no timestamp.

But this does not always hold, as in the case of BAOL086X_179793: there, some black points have no timestamp either!

To be discussed: INV is used with multiple meanings (missing timestamp and suspicious value), and in both cases it is applied inconsistently.
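The missing-timestamp check can be sketched as follows: count, per status code, how many records lack a timestamp, and fall back to the record index for plotting. The `(timestamp, value, status_code)` layout, the function names, the use of `None` for unflagged records and the sample values are assumptions for illustration only.

```python
def missing_timestamp_counts(records):
    """Per status code, count records with and without a timestamp.

    `records` is a list of (timestamp, value, status_code) tuples; this
    layout is hypothetical.
    """
    counts = {}
    for ts, _value, status in records:
        bucket = counts.setdefault(status, {"missing": 0, "present": 0})
        bucket["missing" if ts is None else "present"] += 1
    return counts


def sequential_points(records):
    """Return (index, value, status_code) tuples so that records without
    a timestamp can still be plotted in storage order."""
    return [(i, value, status) for i, (_ts, value, status) in enumerate(records)]


# Made-up sample: two INV records and one unflagged record lack a timestamp.
records = [
    ("2015-01-01 00:00", 1012.3, None),
    (None, 1250.0, "INV"),
    (None, 1249.0, "INV"),
    (None, 1011.9, None),
    ("2015-01-01 01:00", 1012.1, None),
]

counts = missing_timestamp_counts(records)
points = sequential_points(records)
print(counts)
print(points[1])
```

A non-zero "missing" count for unflagged records, or a non-zero "present" count for INV, would show that INV does not map cleanly onto "no timestamp".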

fredericpiesschaert commented 5 years ago

OK, so the status code as it stands seems of little value for validating the outlier procedure. We have two options:

this issue demonstrates the need for a controlled vocabulary

fredericpiesschaert commented 5 years ago

I think users also keep separate lists of 'suspicious data': (parts of) timeseries that should not be used for compensation, without being explicitly marked as invalid in the database. Is that correct @mathiaswackenier ?

mathiaswackenier commented 5 years ago

> I think users also keep separate lists of 'suspicious data': (parts of) timeseries that should not be used for compensation, without being explicitly marked as invalid in the database. Is that correct @mathiaswackenier ?

We do keep these lists, but they are not written down. The WATINA application forces the user to visually check the timeseries, and by doing so we can easily detect suspicious data. There is also a second way we detect suspicious data: during compensation and calibration. Outliers or suspicious data in the timeseries of the barometric sensors cause errors in the timeseries that are easy to detect visually.

In short, the lists we have are not written down anywhere, but we are aware of which timeseries are unreliable.

fredericpiesschaert commented 5 years ago

I think the conclusion is that DRME_DMST_CDE is not useful for validating the algorithms.