DOV-Vlaanderen / groundwater-logger-validation

Analysis on validation methods for groundwater logger data
MIT License

Timestamps #15

Closed DavorJ closed 5 years ago

DavorJ commented 5 years ago

Overview of percentage of data without a timestamp:

image

Timestamps are required for most time-series models (including ARIMA).

If there are no timestamps, we could assume that the points are equally spaced in time and stored in the DB in chronological order, but this is dangerous, as the data that do have timestamps show.

For example, the data in BAOL093X_78679.csv are not sequentially ordered after row 3024 (timestamp 2017-11-22 10:00:00).

image

For some reason, this time series also shows high hysteresis after that time, but no drift? All points are equally spaced in time (12 h apart).
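
A minimal sketch of how the ordering check could be automated, assuming the timestamps have already been parsed into `datetime` objects in stored row order. The function name `first_unordered_row` and the four sample rows are made up for illustration; they are not taken from BAOL093X_78679.csv.

```python
from datetime import datetime


def first_unordered_row(timestamps):
    """Return the 1-based row number of the first timestamp that is earlier
    than the one stored before it, or None if the order is non-decreasing."""
    for i in range(1, len(timestamps)):
        if timestamps[i] < timestamps[i - 1]:
            return i + 1
    return None


# Hypothetical sample: the fourth row jumps back in time.
fmt = "%Y-%m-%d %H:%M:%S"
rows = [
    "2017-11-21 10:00:00",
    "2017-11-21 22:00:00",
    "2017-11-22 10:00:00",
    "2017-11-21 22:00:00",
]
parsed = [datetime.strptime(r, fmt) for r in rows]
print(first_unordered_row(parsed))
```

A check like this only tells us where the stored order breaks; whether sorting by timestamp is then safe is a separate question.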

cc @fredericpiesschaert, @mathiaswackenier

DavorJ commented 5 years ago

Another assumption for ARIMA and most time-series models is that the points are equally spaced in time.

DYLS010X_B5259.csv has duplicate measurements at different timestamps:

image

What should we do with these cases?

mathiaswackenier commented 5 years ago

@DavorJ

@fredericpiesschaert could we generate a list with data loggers that have more than 2 measurements per diem so we can clean up the database?
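
A rough sketch of how such a list could be produced, assuming the records are available as `(logger_id, timestamp)` pairs. The function name `loggers_exceeding` and the logger ids in the example are hypothetical, and the real query would probably run against the database instead.

```python
from collections import Counter
from datetime import datetime


def loggers_exceeding(records, max_per_day=2):
    """Return the sorted ids of loggers that have more than `max_per_day`
    measurements on at least one calendar day."""
    per_day = Counter((logger, ts.date()) for logger, ts in records)
    return sorted({logger for (logger, _), n in per_day.items() if n > max_per_day})


# Hypothetical records: LOGGER_A has three measurements on the same day.
records = [
    ("LOGGER_A", datetime(2017, 11, 22, 0, 0)),
    ("LOGGER_A", datetime(2017, 11, 22, 10, 0)),
    ("LOGGER_A", datetime(2017, 11, 22, 22, 0)),
    ("LOGGER_B", datetime(2017, 11, 22, 10, 0)),
    ("LOGGER_B", datetime(2017, 11, 22, 22, 0)),
]
print(loggers_exceeding(records))
```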

DavorJ commented 5 years ago
  • What timestamp are you using? The occurrence date in UTC?

Indeed, DRME_OCR_UTC_DTE. Can we assume that DRME_OCR_UTC_DTE is always available? i.e. if it is partial or unavailable, then throw an error?

fredericpiesschaert commented 5 years ago

We can assume that it will be available. Nevertheless, proper error handling is necessary.
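
A minimal sketch of what that error handling could look like, assuming the rows arrive as dictionaries of strings (for example from `csv.DictReader`) and that the timestamps use the `YYYY-MM-DD HH:MM:SS` layout seen earlier in this thread. `TimestampError` and `parse_timestamps` are hypothetical names, not part of the existing code.

```python
from datetime import datetime


class TimestampError(ValueError):
    """Raised when a timestamp is missing, partial or malformed."""


def parse_timestamps(rows, column="DRME_OCR_UTC_DTE", fmt="%Y-%m-%d %H:%M:%S"):
    """Parse the timestamp column of every row, failing loudly on the first
    row whose value is missing or does not match the full format."""
    parsed = []
    for n, row in enumerate(rows, start=1):
        raw = row.get(column)
        if raw is None or raw.strip() in ("", "NA"):
            raise TimestampError(f"row {n}: {column} is missing")
        try:
            parsed.append(datetime.strptime(raw.strip(), fmt))
        except ValueError as exc:
            # A date without a time part, for example, ends up here.
            raise TimestampError(
                f"row {n}: {column} is partial or malformed: {raw!r}"
            ) from exc
    return parsed
```

Failing on the first bad row keeps the behaviour simple; collecting all bad rows before raising would be an alternative if a full report is more useful.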

DavorJ commented 5 years ago

While I am at it, I wanted to see whether the data comply with the assumption of equal spacing in time.

I did not consider NA timestamps here.

image

The graph represents the entropy of the distribution of time differences between observations. It is a one-number summary (I wasn't able to come up with anything better).

If it is 0, then all observations in the dataset are equally spaced (i.e. entropy is low, no disorder). The larger the number, the more disorder there is in the time distance between observations. (Disorder is not necessarily sequential.)
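
To make the summary concrete, here is a small sketch of one way to compute such a number. It assumes Shannon entropy with base-2 logarithms over the empirical distribution of time differences, with timestamps given as seconds in stored order; the exact definition behind the graph may differ. The function name `diff_entropy` and the sample values are made up.

```python
import math
from collections import Counter


def diff_entropy(seconds):
    """Shannon entropy (in bits) of the distribution of differences between
    consecutive timestamps, given as seconds in stored order."""
    if len(seconds) < 2:
        raise ValueError("need at least two timestamps")
    diffs = [b - a for a, b in zip(seconds, seconds[1:])]
    total = len(diffs)
    counts = Counter(diffs)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical series: perfectly regular 12 h spacing gives entropy 0.
regular = [0, 43200, 86400, 129600]
# Hypothetical series: spacing alternates between 11 h, 12 h and 13 h.
irregular = [0, 39600, 82800, 129600, 172800]
print(diff_entropy(regular), diff_entropy(irregular))
```

Because only the counts of each distinct difference matter, the number says nothing about where in the series the irregular steps occur.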

For example: BAOL068X_180816.csv image

The DRME_OCR_UTC_DTE_DIFF is in seconds and represents the time-difference between sequential points. The difference is sometimes 11, sometimes 12 and sometimes 13 hours.

Compared to BAOL028X_B5467.csv which has a much lower entropy: image

The outlier is due to a large period without data.

For a more detailed overview see here.

For example:

[1] "BAOL057X_78680.csv"
Frequency table:
    0 43200 49463 
  365  1474     1 

A time difference of 0 seconds occurs 365 times; these are just duplicates. A time difference of 43200 seconds (12 h) occurs 1474 times, and a time difference of 49463 seconds occurs once.
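
As a rough illustration of how much the duplicates weigh on the summary, the sketch below computes an entropy directly from the frequency table above, with and without the 0-second differences. It assumes Shannon entropy in bits, which may not be the exact definition used for the graph. The function name `entropy_from_counts` is hypothetical.

```python
import math


def entropy_from_counts(counts):
    """Shannon entropy (in bits) of a frequency table {value: count}."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values() if c > 0)


# Frequency table reported above for BAOL057X_78680.csv.
table = {0: 365, 43200: 1474, 49463: 1}

with_duplicates = entropy_from_counts(table)
# Dropping one row of each duplicate pair removes exactly the 0-second differences.
without_duplicates = entropy_from_counts({d: c for d, c in table.items() if d != 0})
print(with_duplicates, without_duplicates)
```

Under these assumptions, almost all of the entropy for this file comes from the duplicates, so deduplicating before computing the summary would give a fairer picture of the actual spacing.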

To be discussed

fredericpiesschaert commented 5 years ago

BAOL068X is an unfortunate example because the data were imported twice, each time with a different timestamp. This needs to be fixed in the source database.