Timestamps - Githubissues

DavorJ commented 5 years ago

Overview of percentage of data without a timestamp:

Timestamps are required for most time-series models (including ARIMA).

If there are no timestamps, we could assume that the points are equally spaced in time and stored sequentially in the DB, but this is dangerous, as data with timestamps show:

For example BAOL093X_78679.csv data is not sequentially ordered after row 3024 (timestamp 2017-11-22 10:00:00).

For some reason, this time-series also shows high hysteresis after that time, but no drift? All points are equally (12h) spaced in time.
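A minimal sketch of how such an ordering check could look, assuming the timestamps have already been parsed into `datetime` objects in storage order. The function name `first_out_of_order` and the sample values are hypothetical and are not taken from BAOL093X_78679.csv.

```python
from datetime import datetime


def first_out_of_order(timestamps):
    """Return the index of the first timestamp that is earlier than its
    predecessor, or None if the sequence never goes back in time."""
    for i in range(1, len(timestamps)):
        if timestamps[i] < timestamps[i - 1]:
            return i
    return None


# Hypothetical sample in storage order (illustration only).
sample = [
    datetime(2017, 11, 21, 22, 0),
    datetime(2017, 11, 22, 10, 0),
    datetime(2017, 11, 20, 10, 0),  # goes back in time
    datetime(2017, 11, 20, 22, 0),
]

print(first_out_of_order(sample))          # index of the first backward step
print(first_out_of_order(sorted(sample)))  # None for a sorted sequence
```

A check like this only tells where the order first breaks; it does not say whether sorting by timestamp is the right repair.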

cc @fredericpiesschaert, @mathiaswackenier

DavorJ commented 5 years ago

Another assumption for ARIMA and most time-series models is that the points are equally spaced in time.

DYLS010X_B5259.csv has duplicate measurements at different timestamps:

What should we do with these cases?

mathiaswackenier commented 5 years ago

@DavorJ

What timestamp are you using? The occurrence date in UTC? If so, it is normal that some data loggers have no timestamp, as those measurements haven't been re-imported yet to obtain a UTC timestamp.
For data loggers with duplicate measurements at different timestamps the solution is easy: this needs to be cleaned up in the database. Timestamps have been manually edited in the past to 00:00 and 12:00 in order to be able to compensate and calibrate the measurements.

@fredericpiesschaert could we generate a list of data loggers that have more than 2 measurements per day so we can clean up the database?
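One way such a list could be produced, sketched under the assumption that the measurements are available as `(logger_id, timestamp)` pairs. The function name, the logger IDs and the sample data are hypothetical; the real list would come from the source database.

```python
from collections import Counter
from datetime import datetime


def loggers_exceeding_daily_limit(records, limit=2):
    """Map each logger ID to the sorted list of dates on which it has
    more than `limit` measurements.

    `records` is an iterable of (logger_id, datetime) pairs.
    """
    per_day = Counter((logger, ts.date()) for logger, ts in records)
    flagged = {}
    for (logger, day), n in per_day.items():
        if n > limit:
            flagged.setdefault(logger, []).append(day)
    return {logger: sorted(days) for logger, days in flagged.items()}


# Hypothetical records (illustration only).
records = [
    ("LOGGER_A", datetime(2017, 11, 22, 0, 0)),
    ("LOGGER_A", datetime(2017, 11, 22, 12, 0)),
    ("LOGGER_B", datetime(2017, 11, 22, 0, 0)),
    ("LOGGER_B", datetime(2017, 11, 22, 11, 0)),
    ("LOGGER_B", datetime(2017, 11, 22, 12, 0)),  # third value that day
]

print(loggers_exceeding_daily_limit(records))
```

The day is taken from the timestamp as given, so the result depends on whether the input is in UTC or local time.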

DavorJ commented 5 years ago

> What timestamp are you using? The occurrence date in UTC?

Indeed, DRME_OCR_UTC_DTE. Can we assume that DRME_OCR_UTC_DTE is always available? That is, if it is partially or completely unavailable, should we throw an error?

fredericpiesschaert commented 5 years ago

We can assume that it will be available. Nevertheless, proper error handling is necessary.
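A minimal sketch of what such error handling might look like, assuming the DRME_OCR_UTC_DTE values arrive as strings in the `YYYY-MM-DD HH:MM:SS` form seen earlier in this thread and that missing values show up as `None`, an empty string or `NA`. The exception class and function name are hypothetical.

```python
from datetime import datetime


class MissingTimestampError(ValueError):
    """Raised when one or more timestamps are missing (hypothetical name)."""


def parse_timestamps(raw_values, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse all timestamps, or fail loudly if any of them is missing."""
    missing = [
        i for i, v in enumerate(raw_values)
        if v is None or str(v).strip() in ("", "NA")
    ]
    if missing:
        raise MissingTimestampError(
            f"{len(missing)} of {len(raw_values)} timestamps are missing "
            f"(first at row index {missing[0]})"
        )
    # strptime raises ValueError itself for malformed values.
    return [datetime.strptime(str(v).strip(), fmt) for v in raw_values]


try:
    parse_timestamps(["2017-11-22 10:00:00", "NA"])
except MissingTimestampError as err:
    print(f"Cannot fit a time-series model: {err}")
```

Failing on partially missing timestamps is the strict choice; dropping the affected rows and warning would be an alternative, at the cost of silently breaking the equal-spacing assumption.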

DavorJ commented 5 years ago

While I am at it, I wanted to see whether the data comply with the assumption of equal spacing in time.

I did not consider NA timestamps here.

The graph represents the entropy of the distribution of time differences between observations. It is a one-number summary (since I wasn't able to come up with anything better).

If it is 0, then all observations in the dataset are equally spaced (i.e. no disorder). The larger the number, the more disorder there is in the time distance between observations. (Disorder here does not necessarily mean that the points are out of sequence.)
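For illustration, a sketch of one way to compute such a number, assuming it is the Shannon entropy (natural logarithm) of the empirical distribution of differences between consecutive timestamps. The exact definition and log base behind the graph are not stated in this thread, so the function below is an assumption, not the method actually used.

```python
import math
from collections import Counter
from datetime import datetime, timedelta


def diff_entropy(timestamps):
    """Shannon entropy (in nats) of the distribution of time differences
    between consecutive timestamps, taken in the order given."""
    diffs = [
        (b - a).total_seconds()
        for a, b in zip(timestamps, timestamps[1:])
    ]
    if not diffs:
        raise ValueError("need at least two timestamps")
    n = len(diffs)
    counts = Counter(diffs)
    # H = sum over distinct differences of p * log(1 / p)
    return sum((c / n) * math.log(n / c) for c in counts.values())


start = datetime(2017, 1, 1)

# Hypothetical regular series: every step is exactly 12 hours.
regular = [start + timedelta(hours=12 * i) for i in range(10)]

# Hypothetical irregular series: steps of 11, 12 and 13 hours.
irregular = [start]
for hours in (11, 12, 13, 11, 12, 13):
    irregular.append(irregular[-1] + timedelta(hours=hours))

print(diff_entropy(regular))    # a single distinct difference
print(diff_entropy(irregular))  # three equally likely differences
```

With a single distinct difference the entropy is 0; with three equally frequent differences it equals log(3). Because only the distribution of differences matters, the measure does not distinguish a series with scattered irregularities from one with a single irregular stretch.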

For example: BAOL068X_180816.csv

The DRME_OCR_UTC_DTE_DIFF is in seconds and represents the time-difference between sequential points. The difference is sometimes 11, sometimes 12 and sometimes 13 hours.

Compared to BAOL028X_B5467.csv which has a much lower entropy:

The outlier is due to a large period without data.

For a more detailed overview see here.

For example:

[1] "BAOL057X_78680.csv"
Frequency table:
    0 43200 49463 
  365  1474     1

A time difference of 0 seconds occurs 365 times; these are just duplicates. A time difference of 43200 seconds (12 hours) occurs 1474 times. A time difference of 49463 seconds occurs once.
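As a sketch, a frequency table like this can be summarised into the number of duplicates, the dominant sampling interval and the remaining irregular steps. The counts below are the ones quoted for BAOL057X_78680.csv; the function name `summarize_diffs` is hypothetical.

```python
from collections import Counter


def summarize_diffs(freq):
    """Summarise a {time difference in seconds: count} table."""
    total = sum(freq.values())
    duplicates = freq.get(0, 0)
    # Dominant interval: the most frequent non-zero difference.
    nonzero = {d: n for d, n in freq.items() if d != 0}
    dominant, dominant_n = max(nonzero.items(), key=lambda item: item[1])
    return {
        "total": total,
        "duplicates": duplicates,
        "dominant_interval_s": dominant,
        "irregular": total - duplicates - dominant_n,
    }


# Counts from the frequency table above.
freq = Counter({0: 365, 43200: 1474, 49463: 1})
summary = summarize_diffs(freq)
print(summary)
```

For this file, 365 of the 1840 differences are duplicates and only one step deviates from the 12-hour interval, which suggests the series would be regular once the duplicates are cleaned up.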

To be discussed

fredericpiesschaert commented 5 years ago

BAOL068X is an unfortunate example because the data were imported twice with different timestamps. This needs to be fixed in the source database.

DOV-Vlaanderen / groundwater-logger-validation

Timestamps #15