Closed DavorJ closed 5 years ago
Another assumption of ARIMA and most time-series models is that the points are equally spaced in time.
DYLS010X_B5259.csv has duplicate measurements at different timestamps:
What should we do with these cases?
@DavorJ
@fredericpiesschaert could we generate a list of data loggers that have more than 2 measurements per day so we can clean up the database?
- What timestamp are you using? The occurrence date in UTC?
Indeed, DRME_OCR_UTC_DTE. Can we assume that DRME_OCR_UTC_DTE is always available? I.e. if it is partial or unavailable, should we throw an error?
We can assume that it will be available. Nevertheless, proper error handling is necessary.
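A minimal sketch of what such error handling could look like, in Python for illustration. The function name `parse_timestamps`, the representation of rows as dictionaries, and the timestamp format `%Y-%m-%d %H:%M:%S` are assumptions, not the actual implementation:

```python
from datetime import datetime

def parse_timestamps(rows, column="DRME_OCR_UTC_DTE", fmt="%Y-%m-%d %H:%M:%S"):
    """Parse the timestamp column; raise ValueError on missing or partial values.

    `rows` is assumed to be a list of dicts (one per measurement).
    `fmt` is an assumed format; adjust it to what the database exports.
    """
    parsed = []
    for i, row in enumerate(rows, start=1):
        raw = row.get(column)
        if raw is None or raw.strip() == "":
            raise ValueError(f"row {i}: {column} is missing")
        try:
            parsed.append(datetime.strptime(raw.strip(), fmt))
        except ValueError as exc:
            # A date without a time part (a partial timestamp) ends up here.
            raise ValueError(
                f"row {i}: {column} is not a full timestamp: {raw!r}"
            ) from exc
    return parsed
```

Failing on the first bad row is a deliberate choice here: silently dropping rows would hide exactly the kind of gaps discussed in this issue.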
While I am at it, I wanted to see whether the data comply with the assumption of equal spacing in time. I did not consider NA timestamps here.
The graph shows the entropy of the distribution of time differences between observations. It is a one-number summary (I wasn't able to come up with anything better).
If it is 0, all observations in the dataset are equally spaced (i.e. entropy is low, no disorder). The larger the number, the more disorder in the time distance between observations. (The disorder is not necessarily sequential.)
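To make the summary concrete, here is a rough sketch of such an entropy measure, in Python for illustration. It assumes Shannon entropy in bits over the empirical distribution of successive time differences; the graph may have been produced with a different log base or estimator, and the name `time_diff_entropy` is hypothetical:

```python
import math
from collections import Counter
from datetime import datetime, timedelta

def time_diff_entropy(timestamps):
    """Shannon entropy (bits) of the distribution of successive time differences.

    Returns 0.0 when every difference is identical (equally spaced data).
    """
    diffs = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
    counts = Counter(diffs)
    n = len(diffs)
    # sum of p * log2(1/p) over the distinct time differences
    return sum((c / n) * math.log2(n / c) for c in counts.values())

start = datetime(2017, 1, 1)
# Synthetic series: strictly 12-hourly versus alternating 11 h / 12 h gaps.
regular = [start + timedelta(hours=12 * k) for k in range(5)]
irregular = [start, start + timedelta(hours=11), start + timedelta(hours=23)]
print(time_diff_entropy(regular))    # 0.0
print(time_diff_entropy(irregular))  # 1.0
```

Because the measure only looks at how often each difference occurs, it ignores the order in which the differences appear, which is why the disorder it reports is not necessarily sequential.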
For example: BAOL068X_180816.csv
The DRME_OCR_UTC_DTE_DIFF is in seconds and represents the time difference between sequential points. The difference is sometimes 11, sometimes 12 and sometimes 13 hours.
Compared to BAOL028X_B5467.csv which has a much lower entropy:
The outlier is due to a large period without data.
For a more detailed overview see here.
For example:
[1] "BAOL057X_78680.csv"
Frequency table:
Time difference (s):    0  43200  49463
Count:                365   1474      1
A time difference of 0 seconds occurs 365 times (these are just duplicates). A time difference of 43200 seconds (12 hours) occurs 1474 times. A time difference of 49463 seconds occurs once.
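A frequency table like the one above could be built roughly as follows. This is a sketch in Python on synthetic data, not the code used for the overview, and `diff_frequency_table` is a hypothetical name:

```python
from collections import Counter
from datetime import datetime, timedelta

def diff_frequency_table(timestamps):
    """Count how often each time difference (in seconds) occurs between
    sequential points, in the order the points are stored."""
    diffs = [int((b - a).total_seconds()) for a, b in zip(timestamps, timestamps[1:])]
    return sorted(Counter(diffs).items())

start = datetime(2017, 1, 1)
# Synthetic example: 12-hourly readings with the second reading duplicated.
stamps = [
    start,
    start + timedelta(hours=12),
    start + timedelta(hours=12),  # duplicate timestamp, gives a 0-second difference
    start + timedelta(hours=24),
]
for diff, count in diff_frequency_table(stamps):
    print(diff, count)
# 0 1
# 43200 2
```

A non-zero count for the 0-second difference is a quick flag for duplicated timestamps.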
To be discussed
BAOL068X is an unfortunate example because the data were imported twice with different timestamps. This needs to be fixed in the source database.
Overview of percentage of data without a timestamp:
Timestamps are required for most time-series models (including ARIMA).
If there are no timestamps, we could assume that the points are equally spaced in time and stored in the DB in chronological order, but this is dangerous, as the data that do have timestamps show:
For example BAOL093X_78679.csv data is not sequentially ordered after row 3024 (timestamp 2017-11-22 10:00:00).
For some reason, this time-series also shows high hysteresis after that time, but no drift? All points are equally (12h) spaced in time.
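A check for this kind of ordering problem could look roughly like the sketch below (Python, with the hypothetical helper `first_out_of_order`). It only reports where the stored order first breaks; it is not how the row number above was found:

```python
from datetime import datetime

def first_out_of_order(timestamps):
    """Return the 1-based row number of the first timestamp that is earlier
    than its predecessor, or None if the rows are in chronological order."""
    for i in range(1, len(timestamps)):
        if timestamps[i] < timestamps[i - 1]:
            return i + 1
    return None

# Synthetic example: the third row jumps back in time.
rows = [
    datetime(2017, 11, 21, 22, 0),
    datetime(2017, 11, 22, 10, 0),
    datetime(2017, 11, 20, 10, 0),
]
print(first_out_of_order(rows))  # 3
```

When timestamps are available, sorting on them explicitly before computing differences avoids relying on the storage order at all.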
cc @fredericpiesschaert, @mathiaswackenier