EdwinTh / padr

Padding of missing records in time series
https://edwinth.github.io/padr/
Other
132 stars 12 forks source link

`pad` doesn't handle POSIXt as expected when by="day" crossing daylight savings time boundary #78

Closed TylerGrantSmith closed 3 years ago

TylerGrantSmith commented 3 years ago

It seems there are a few different problems that lead to POSIXt vectors not being handled well by pad when the dates cross a daylight savings bounary.

cross_dst <- data.frame(day = as.POSIXct(c('2020-03-08', '2020-03-09')))

# errors with "day"
padr::pad(cross_dst, interval = "day")
#> Error: The specified interval is invalid for the datetime variable.
#>   Not all original observation are in the padding.
#>   If you want to pad at this interval, aggregate the data first with thicken.

# different error with "DSTday"
padr::pad(cross_dst, interval = "DSTday")
#> Error in if (n > (threshold * 10^6)) {: missing value where TRUE/FALSE needed

Created on 2020-09-17 by the reprex package (v0.3.0)

In the first case, the issue is traced to

padr:::span_from_min_max_single
#> function (start, end, interval, id_vars) 
#> {
#>     ret <- data.frame(span = seq(start, end, by = interval))
#>     return(as.data.frame(cbind(ret, id_vars)))
#> }
#> <bytecode: 0x000000001f02dd08>
#> <environment: namespace:padr>

# the problem can be understood from this excerpt of `?seq.POSIXt`
#   
#  "The difference between "day" and "DSTday" is that the former ignores changes 
#  to/from daylight savings time and the latter takes the same clock time each
#  day. "week" ignores DST (it is a period of 144 hours), but "7 DSTdays" can be 
#  used as an alternative. "month" and "year" allow for DST."

# would expect both days to appear, but daylight savings messes it up
seq(as.POSIXct("2020-03-08"),as.POSIXct("2020-03-09"), "day")
#> [1] "2020-03-08 CST"

# here we see that one "day" is 86400 seconds later, which is 1am the 
# following day
seq(as.POSIXct("2020-03-08"),as.POSIXct("2020-03-10"), "day")
#> [1] "2020-03-08 00:00:00 CST" "2020-03-09 01:00:00 CDT"

# corrected
seq(as.POSIXct("2020-03-08"),as.POSIXct("2020-03-09"), "DSTday")
#> [1] "2020-03-08 CST" "2020-03-09 CDT"

The second issue is traced to padr:::convert_int_to_hours which doesn't accept DSTday and highlights a disagreement in the documentation for padr::pad which says about the interval parameter:

The interval of the returned datetime variable. Any character string that would be accepted by seq.Date() or seq.POSIXt. When NULL the the interval will be equal to the interval of the datetime variable. When specified it can only be lower than the interval and step size of the input data. See Details.

which is not the case (as contradicted in the Details).

The function padr:::get_return_rows is actually problematic in both cases, because the diffs produced in this case is "23 hours".

EdwinTh commented 3 years ago

First of all, thanks for your efforts to find the root of the problem.

Would you be so kind to explicitly assign the timezone to the POSIXt variables in your examples. I cannot reproduce the first error and I think it is because the default timezone on my system is different from yours, with a different moment of going into dst. Thanks!

TylerGrantSmith commented 3 years ago

Sorry about that!

cross_dst <- data.frame(day = as.POSIXct(c('2020-03-08', '2020-03-09'), 
                                         'America/Chicago'))

# errors with "day"
padr::pad(cross_dst, interval = "day")
#> Error: The specified interval is invalid for the datetime variable.
#>   Not all original observation are in the padding.
#>   If you want to pad at this interval, aggregate the data first with thicken.
# different error when not specified
padr::pad(cross_dst)
#> Error in seq.int(0, to0 - from, by): invalid '(to - from)/by'

Created on 2020-09-18 by the reprex package (v0.3.0)

EdwinTh commented 3 years ago

Now I can indeed reproduce, thanks. Will look into it when time allows.

EdwinTh commented 3 years ago

Did some more digging, problem occurs at weeks also

cross_week = data.frame(day = as.POSIXct(c('2020-03-08', '2020-03-22'), 
                                           'America/Chicago'))

padr::pad(cross_week, interval = "week")
EdwinTh commented 3 years ago

Fixed it, thanks again for your help!