JuliaTime / TimeZones.jl

IANA time zone database access for the Julia programming language

Arrays of `DateTimes` in variable time zones #319

Open doorisajar opened 3 years ago

doorisajar commented 3 years ago

Broadcasting ZonedDateTime correctly converts a time series of naive DateTimes, unless the time zone is variable and the time series crosses a "fall back" transition.

The optional arguments for resolving this all work correctly for single DateTimes, but any single choice of argument will give the wrong result for at least one timestamp when broadcast over a sorted array that crosses a fall back.

There doesn't seem to be a solution for this in TimeZones yet, unless I'm missing something. For cases where broadcasting doesn't work -- which I think are probably pretty common for users of TimeZones, they certainly are for me -- it would be useful to have a method that can handle arrays of sorted DateTimes. Maybe something like:

ZonedDateTime(datetimes::Array{DateTime,1}, tz::VariableTimeZone)

For a sorted 1D array of DateTimes crossing a fall back, there is enough information in the time series to resolve the ambiguity.

I'd be willing to contribute to a PR on this if folks agree it would be useful.
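A minimal sketch of what such a method could look like (hedged: `zone_sorted` is a hypothetical name, not part of TimeZones.jl; it leans on the internal `TimeZones.first_valid`/`TimeZones.last_valid` helpers and assumes every element is a valid local time, so nonexistent "spring forward" times would need separate handling):

```julia
using Dates, TimeZones

# Hypothetical helper (not in TimeZones.jl): zone a sorted vector of naive
# DateTimes, using the sort order to disambiguate repeated "fall back" times.
function zone_sorted(dts::Vector{DateTime}, tz::VariableTimeZone)
    out = ZonedDateTime[]
    for dt in dts
        fv = TimeZones.first_valid(dt, tz)
        lv = TimeZones.last_valid(dt, tz)
        if fv == lv
            push!(out, fv)  # unambiguous local time
        else
            # Ambiguous: take the first occurrence unless that would move
            # backwards in UTC relative to the previous element.
            push!(out, (!isempty(out) && fv < last(out)) ? lv : fv)
        end
    end
    return out
end
```

For a sorted fall-back series such as local times `00:30, 01:00, 01:30, 01:00, 01:30, 02:00` on 2020-11-01 in `America/Winnipeg`, this resolves the first pair of ambiguous times to the first occurrence and the repeated pair to the second, keeping the result increasing in UTC.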

omus commented 3 years ago

Unfortunately there are a variety of answers that could be generated when converting local DateTimes into ZonedDateTime. You can use the occurrence argument of the ZonedDateTime constructor to work around the problem you describe but it may not provide the answer you want:

julia> using Dates, TimeZones

julia> wpg = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5)

julia> collect(ZonedDateTime(2020,11,1,wpg):Hour(1):ZonedDateTime(2020,11,1,2,wpg))
4-element Array{ZonedDateTime,1}:
 2020-11-01T00:00:00-05:00
 2020-11-01T01:00:00-05:00
 2020-11-01T01:00:00-06:00
 2020-11-01T02:00:00-06:00

julia> x = DateTime(2020,11,1):Hour(1):DateTime(2020,11,1,3)
DateTime("2020-11-01T00:00:00"):Hour(1):DateTime("2020-11-01T03:00:00")

julia> ZonedDateTime.(x, wpg)
ERROR: AmbiguousTimeError: Local DateTime 2020-11-01T01:00:00 is ambiguous within America/Winnipeg
...

julia> ZonedDateTime.(x, wpg, 1)  # For an ambiguous case: select the first occurrence
4-element Array{ZonedDateTime,1}:
 2020-11-01T00:00:00-05:00
 2020-11-01T01:00:00-05:00
 2020-11-01T02:00:00-06:00
 2020-11-01T03:00:00-06:00

julia> ZonedDateTime.(x, wpg, 2)  # For an ambiguous case: select the second occurrence
4-element Array{ZonedDateTime,1}:
 2020-11-01T00:00:00-05:00
 2020-11-01T01:00:00-06:00
 2020-11-01T02:00:00-06:00
 2020-11-01T03:00:00-06:00

Getting the output shown by the original ZonedDateTime example is harder. The best option is to use a range in that case. If you can provide a concrete example we may be able to come up with a solution.

doorisajar commented 3 years ago

I agree that it's not trivial, but if we know the time zone and know that the sequence is in order, we do have enough information to properly apply the variable time zone to each timestamp in the sequence.

Here's a short example that includes the spring ahead and fall back from last year. I included the spring ahead just to make sure we don't break that while experimenting with handling the fall back.

sa = vcat([DateTime("2020-03-08T00:00:00") + 30 * Minute(m) for m in 1:3], [DateTime("2020-03-08T02:30:00") + 30 * Minute(m) for m in 1:4])

fb = vcat([DateTime("2020-11-01T00:00:00") + 30 * Minute(m) for m in 1:4], [DateTime("2020-11-01T01:00:00") + 30 * Minute(m) for m in 1:4])

dts = vcat(sa, fb)
julia> dts
15-element Array{DateTime,1}:
 2020-03-08T00:30:00
 2020-03-08T01:00:00
 2020-03-08T01:30:00
 2020-03-08T03:00:00
 2020-03-08T03:30:00
 2020-03-08T04:00:00
 2020-03-08T04:30:00
 2020-11-01T00:30:00
 2020-11-01T01:00:00
 2020-11-01T01:30:00
 2020-11-01T02:00:00
 2020-11-01T01:30:00
 2020-11-01T02:00:00
 2020-11-01T02:30:00
 2020-11-01T03:00:00

One could envision examples like this at any timeseries resolution, or with ragged timeseries. An option to address it might be to use the internal API to compare first_valid and last_valid and look at the changes in that sequence:

julia> fv = TimeZones.first_valid.(dts, tz)
julia> lv = TimeZones.last_valid.(dts, tz)
julia> fv .!= lv
15-element BitArray{1}:
 0
 0
 0
 0
 0
 0
 0
 0
 1
 1
 0
 1
 0
 0
 0

[edit: removed an example I thought was working, but wasn't :) ]

Whether via cumulative sums or run length encoding (or other means), it should be possible to detect the regions of the sorted timeseries that need to receive special treatment, and apply the appropriate conversions in the appropriate places.
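One way that detection could be sketched (hedged: `resolve_ambiguous` is a hypothetical name, it relies on the internal `first_valid`/`last_valid` helpers, and a running maximum of the naive timestamps stands in for cumulative sums or run length encoding; duplicate timestamps within a single pass through the fold would confuse it):

```julia
using Dates, TimeZones

# Hypothetical sketch: resolve each ambiguous entry by whether its naive
# timestamp has already elapsed, i.e. whether the fold has been crossed.
function resolve_ambiguous(dts::Vector{DateTime}, tz::VariableTimeZone)
    fv = TimeZones.first_valid.(dts, tz)
    lv = TimeZones.last_valid.(dts, tz)
    out = copy(fv)                 # default to the first occurrence
    seen_max = typemin(DateTime)   # largest naive timestamp seen so far
    for i in eachindex(dts)
        if fv[i] != lv[i] && dts[i] <= seen_max
            out[i] = lv[i]         # repeat pass through the fold
        end
        seen_max = max(seen_max, dts[i])
    end
    return out
end
```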

doorisajar commented 3 years ago

I'm not sure I understand the output of TimeZones.transition_range, but that might be an option -- get the range of possible ambiguity, then do one pass along the sorted timestamps appearing within that range and apply last_valid to ones that have already elapsed according to their naive/unzoned timestamp.

I sketched out this starting point for identifying the window of ambiguity:


ranges = TimeZones.transition_range.(datetimes, tz, Local)

transitions = findall(length.(unique.(ranges)) .> 1)

fallback_window = first(transitions):last(transitions)

Outside fallback_window, we can apply ZonedDateTime to datetimes. Inside it is where special logic is needed. I've tested a couple of simple approaches, but don't have something that generalizes well enough yet.
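A hedged completion of that starting point, on a small fall-back series (assumptions: `TimeZones.transition_range` is internal and may change; counting prior sightings of the same naive timestamp to choose the occurrence only works for a sorted, regularly sampled series where the repeat is exact):

```julia
using Dates, TimeZones

tz = tz"America/Winnipeg"
datetimes = [DateTime(2020, 11, 1, 0, 30), DateTime(2020, 11, 1, 1, 0),
             DateTime(2020, 11, 1, 1, 0), DateTime(2020, 11, 1, 2, 0)]

# A local time is ambiguous when it maps onto more than one transition.
ranges = TimeZones.transition_range.(datetimes, tz, Local)
transitions = findall(length.(ranges) .> 1)
fallback_window = first(transitions):last(transitions)

# Inside the window, count prior sightings of the same naive timestamp to
# pick the occurrence; outside it, plain ZonedDateTime suffices.
zoned = map(eachindex(datetimes)) do i
    if length(ranges[i]) > 1
        occ = 1 + count(==(datetimes[i]), view(datetimes, 1:i-1))
        ZonedDateTime(datetimes[i], tz, occ)
    else
        ZonedDateTime(datetimes[i], tz)
    end
end
```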