JuliaTime / TimeZones.jl

IANA time zone database access for the Julia programming language
Other
86 stars 51 forks source link

Add support for writing to the `tzfile` format #379

Closed omus closed 2 years ago

omus commented 2 years ago

As per #359 the plan was to use the tzfile format (or something similar) as the binary format we use to store time zone data as a simple binary serialization format. The changes here add that support add that support as well as change the current read_tzfile interface to be TZFile.read.

codecov-commenter commented 2 years ago

Codecov Report

Merging #379 (c4f36c8) into master (5e08518) will increase coverage by 0.39%. The diff coverage is 98.42%.

:exclamation: Current head c4f36c8 differs from pull request most recent head 46a781a. Consider uploading reports for the commit 46a781a to get more accurate results

@@            Coverage Diff             @@
##           master     #379      +/-   ##
==========================================
+ Coverage   94.09%   94.48%   +0.39%     
==========================================
  Files          30       32       +2     
  Lines        1540     1649     +109     
==========================================
+ Hits         1449     1558     +109     
  Misses         91       91              
Impacted Files Coverage Δ
src/TimeZones.jl 100.00% <ø> (ø)
src/tzfile/read.jl 96.51% <96.51%> (ø)
src/local.jl 88.88% <100.00%> (ø)
src/tzfile/utils.jl 100.00% <100.00%> (ø)
src/tzfile/write.jl 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 5e08518...46a781a. Read the comment docs.

omus commented 2 years ago

Part of the work here was also to determine if the tzfile format was feasible for storing our internal time zone representation. I'll call out some of my analysis was incorrect but I'll post it here anyway as this was the journey I took:


The tzfile format is for the most part compatible with the internal representation we use for VariableTimeZone. One significant deviation is that the ttinfo struct uses a one-byte boolean (tt_isdst) to represent if a time zone transition is or is not DST. The TimeZones.jl package however allows the UTC offset and the DST offset to be specified independently allowing us greater flexibility in time zones we can support. When parsing the tzfile format as a VariableTimeZone we assume that each transition only changes the UTC offset or the DST offset but not both.

Let's validate that there exist time zones which change both of these offsets at once:

julia> using TimeZones

julia> tzif_incompatible(::FixedTimeZone) = false
tzif_incompatible (generic function with 1 method)

julia> tzif_incompatible(tz::VariableTimeZone) = tzif_incompatible_index(tz) !== nothing
tzif_incompatible (generic function with 2 methods)

julia> function tzif_incompatible_index(tz::VariableTimeZone)
           t, remaining = Iterators.peel(tz.transitions)
           std = t.zone.offset.std
           dst = t.zone.offset.dst

           for (i, t) in enumerate(remaining)
               if std != t.zone.offset.std && dst != t.zone.offset.dst
                   return i
               elseif std != t.zone.offset.std
                   std = t.zone.offset.std
               else
                   dst = t.zone.offset.dst
               end
           end

           return nothing
       end
tzif_incompatible_index (generic function with 1 method)

julia> filter(tzif_incompatible, all_timezones())
154-element Vector{TimeZone}:
 Africa/Algiers (UTC+1)
 Africa/Casablanca (UTC+0/UTC+1)
 Africa/El_Aaiun (UTC+0/UTC+1)
 Africa/Tripoli (UTC+2)
 ⋮
 US/Alaska (UTC-9/UTC-8)
 US/Aleutian (UTC-10/UTC-9)
 US/Indiana-Starke (UTC-6/UTC-5)
 W-SU (UTC+3)

julia> tz = first(ans)
Africa/Algiers (UTC+1)

julia> i = tzif_incompatible_index(tz)
28

julia> tz.transitions[i:i + 1]
2-element Vector{TimeZones.Transition}:
 1977-05-06T00:00:00 UTC+0/+1 (WEST)
 1977-10-20T23:00:00 UTC+1/+0 (CET)

Since there exists transitions where both of these offsets change at once we'll need to need to represent the DST offset as a value instead of a boolean to ensure a lossless conversion.

As the tt_isdst is stored as a one-byte value to store a boolean there are bits we can use to encode the DST offset value. As long as any reader of the tzfile format treats zero as not-DST and any non-zero value as DST it shouldn't matter what non-zero value is stored in this field. As we store seconds internally for the DST offset to match the UTC offset it would make sense to store the DST offset this way. However with one-byte we could only store up to 255 seconds which is far short of the typical 1-hour DST offset. Let's investigate what restrictions we need to adhere to:

julia> dst_negative(::FixedTimeZone) = false
dst_negative (generic function with 1 method)

julia> function dst_negative(tz::VariableTimeZone)
           for t in tz.transitions
               if t.zone.offset.dst < Second(0)
                   return true
               end
           end
           return false
       end
dst_negative (generic function with 2 methods)

julia> filter(dst_negative, all_timezones())  # Validate that negative DST offsets exist
7-element Vector{TimeZone}:
 Africa/Casablanca (UTC+0/UTC+1)
 Africa/El_Aaiun (UTC+0/UTC+1)
 Africa/Windhoek (UTC+2)
 Eire (UTC+0/UTC+1)
 Europe/Bratislava (UTC+1/UTC+2)
 Europe/Dublin (UTC+0/UTC+1)
 Europe/Prague (UTC+1/UTC+2)

julia> dst_offsets(tz::FixedTimeZone) = Set{Second}([tz.offset.dst])
dst_offsets (generic function with 2 methods)

julia> function dst_offsets(tz::VariableTimeZone)
           results = Set{Second}()
           for t in tz.transitions
                if t.zone.offset.dst in results
                    @show tz t.zone.offset.dst
                end
                push!(results, t.zone.offset.dst)
           end
           return results
       end
dst_offsets (generic function with 2 methods)

julia> offsets = mapreduce(dst_offsets, union, all_timezones())
Set{Second} with 7 elements:
  Second(-3600)
  Second(1200)
  Second(5400)
  Second(1800)
  Second(3600)
  Second(7200)
  Second(0)

julia> sort!(Minute.(offsets))  # All unique DST offsets that exist in tzdata
7-element Vector{Minute}:
 -60 minutes
 0 minutes
 20 minutes
 30 minutes
 60 minutes
 90 minutes
 120 minutes

Sticking to the one-byte value we could use 1-minute steps to represent the DST offset which would allow us to store offsets ±2 hours or with 10-minute steps ±42 hours.

After thinking about compatibility more the tzfile format stores both the UTC and DST offset as a single value and uses the tt_isdst boolean just to denote if this transition is considered DST. Storing the DST offset as I intended would result in typical tzfile consumers failing to notice any offset change at all as the tt_utoff field would remain the same. Maintaining compatibility would mean we'd have to perform the following operations to convert the tzfile format into the internal representation:

dst_offset = tt_isdst
utc_offset = tt_utoff - dst_offset

Encoding the DST offset this way should allow us to extract the UTC/DST offsets separately and still maintain backwards compatibility. However, I'd need to dig into this further as we may need to have some sort of bit flag in the tzfile which lets us determine if we are interpreting the file in this new way or using our existing heuristic which attempts to extract the DST offset from past information and the tt_isdst field (it's possible that these don't actually interfere with each other but they could).

As the tzfile also always includes the v1 in additional to the new version we can both reduce size and complexity by using a similar but custom file format. This also has the advantage letting us encode additional information such as the VariableTimeZone cutoff and cut out the currently unused leap-second information.

Since the tzdata will be stored in a separate Julia repo TimeZones.jl depends on we can make use of the typical semver usage to ensure format compatibility.


After getting TZFile.write functional I re-did some of this analysis and discovered that the heuristic was slightly better than I believed it to be in handling UTC/DST offset changes that occurred at the same time. As it turns out the real killer is sequential transitions where the UTC/DST offsets change and the previous DST offset and current DST offset are non-zero:

julia> using TimeZones

julia> tzif_incompatible(::FixedTimeZone) = false
tzif_incompatible (generic function with 1 method)

julia> tzif_incompatible(tz::VariableTimeZone) = tzif_incompatible_index(tz) !== nothing
tzif_incompatible (generic function with 2 methods)

julia> function tzif_incompatible_index(tz::VariableTimeZone)
           t, remaining = Iterators.peel(tz.transitions)
           prev_std = t.zone.offset.std
           prev_dst = t.zone.offset.dst

           for (i, t) in enumerate(remaining)
              if prev_std != t.zone.offset.std && prev_dst != t.zone.offset.dst && !iszero(prev_dst) && !iszero(t.zone.offset.dst)
                  return i + 1
               end
               prev_std = t.zone.offset.std
               prev_dst = t.zone.offset.dst
           end

           return nothing
       end
tzif_incompatible_index (generic function with 1 method)

julia> filter(tzif_incompatible, all_timezones())
3-element Vector{TimeZone}:
 Europe/Moscow (UTC+3)
 Europe/Paris (UTC+1/UTC+2)
 W-SU (UTC+3)

julia> i = tzif_incompatible_index(tz"Europe/Moscow")
9

julia> tz"Europe/Moscow".transitions[(i - 1):(i + 1)]
3-element Vector{TimeZones.Transition}:
 1919-05-31T19:28:41 UTC+02:31:19/+2 (MDST)
 1919-07-01T00:00:00 UTC+3/+1 (MSD)
 1919-08-15T20:00:00 UTC+3/+0 (MSK)

In validating the read/write round-trip was easy to determine which time zones include some incompatibility between the tzfile format and our internal representation:

julia> function tzif_version(tz)
           io = IOBuffer()
           TZFile.write(seekstart(io), tz)
           return TZFile.read(seekstart(io))(tz.name)
       end
tzif_version (generic function with 1 method)

julia> tzif_tz_compatible(tz) = tzif_version(tz) == tz
tzif_tz_compatible (generic function with 1 method)

julia> filter(!tzif_tz_compatible, all_timezones())
113-element Vector{TimeZone}:
 America/Argentina/Buenos_Aires (UTC-3)
 America/Argentina/Catamarca (UTC-3)
 America/Argentina/ComodRivadavia (UTC-3)
 America/Argentina/Cordoba (UTC-3)
 ⋮
 Israel (UTC+2/UTC+3)
 Pacific/Rarotonga (UTC-10)
 Portugal (UTC+0/UTC+1)
 W-SU (UTC+3)
omus commented 2 years ago

I'm going to forge ahead with this and make a release