JuliaTime / TimeZones.jl

IANA time zone database access for the Julia programming language
Other
87 stars 52 forks source link

Simplify hash of VariableTimeZone #281

Closed oxinabox closed 4 years ago

oxinabox commented 4 years ago

I think the gain we get out of reducing collisions for the case of two timezones with same name but different transitions and cutoffs is not worth the extra time that hashing takes.

omus commented 4 years ago

I agree we can probably make this more performant but this is approach is definitely on the reckless side. If anything we should actually ignore the name and just look at the contents

iamed2 commented 4 years ago

this is approach is definitely on the reckless side

How often do you expect there to be multiple different time zones with the same name?

oxinabox commented 4 years ago

TimeZone names used in practice are almost always from the standard set

For isequal we are requiring equal name, + equal content. so for hash equal of any of the above is enough.

Equal content will have far more collisions as there is only a bit more than 24 basic offsets Might be 100 i guess with tthe different transitions

omus commented 4 years ago

There are a few tzdata named timezones like "MDT" that could defined by users. There are some plans to experiment time zones without pre-computed transitions which would definitely have hash collisions in that case. At a minimum we need to include the type in the hash here.

I think a less reckless way would be to precompute the hashes of the tzdata timezones which would eliminate the overhead and still support custom VariableTimeZones

iamed2 commented 4 years ago

Hash collisions are okay as long as they're not extremely common. It's one additional isequal call.

At a minimum we need to include the type in the hash here.

Why?

codecov-commenter commented 4 years ago

Codecov Report

Merging #281 into master will increase coverage by 1.16%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #281      +/-   ##
==========================================
+ Coverage   92.41%   93.58%   +1.16%     
==========================================
  Files          30       30              
  Lines        1398     1527     +129     
==========================================
+ Hits         1292     1429     +137     
+ Misses        106       98       -8     
Impacted Files Coverage Δ
src/types/variabletimezone.jl 100.00% <100.00%> (+9.09%) :arrow_up:
src/build.jl 100.00% <0.00%> (ø)
src/rounding.jl 100.00% <0.00%> (ø)
src/TimeZones.jl 100.00% <0.00%> (ø)
src/adjusters.jl 100.00% <0.00%> (ø)
src/interpret.jl 100.00% <0.00%> (ø)
src/conversions.jl 100.00% <0.00%> (ø)
src/tzdata/archive.jl 100.00% <0.00%> (ø)
src/types/fixedtimezone.jl 100.00% <0.00%> (ø)
src/utils.jl 97.56% <0.00%> (+0.06%) :arrow_up:
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update cf53361...5e05f2b. Read the comment docs.

omus commented 4 years ago

I'll explain some of my points in greater detail. Let's start off with the rule which under no circumstances should we break:

isequal(x,y) must imply that hash(x) == hash(y)

This is taken directly from the help for isequal. We shouldn't violate this rule (in reference to: https://github.com/JuliaTime/TimeZones.jl/pull/281#issuecomment-686763671)

Next,

Ideally, hashing is value-based and different representations have the same hash (e.g. a ranges and vectors hash to be the same if the values they include are identical)

For VariableTimeZone this would mean another representation of the same time zone (possibly one that dynamically computes transitions) should ideally hash to be the same. I bring this point up just as a consideration. Using strictly the name would work here for different types but two time zones that are the same will differing names would fail this check. To me this makes me thing that names should probably be ignored for the purposes of hashing (including names in the transitions).

Finally, definitely an unofficial rule:

Dissimilar types should not hash to be same value (e.g. Int vs. Char)

With this particular PR we're only hashing the string which currently results in hash("UTC") == hash(tz"UTC") which isn't ideal. We had a similar problem with hashing of ZonedDateTime and there we added a seed to avoid collisions between hash(::DateTime) and hash(::ZonedDateTime). This is why I brought up including the type in the hash.

iamed2 commented 4 years ago

isequal(x,y) must imply that hash(x) == hash(y)

Under this rule it is perfectly okay to have hash(x) == hash(y) and !isequal(x, y). In fact, this is inevitable if there are more unique values than there are UInts.

Ideally, hashing is value-based and different representations have the same hash (e.g. a ranges and vectors hash to be the same if the values they include are identical)

Not sure where you found this, as it's not in the help for isequal or hash in my Julia. This only applies when isequal(x, y) is true; this exists so that they are recorded as identical as keys a dictionary. I definitely think we should be able to store time zones with different names but the same transitions as different keys in a dictionary.

With this particular PR we're only hashing the string which currently results in hash("UTC") == hash(tz"UTC") which isn't ideal. We had a similar problem with hashing of ZonedDateTime and there we added a seed to avoid collisions between hash(::DateTime) and hash(::ZonedDateTime). This is why I brought up including the type in the hash.

Putting the type in the hash seems fine to me then.

oxinabox commented 4 years ago

For VariableTimeZone this would mean another representation of the same time zone (possibly one that dynamically computes transitions) should ideally hash to be the same.

They are not the same value. isequals defines what we consider to be the same value. and isequal says that you need to have the same name. I can imagine that we might want to redefine isequal that way, but it is out of scope for this PR.

Ranges and vectors with the same value are considered equal by julia.

julia> isequal(1:4, [1,2,3,4])
true

Your second rule is just another way of saying the first.

With this particular PR we're only hashing the string which currently results in hash("UTC") == hash(tz"UTC") which isn't idea

Yeah, I didn't consider it a problem, but if you do, fair enough.

omus commented 4 years ago

Under this rule it is perfectly okay to have hash(x) == hash(y) and !isequal(x, y)

You are correct here. I was thinking about why I thought (hash(x) == hash(y)) == isequal(x, y) and I believe it is because DataFrames used to (possibly still) uses hash for joining and which they internally label as isequal functions. I usually make have made (hash(x) == hash(y)) == isequal(x, y) because of this. As VariableTimeZone isn't likely to be added as a column in a DataFrame it probably doesn't matter in this case.

I definitely think we should be able to store time zones with different names but the same transitions as different keys in a dictionary.

What about time zones with the same name but different transitions being stored in the same dictionary?

iamed2 commented 4 years ago

What about time zones with the same name but different transitions being stored in the same dictionary?

Those can be stored as different keys no problem, since isequal will be false. Collisions are fine, they are disambiguated by a cheap isequal check.

iamed2 commented 4 years ago

I believe it is because DataFrames used to (possibly still) uses hash for joining and which they internally label as isequal functions

This is a serious bug if it exists, but I don't believe it does (anymore?). Looking at their code, they seem to use isequal to disambiguate, which is correct: https://github.com/JuliaData/DataFrames.jl/blob/fb4e184f3b3997da8680668fbf7fcc789811eb18/src/dataframerow/utils.jl#L126

omus commented 4 years ago

Collisions are fine, they are disambiguated by a cheap isequal check.

TIL. I definitely did not know this that hash collisions had a disambiguation check. This does change things:

julia> using TimeZones

julia> tz1 = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5)

julia> tz2 = VariableTimeZone("America/Winnipeg", tz1.transitions[1:1], nothing)
America/Winnipeg (UTC-06:28:36)

julia> Dict(tz1 => 'a', tz2 => 'b')
Dict{VariableTimeZone,Char} with 2 entries:
  tz"America/Winnipeg"                      => 'a'
  VariableTimeZone("America/Winnipeg", ...) => 'b'

julia> hash(tz1) == hash(tz2)
true
omus commented 4 years ago

As we haven't done so yet I'll post a benchmark. Setup:

julia> using TimeZones, BenchmarkTools^C

julia> tz = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5

Current master

julia> @btime hash($tz)
  14.333 μs (187 allocations: 5.84 KiB)
0xe223c497ca74d779

This PR

julia> @btime hash($tz)
  17.395 ns (0 allocations: 0 bytes)
0x99df18304ac43999
omus commented 4 years ago

I'll fiddle with this implementation and DataFrames to try and ensure we're not missing anything. If those tests work out I think we can proceed with this.

omus commented 4 years ago

Nothing strange when working with DataFrames

julia> using DataFrames, TimeZones

julia> tz1 = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5)

julia> tz2 = VariableTimeZone("America/Winnipeg", tz1.transitions[1:1], nothing)
America/Winnipeg (UTC-06:28:36)

julia> tz3 = deepcopy(tz1)
America/Winnipeg (UTC-6/UTC-5)

julia> df1 = DataFrame(:tz => tz1, :val => 1);

julia> df2 = DataFrame(:tz => tz2, :val => 2);

julia> df3 = DataFrame(:tz => tz3, :val => 3)

julia> innerjoin(df1, df2, on=:tz, makeunique=true)
0×3 DataFrame

julia> leftjoin(df1, df2, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz                             │ val   │ val_1   │
│     │ VariableTimeZone               │ Int64 │ Int64?  │
├─────┼────────────────────────────────┼───────┼─────────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5) │ 1     │ missing │

julia> outerjoin(df1, df2, on=:tz, makeunique=true)
2×3 DataFrame
│ Row │ tz                              │ val     │ val_1   │
│     │ VariableTimeZone                │ Int64?  │ Int64?  │
├─────┼─────────────────────────────────┼─────────┼─────────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5)  │ 1       │ missing │
│ 2   │ America/Winnipeg (UTC-06:28:36) │ missing │ 2       │

julia> innerjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz                             │ val   │ val_1 │
│     │ VariableTimeZone               │ Int64 │ Int64 │
├─────┼────────────────────────────────┼───────┼───────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5) │ 1     │ 3     │

julia> leftjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz                             │ val   │ val_1  │
│     │ VariableTimeZone               │ Int64 │ Int64? │
├─────┼────────────────────────────────┼───────┼────────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5) │ 1     │ 3      │

julia> outerjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz                             │ val    │ val_1  │
│     │ VariableTimeZone?              │ Int64? │ Int64? │
├─────┼────────────────────────────────┼────────┼────────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5) │ 1      │ 3      │

julia> leftjoin(df1, df3, on=:tz, makeunique=true).tz[1] === tz1 !== tz3
true

julia> innerjoin(df1, df3, on=:tz, makeunique=true).tz[1] === tz1 !== tz3
true
omus commented 4 years ago

Will merge after CI completes