Closed oxinabox closed 4 years ago
I agree we can probably make this more performant, but this approach is definitely on the reckless side. If anything, we should actually ignore the name and just look at the contents.
this approach is definitely on the reckless side
How often do you expect there to be multiple different time zones with the same name?
TimeZone names used in practice are almost always from the standard set
For isequal we are requiring an equal name plus equal content, so for hash, hashing either of those is enough.
Equal content will have far more collisions, as there are only a bit more than 24 basic offsets. Might be 100, I guess, with the different transitions.
There are a few tzdata-named time zones like "MDT" that could be defined by users. There are also plans to experiment with time zones without pre-computed transitions, which would definitely have hash collisions in that case. At a minimum we need to include the type in the hash here.
I think a less reckless way would be to precompute the hashes of the tzdata time zones, which would eliminate the overhead and still support custom VariableTimeZones.
Hash collisions are okay as long as they're not extremely common. It's one additional isequal call.
At a minimum we need to include the type in the hash here.
Why?
Merging #281 into master will increase coverage by 1.16%. The diff coverage is 100.00%.
@@            Coverage Diff             @@
##           master     #281      +/-   ##
==========================================
+ Coverage   92.41%   93.58%   +1.16%
==========================================
  Files          30       30
  Lines        1398     1527     +129
==========================================
+ Hits         1292     1429     +137
+ Misses        106       98       -8
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/types/variabletimezone.jl | 100.00% <100.00%> (+9.09%) | :arrow_up: |
| src/build.jl | 100.00% <0.00%> (ø) | |
| src/rounding.jl | 100.00% <0.00%> (ø) | |
| src/TimeZones.jl | 100.00% <0.00%> (ø) | |
| src/adjusters.jl | 100.00% <0.00%> (ø) | |
| src/interpret.jl | 100.00% <0.00%> (ø) | |
| src/conversions.jl | 100.00% <0.00%> (ø) | |
| src/tzdata/archive.jl | 100.00% <0.00%> (ø) | |
| src/types/fixedtimezone.jl | 100.00% <0.00%> (ø) | |
| src/utils.jl | 97.56% <0.00%> (+0.06%) | :arrow_up: |
| ... and 18 more | | |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf53361...5e05f2b. Read the comment docs.
I'll explain some of my points in greater detail. Let's start off with the rule which under no circumstances should we break:
isequal(x, y) must imply that hash(x) == hash(y)
This is taken directly from the help for isequal. We shouldn't violate this rule (in reference to: https://github.com/JuliaTime/TimeZones.jl/pull/281#issuecomment-686763671).
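As a hedged illustration of that contract (using a made-up wrapper type, not anything from TimeZones.jl): whenever a type defines custom equality, its hash must be defined over the same information, so that equal values hash identically.

```julia
# Hypothetical example type: a case-insensitive string wrapper.
struct CIString
    s::String
end

# Two CIStrings are equal when they match ignoring case...
Base.isequal(a::CIString, b::CIString) = isequal(lowercase(a.s), lowercase(b.s))
Base.:(==)(a::CIString, b::CIString) = isequal(a, b)

# ...so the hash must depend only on the case-folded content;
# otherwise isequal(x, y) would not imply hash(x) == hash(y).
Base.hash(x::CIString, h::UInt) = hash(lowercase(x.s), h)

@assert isequal(CIString("UTC"), CIString("utc"))
@assert hash(CIString("UTC")) == hash(CIString("utc"))
```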
Next,
Ideally, hashing is value-based and different representations have the same hash (e.g. ranges and vectors hash the same if the values they include are identical)
For VariableTimeZone this would mean another representation of the same time zone (possibly one that dynamically computes transitions) should ideally hash the same. I bring this point up just as a consideration. Using strictly the name would work here across different types, but two time zones that are the same with differing names would fail this check. To me this suggests that names should probably be ignored for the purposes of hashing (including the names in the transitions).
Finally, definitely an unofficial rule:
Dissimilar types should not hash to the same value (e.g. Int vs. Char)
With this particular PR we're only hashing the string, which currently results in hash("UTC") == hash(tz"UTC"), which isn't ideal. We had a similar problem with hashing of ZonedDateTime, and there we added a seed to avoid collisions between hash(::DateTime) and hash(::ZonedDateTime). This is why I brought up including the type in the hash.
isequal(x, y) must imply that hash(x) == hash(y)
Under this rule it is perfectly okay to have hash(x) == hash(y) and !isequal(x, y). In fact, this is inevitable if there are more unique values than there are UInts.
Ideally, hashing is value-based and different representations have the same hash (e.g. ranges and vectors hash the same if the values they include are identical)
Not sure where you found this, as it's not in the help for isequal or hash in my Julia. This only applies when isequal(x, y) is true; it exists so that such values are recorded as identical keys in a dictionary. I definitely think we should be able to store time zones with different names but the same transitions as different keys in a dictionary.
With this particular PR we're only hashing the string which currently results in hash("UTC") == hash(tz"UTC") which isn't ideal. We had a similar problem with hashing of ZonedDateTime and there we added a seed to avoid collisions between hash(::DateTime) and hash(::ZonedDateTime). This is why I brought up including the type in the hash.
Putting the type in the hash seems fine to me then.
For VariableTimeZone this would mean another representation of the same time zone (possibly one that dynamically computes transitions) should ideally hash to be the same.
They are not the same value. isequal defines what we consider to be the same value, and isequal says that you need to have the same name.
I can imagine that we might want to redefine isequal that way, but it is out of scope for this PR.
Ranges and vectors with the same value are considered equal by julia.
julia> isequal(1:4, [1,2,3,4])
true
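And, per the rule above, their hashes agree as well; Base goes out of its way to make range and array hashing consistent:

```julia
# Because isequal(1:4, [1, 2, 3, 4]) is true, Base ensures the hashes match too.
@assert isequal(1:4, [1, 2, 3, 4])
@assert hash(1:4) == hash([1, 2, 3, 4])
```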
Your second rule is just another way of saying the first.
With this particular PR we're only hashing the string which currently results in hash("UTC") == hash(tz"UTC") which isn't ideal
Yeah, I didn't consider it a problem, but if you do, fair enough.
Under this rule it is perfectly okay to have hash(x) == hash(y) and !isequal(x, y)
You are correct here. I was thinking about why I thought (hash(x) == hash(y)) == isequal(x, y), and I believe it is because DataFrames used to (and possibly still does) use hash for joining, which they internally label as isequal functions. I have usually made (hash(x) == hash(y)) == isequal(x, y) hold because of this. As VariableTimeZone isn't likely to be added as a column in a DataFrame, it probably doesn't matter in this case.
I definitely think we should be able to store time zones with different names but the same transitions as different keys in a dictionary.
What about time zones with the same name but different transitions being stored in the same dictionary?
What about time zones with the same name but different transitions being stored in the same dictionary?
Those can be stored as different keys no problem, since isequal will be false. Collisions are fine; they are disambiguated by a cheap isequal check.
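To make that concrete, here is a contrived type (purely illustrative) whose hash is deliberately constant; a Dict still keeps distinct keys apart because lookup falls back on isequal within a bucket:

```julia
# Contrived key type whose hash ignores the contents entirely.
struct BadHashKey
    x::Int
end

Base.hash(::BadHashKey, h::UInt) = hash(:BadHashKey, h)  # every instance collides
Base.isequal(a::BadHashKey, b::BadHashKey) = a.x == b.x
Base.:(==)(a::BadHashKey, b::BadHashKey) = a.x == b.x

d = Dict(BadHashKey(1) => 'a', BadHashKey(2) => 'b')

@assert hash(BadHashKey(1)) == hash(BadHashKey(2))  # total hash collision...
@assert length(d) == 2                              # ...yet both keys are stored
@assert d[BadHashKey(2)] == 'b'
```

The cost of a collision is that lookup degrades toward a linear scan of the colliding keys, which is why collisions should be rare but need not be impossible.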
I believe it is because DataFrames used to (and possibly still does) use hash for joining, which they internally label as isequal functions
This is a serious bug if it exists, but I don't believe it does (anymore?). Looking at their code, they seem to use isequal to disambiguate, which is correct: https://github.com/JuliaData/DataFrames.jl/blob/fb4e184f3b3997da8680668fbf7fcc789811eb18/src/dataframerow/utils.jl#L126
Collisions are fine; they are disambiguated by a cheap isequal check.
TIL. I definitely did not know that hash collisions had a disambiguation check. This does change things:
julia> using TimeZones
julia> tz1 = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5)
julia> tz2 = VariableTimeZone("America/Winnipeg", tz1.transitions[1:1], nothing)
America/Winnipeg (UTC-06:28:36)
julia> Dict(tz1 => 'a', tz2 => 'b')
Dict{VariableTimeZone,Char} with 2 entries:
tz"America/Winnipeg" => 'a'
VariableTimeZone("America/Winnipeg", ...) => 'b'
julia> hash(tz1) == hash(tz2)
true
As we haven't done so yet I'll post a benchmark. Setup:
julia> using TimeZones, BenchmarkTools
julia> tz = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5)
Current master
julia> @btime hash($tz)
14.333 μs (187 allocations: 5.84 KiB)
0xe223c497ca74d779
This PR
julia> @btime hash($tz)
17.395 ns (0 allocations: 0 bytes)
0x99df18304ac43999
I'll fiddle with this implementation and DataFrames to try and ensure we're not missing anything. If those tests work out I think we can proceed with this.
Nothing strange when working with DataFrames
julia> using DataFrames, TimeZones
julia> tz1 = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5)
julia> tz2 = VariableTimeZone("America/Winnipeg", tz1.transitions[1:1], nothing)
America/Winnipeg (UTC-06:28:36)
julia> tz3 = deepcopy(tz1)
America/Winnipeg (UTC-6/UTC-5)
julia> df1 = DataFrame(:tz => tz1, :val => 1);
julia> df2 = DataFrame(:tz => tz2, :val => 2);
julia> df3 = DataFrame(:tz => tz3, :val => 3);
julia> innerjoin(df1, df2, on=:tz, makeunique=true)
0×3 DataFrame
julia> leftjoin(df1, df2, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz │ val │ val_1 │
│ │ VariableTimeZone │ Int64 │ Int64? │
├─────┼────────────────────────────────┼───────┼─────────┤
│ 1 │ America/Winnipeg (UTC-6/UTC-5) │ 1 │ missing │
julia> outerjoin(df1, df2, on=:tz, makeunique=true)
2×3 DataFrame
│ Row │ tz │ val │ val_1 │
│ │ VariableTimeZone │ Int64? │ Int64? │
├─────┼─────────────────────────────────┼─────────┼─────────┤
│ 1 │ America/Winnipeg (UTC-6/UTC-5) │ 1 │ missing │
│ 2 │ America/Winnipeg (UTC-06:28:36) │ missing │ 2 │
julia> innerjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz │ val │ val_1 │
│ │ VariableTimeZone │ Int64 │ Int64 │
├─────┼────────────────────────────────┼───────┼───────┤
│ 1 │ America/Winnipeg (UTC-6/UTC-5) │ 1 │ 3 │
julia> leftjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz │ val │ val_1 │
│ │ VariableTimeZone │ Int64 │ Int64? │
├─────┼────────────────────────────────┼───────┼────────┤
│ 1 │ America/Winnipeg (UTC-6/UTC-5) │ 1 │ 3 │
julia> outerjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz │ val │ val_1 │
│ │ VariableTimeZone? │ Int64? │ Int64? │
├─────┼────────────────────────────────┼────────┼────────┤
│ 1 │ America/Winnipeg (UTC-6/UTC-5) │ 1 │ 3 │
julia> leftjoin(df1, df3, on=:tz, makeunique=true).tz[1] === tz1 !== tz3
true
julia> innerjoin(df1, df3, on=:tz, makeunique=true).tz[1] === tz1 !== tz3
true
Will merge after CI completes
I think the gain we get from reducing collisions, in the case of two time zones with the same name but different transitions and cutoffs, is not worth the extra time that hashing takes.