ssfrr opened this issue 1 year ago
Pair{Symbol} is Pair{Symbol, T} where {T}.
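You can confirm this in the REPL (this is standard Julia behavior, nothing specific to this package):

julia> Pair{Symbol} === (Pair{Symbol, T} where {T})
true

julia> isconcretetype(Pair{Symbol})
false

So inferring fn(x) as Pair{Symbol} means the compiler knows the first type parameter but not the second, and the result is abstract.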
Mmm... 🤔, it's not obvious to me that the type instability here is the problem. Are you sure this is the method of hash_foreach that is the issue when profiling?
From what I can tell, there is a type instability when iterating over columns. This makes sense, since the array of columns is inherently type-unstable (the columns have varying eltypes). I would have assumed that most of the time is spent in the hash_foreach call that occurs within each column. It would be more of an issue if the per-column hash_foreach were type-unstable, but it occurs after a function barrier, so I would have guessed that it is type-stable.
Have you profiled it for larger DataFrames? It could be that the real bottleneck wouldn't show up without more rows, because of what I'm describing above.
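(For context, here is a minimal sketch of the function-barrier idiom being described, with hypothetical names rather than the actual StableHashTraits internals:)

# The loop over columns is inherently type-unstable because the columns have
# different eltypes, but each column is hashed behind a function barrier, so
# the inner loop can specialize on the concrete eltype.
hash_element(h, x) = hash(x, h)

function hash_column(col)        # function barrier: `col` is concrete here
    h = zero(UInt)
    for x in col                 # type-stable inner loop
        h = hash_element(h, x)
    end
    return h
end

function hash_table(cols)
    h = zero(UInt)
    for col in cols              # type-unstable, but only once per column
        h = hash(hash_column(col), h)
    end
    return h
end

hash_table([rand(Int, 100), string.(1:100)])  # columns with mixed eltypes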
The execution time seems to scale linearly with the number of elements for these tests, so I'm not just measuring initial overhead.
I split out the hashing for each column:
julia> times = DataFrame(column=names(a), type=[eltype(a[!, n]) for n in names(a)], dur=map((n->@elapsed stable_hash(a[1:500, n])), names(a)))
17×3 DataFrame
 Row │ column                  type                               dur
     │ String                  Type                               Float64
─────┼────────────────────────────────────────────────────────────────────────
   1 │ recording               UUID                               0.021891
   2 │ id                      UUID                               0.0217685
   3 │ span                    TimeSpan                           0.0771413
   4 │ display_icon_shortcode  Union{Missing, String}             0.00757008
   5 │ author_person           Union{Missing, UUID}               0.00758817
   6 │ author_process          Union{Missing, UUID}               0.0217512
   7 │ channels                Union{Missing, SubArray{String, …  0.00758956
   8 │ inserted_at             ZonedDateTime                      0.36969
   9 │ inserted_by             Union{Missing, UUID}               0.0220694
  10 │ last_updated_at         ZonedDateTime                      0.372024
  11 │ last_updated_by         Union{Missing, UUID}               0.0219271
  12 │ display_name            Union{Missing, String}             0.00756155
  13 │ rule                    Union{Missing, String}             0.00398748
  14 │ signal                  Union{Missing, String}             0.0039855
  15 │ cause                   Union{Missing, String}             0.00754496
  16 │ source_name             Union{Missing, String}             0.00397673
  17 │ source_id               UUID                               0.0217547
julia> barplot(times.column, times.dur)
                           ┌                                        ┐
                 recording ┤■■ 0.021891
                        id ┤■■ 0.0217685
                      span ┤■■■■■■ 0.0771413
    display_icon_shortcode ┤■ 0.00757008
             author_person ┤■ 0.00758817
            author_process ┤■■ 0.0217512
                  channels ┤■ 0.00758956
               inserted_at ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.36969
               inserted_by ┤■■ 0.0220694
           last_updated_at ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.372024
           last_updated_by ┤■■ 0.0219271
              display_name ┤■ 0.00756155
                      rule ┤ 0.00398748
                    signal ┤ 0.0039855
                     cause ┤■ 0.00754496
               source_name ┤ 0.00397673
                 source_id ┤■■ 0.0217547
                           └                                        ┘
So ZonedDateTime is definitely a problem!
I think there are two things going on here. One is that ZonedDateTime has a timezone::TimeZone field (which is abstract, but that's not the biggest problem right now). One common subtype is VariableTimeZone, which is actually a pretty big object that contains a whole list of all the transitions (daylight savings, maybe leap seconds). Usually, multiple ZonedDateTimes with the same time zone share a single instance, so it doesn't actually take up all that much memory.

You can probably see the problem, though: when we compute a hash, we iterate through that transitions table every time we hash a ZonedDateTime. One solution is to define a hash_method for VariableTimeZone that just hashes the name, under the assumption that there won't be multiple different time zones with the same name.
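You can get a feel for the size of that shared object directly (exact numbers will vary with your tzdata version):

using TimeZones
tz = tz"Europe/Paris"
typeof(tz)              # VariableTimeZone
length(tz.transitions)  # hundreds of entries, every one visited per hash
Base.summarysize(tz)    # total bytes retained by this one shared object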
To save space I've cut out the first run of each function call, which triggers compilation.

julia> using Dates, TimeZones, StableHashTraits

julia> zdts = [ZonedDateTime(2000, tz"Europe/Paris") for _ in 1:100];
julia> dts = [DateTime(2000) for _ in 1:100];
julia> @time stable_hash(dts);
0.007940 seconds (50.91 k allocations: 3.585 MiB)
julia> @time stable_hash(zdts);
1.914683 seconds (31.52 M allocations: 2.109 GiB, 2.62% gc time)
julia> StableHashTraits.hash_method(tz::VariableTimeZone) = (FnHash(qualified_name), FnHash(tz->tz.name));
julia> @time stable_hash(zdts);
0.019814 seconds (187.31 k allocations: 12.840 MiB)
So that's a huge speedup, and I was feeling pretty good about myself at this point. Unfortunately, for reasonably large tables this is still far too slow. To calibrate an upper bound on how fast hashing should be, I looked at a vector of integers:
julia> using BenchmarkTools
julia> is = collect(1:100);
julia> is_elapsed = @belapsed stable_hash(is)
4.6833e-5
julia> dts_elapsed = @belapsed stable_hash(dts)
0.003659833
julia> zdts_elapsed = @belapsed stable_hash(zdts)
0.011046416
julia> dts_elapsed/is_elapsed
78.14645655840967
julia> zdts_elapsed/is_elapsed
235.86821258514297
So a regular DateTime is ~78x slower to hash than an integer, and a ZonedDateTime is ~236x slower. And what is a DateTime under the hood?
julia> dump(DateTime(2000))
DateTime
  instant: Dates.UTInstant{Dates.Millisecond}
    periods: Dates.Millisecond
      value: Int64 63082368000000
It's an integer.
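Which suggests a fast path is at least conceivable; a DateTime vector can even be viewed as Int64s without copying (just an illustration, not a proposed API):

using Dates
dts = [DateTime(2000) + Day(i) for i in 1:100]
ints = reinterpret(Int64, dts)  # zero-copy view of the underlying Int64 values
hash(ints)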
I've got to wrap up for right now, but I suspect that either:

1. the getfield struct machinery isn't as type-stable as we thought, or
2. disambiguating the wrapper types (e.g. hashing their qualified names) is too expensive.

If #1 is the problem, we may need to look into a @generated function to unroll the field lookup. If #2 is the problem, we need to figure out a cheaper mechanism to disambiguate wrapper types.
Zooming out a bit: perhaps we just shouldn't be hashing tables with complicated Julia types in them at all, and we can hash the underlying Arrow data structures instead.
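Roughly this kind of thing (a sketch using Arrow.tobuffer; a real version would need to pin the serialization options so the bytes stay stable across versions):

using Arrow, DataFrames, SHA
df = DataFrame(x = 1:3, y = ["a", "b", "c"])
buf = Arrow.tobuffer(df)     # serialize the table to Arrow IPC bytes
digest = sha256(take!(buf))  # hash the raw bytes instead of the Julia objects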
> You can probably see the problem, though: when we compute a hash, we iterate through that transitions table every time we hash a ZonedDateTime. One solution is to define a hash_method for VariableTimeZone that just hashes the name, under the assumption that there won't be multiple different time zones with the same name.
Base.hash(::VariableTimeZone) does exactly this. I need to consider the use cases for hash_method to know if we should mirror that exact behaviour.

I think a risk of the more generic approach is that it will sometimes be hard for it to beat a specialized method. It seems pretty trivial to make a faster method for this particular type, so maybe that is the right approach here?
I have been thinking that an IdDict should be used per call to stable_hash to avoid recomputing hashes for the same object. (Since objects can change while keeping the same id, it wouldn't make sense to cache across calls.) That might address the time zone issue you're seeing. It's possible a similar approach would help for the type qualifiers.
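Roughly what I have in mind (hypothetical names, just to sketch the shape of it):

expensive_hash(x) = hash(x)  # stand-in for the real recursive hashing

# One cache per top-level stable_hash call, keyed by object identity.
function cached_hash!(cache::IdDict{Any,UInt}, x)
    get!(cache, x) do
        expensive_hash(x)
    end
end

cache = IdDict{Any,UInt}()
big = collect(1:10_000)    # stand-in for a large shared object (e.g. a time zone)
cached_hash!(cache, big)   # computed once...
cached_hash!(cache, big)   # ...then reused via objectid, no recomputation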
I've been experimenting with benchmarking/optimizing StableHashTraits. You can find a fast branch of the repo at dfl/optimize-1. The TL;DR: compared to a raw call to e.g. sha256, stable_hash was ~100-1000x slower before this branch and ~1-10x slower after it.

I'm breaking that up into a series of smaller, more digestible PRs. You can find the first, which simply establishes benchmarks for some basic cases, in #28.

The last PR, #30, moves a number of computations to compile time and speeds up stable_hash(x; alg) so that it is roughly 1-10x slower than alg(x) for the basic data types I'm benchmarking.
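(For reference, a ratio like that can be measured along these lines; the actual benchmark setup lives in #28 and may differ:)

using BenchmarkTools, SHA, StableHashTraits
data = rand(Int, 10_000)
bytes = collect(reinterpret(UInt8, data))  # same payload as raw bytes
t_raw = @belapsed sha256($bytes)
t_pkg = @belapsed stable_hash($data; alg=sha256)
t_pkg / t_raw  # the "x-times slower than the raw digest" ratio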
Note that none of this actually addresses the original issue raised by the time zone objects: caching turned out not to be very helpful for these benchmarks (but the benchmarks seemed higher priority to me). A future task is to implement the IdDict approach I proposed above and verify that it helps in cases where a large object is referenced in many places.
The final outstanding performance issue should be addressed by #55.
I noticed that one of my data tables was taking forever to hash. I did some profiling, and all the time was in stable_hash_helper and hash_foreach (I was using the VS Code profiler, though, so I don't have the specific methods that were being called). Here's the code that reproduces the issue:
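(A sketch of the kind of table that triggers it — hypothetical column names and types, borrowed from the per-column timing earlier in the thread:)

using DataFrames, TimeZones, UUIDs, StableHashTraits
n = 100
a = DataFrame(recording    = [uuid4() for _ in 1:n],
              inserted_at  = [ZonedDateTime(2000, tz"Europe/Paris") for _ in 1:n],
              display_name = [rand(Bool) ? "name" : missing for _ in 1:n])
@time stable_hash(a)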
On my machine this takes 0.2s for just 100 rows.
I ran it under Cthulhu.jl, and it looks like there's a type instability in hash_foreach. You can see the output here:
So when you call f_x = fn(x), the return type is only inferred as Pair{Symbol}, which is abstract. Honestly, I'm not sure what Pair{Symbol} means, given that Pair usually has 2 type parameters.

I wonder if the combination of calling a passed-in function on a symbol is too much for the compiler to follow. I think it would have to rely on constant propagation to do the field lookup in a type-stable way (it needs to know the symbol), and maybe it can only do that for simpler constructions like an explicit getfield.
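Here's the kind of difference I mean (an assumed toy example, not the actual hash_foreach code):

struct Foo
    a::Int
    b::String
end

f_literal(x) = getfield(x, :a)          # constant symbol: inferred as Int
f_runtime(x, name) = getfield(x, name)  # runtime symbol: abstract result

using InteractiveUtils
@code_warntype f_literal(Foo(1, "x"))      # return type Int64
@code_warntype f_runtime(Foo(1, "x"), :a)  # return type Union{Int64, String}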