ssfrr opened this issue 1 year ago
Pair{Symbol} is Pair{Symbol, T} where {T}.
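You can confirm this in the REPL (this is standard Julia behavior, nothing specific to this package):

julia> Pair{Symbol} === (Pair{Symbol, T} where {T})
true

julia> isconcretetype(Pair{Symbol})
false

So inferring fn(x) as Pair{Symbol} means the compiler knows the first type parameter but not the second, and the result is abstract.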
Mmm... 🤔, it's not obvious to me that the type instability here is the problem. Are you sure this is the method of hash_foreach that is the issue when profiling?
From what I can tell, there is a type instability when iterating over columns. This makes sense, since the array of columns is inherently type-unstable (the columns have varying eltypes). I would have assumed that most of the time is spent in the hash_foreach call that occurs within each column. It would be more of an issue if the per-column hash_foreach were type-unstable, but it occurs after a function barrier, so I would have guessed that it is type-stable.
Have you profiled it for larger DataFrames? It could be that the real bottleneck wouldn't show up without more rows, because of what I'm describing above.
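(For context, here is a minimal sketch of the function-barrier idiom being described, with hypothetical names rather than the actual StableHashTraits internals:)

# The loop over columns is inherently type-unstable because the columns have
# different eltypes, but each column is hashed behind a function barrier, so
# the inner loop can specialize on the concrete eltype.
hash_element(h, x) = hash(x, h)

function hash_column(col)        # function barrier: `col` is concrete here
    h = zero(UInt)
    for x in col                 # type-stable inner loop
        h = hash_element(h, x)
    end
    return h
end

function hash_table(cols)
    h = zero(UInt)
    for col in cols              # type-unstable, but only once per column
        h = hash(hash_column(col), h)
    end
    return h
end

hash_table([rand(Int, 100), string.(1:100)])  # columns with mixed eltypes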
The execution time seems to scale linearly with the number of elements for these tests, so I'm not just measuring initial overhead.
I split out the hashing for each column:
julia> times = DataFrame(column=names(a), type=[eltype(a[!, n]) for n in names(a)], dur=map((n->@elapsed stable_hash(a[1:500, n])), names(a)))
17×3 DataFrame
 Row │ column                  type                               dur
     │ String                  Type                               Float64
─────┼────────────────────────────────────────────────────────────────────────
   1 │ recording               UUID                               0.021891
   2 │ id                      UUID                               0.0217685
   3 │ span                    TimeSpan                           0.0771413
   4 │ display_icon_shortcode  Union{Missing, String}             0.00757008
   5 │ author_person           Union{Missing, UUID}               0.00758817
   6 │ author_process          Union{Missing, UUID}               0.0217512
   7 │ channels                Union{Missing, SubArray{String, …  0.00758956
   8 │ inserted_at             ZonedDateTime                      0.36969
   9 │ inserted_by             Union{Missing, UUID}               0.0220694
  10 │ last_updated_at         ZonedDateTime                      0.372024
  11 │ last_updated_by         Union{Missing, UUID}               0.0219271
  12 │ display_name            Union{Missing, String}             0.00756155
  13 │ rule                    Union{Missing, String}             0.00398748
  14 │ signal                  Union{Missing, String}             0.0039855
  15 │ cause                   Union{Missing, String}             0.00754496
  16 │ source_name             Union{Missing, String}             0.00397673
  17 │ source_id               UUID                               0.0217547
julia> barplot(times.column, times.dur)
                           ┌                                        ┐
                 recording ┤■■ 0.021891
                        id ┤■■ 0.0217685
                      span ┤■■■■■■ 0.0771413
    display_icon_shortcode ┤■ 0.00757008
             author_person ┤■ 0.00758817
            author_process ┤■■ 0.0217512
                  channels ┤■ 0.00758956
               inserted_at ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.36969
               inserted_by ┤■■ 0.0220694
           last_updated_at ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.372024
           last_updated_by ┤■■ 0.0219271
              display_name ┤■ 0.00756155
                      rule ┤ 0.00398748
                    signal ┤ 0.0039855
                     cause ┤■ 0.00754496
               source_name ┤ 0.00397673
                 source_id ┤■■ 0.0217547
                           └                                        ┘
So ZonedDateTime is definitely a problem!
I think there are two things going on here. One is that ZonedDateTime has a timezone::TimeZone field (which is abstract, but that's not the biggest problem right now). One common subtype is VariableTimeZone, which is actually a pretty big object that contains a whole list of all the transitions (daylight savings, maybe leap seconds). Usually, multiple ZonedDateTimes with the same time zone share a single instance, so it doesn't actually take up all that much memory.

You can probably see the problem, though: when we compute a hash, we iterate through that transitions table every time we hash a ZonedDateTime. One solution is to define a hash_method for VariableTimeZone that just hashes the name, under the assumption that there won't be multiple different time zones with the same name.
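You can get a feel for the size of that shared object directly (exact numbers will vary with your tzdata version):

using TimeZones
tz = tz"Europe/Paris"
typeof(tz)              # VariableTimeZone
length(tz.transitions)  # hundreds of entries, every one visited per hash
Base.summarysize(tz)    # total bytes retained by this one shared object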
To save space I've cut out the first run of each function call, which triggers compilation.

julia> using Dates, TimeZones, StableHashTraits

julia> zdts = [ZonedDateTime(2000, tz"Europe/Paris") for _ in 1:100];
julia> dts = [DateTime(2000) for _ in 1:100];
julia> @time stable_hash(dts);
0.007940 seconds (50.91 k allocations: 3.585 MiB)
julia> @time stable_hash(zdts);
1.914683 seconds (31.52 M allocations: 2.109 GiB, 2.62% gc time)
julia> StableHashTraits.hash_method(tz::VariableTimeZone) = (FnHash(qualified_name), FnHash(tz->tz.name));
julia> @time stable_hash(zdts);
0.019814 seconds (187.31 k allocations: 12.840 MiB)
So that's a huge speedup, and I was feeling pretty good about myself at this point. Unfortunately, for reasonably large tables this is still far too slow. To calibrate an upper bound on how fast hashing should be, I looked at a vector of integers:
julia> using BenchmarkTools
julia> is = collect(1:100);
julia> is_elapsed = @belapsed stable_hash(is)
4.6833e-5
julia> dts_elapsed = @belapsed stable_hash(dts)
0.003659833
julia> zdts_elapsed = @belapsed stable_hash(zdts)
0.011046416
julia> dts_elapsed/is_elapsed
78.14645655840967
julia> zdts_elapsed/is_elapsed
235.86821258514297
So a regular DateTime is ~78x slower to hash than an integer, and a ZonedDateTime is ~236x slower. And what is a DateTime under the hood?
julia> dump(DateTime(2000))
DateTime
  instant: Dates.UTInstant{Dates.Millisecond}
    periods: Dates.Millisecond
      value: Int64 63082368000000
It's an integer.
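Which suggests a fast path is at least conceivable; a DateTime vector can even be viewed as Int64s without copying (just an illustration, not a proposed API):

using Dates
dts = [DateTime(2000) + Day(i) for i in 1:100]
ints = reinterpret(Int64, dts)  # zero-copy view of the underlying Int64 values
hash(ints)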
I've got to wrap up for right now, but I suspect that either:

1. the getfield struct machinery isn't as type-stable as we thought, or
2. disambiguating the wrapper types (e.g. hashing their qualified names) is too expensive.

If #1 is the problem, we may need to look into a @generated function to unroll the field lookup. If #2 is the problem, we need to figure out a cheaper mechanism to disambiguate wrapper types.
Zooming out a bit: perhaps we just shouldn't be hashing tables with complicated Julia types in them at all, and we can hash the underlying Arrow data structures instead.
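Roughly this kind of thing (a sketch using Arrow.tobuffer; a real version would need to pin the serialization options so the bytes stay stable across versions):

using Arrow, DataFrames, SHA
df = DataFrame(x = 1:3, y = ["a", "b", "c"])
buf = Arrow.tobuffer(df)     # serialize the table to Arrow IPC bytes
digest = sha256(take!(buf))  # hash the raw bytes instead of the Julia objects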
> You can probably see the problem, though: when we compute a hash, we iterate through that transitions table every time we hash a ZonedDateTime. One solution is to define a hash_method for VariableTimeZone that just hashes the name, under the assumption that there won't be multiple different time zones with the same name.
Base.hash(::VariableTimeZone) does exactly this. I need to consider the use cases for hash_method to know if we should mirror that exact behaviour.

I think a risk of the more generic approach is that it will sometimes be hard for it to beat a specialized method. It seems pretty trivial to make a faster method for this particular type, so maybe that is the right approach here?
I have been thinking that an IdDict should be used per call to stable_hash to avoid recomputing hashes for the same object. (Since objects can change while keeping the same id, it wouldn't make sense to cache across calls.) That might address the time zone issue you're seeing. It's possible a similar approach would help for the type qualifiers.
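Roughly what I have in mind (hypothetical names, just to sketch the shape of it):

expensive_hash(x) = hash(x)  # stand-in for the real recursive hashing

# One cache per top-level stable_hash call, keyed by object identity.
function cached_hash!(cache::IdDict{Any,UInt}, x)
    get!(cache, x) do
        expensive_hash(x)
    end
end

cache = IdDict{Any,UInt}()
big = collect(1:10_000)    # stand-in for a large shared object (e.g. a time zone)
cached_hash!(cache, big)   # computed once...
cached_hash!(cache, big)   # ...then reused via objectid, no recomputation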
I've been experimenting with benchmarking/optimizing StableHashTraits. You can find a fast branch of the repo at dfl/optimize-1. The TL;DR: compared to a raw call to e.g. sha256, stable_hash was ~100-1000x slower before this branch and ~1-10x slower after it.

I'm breaking that up into a series of smaller, more digestible PRs. You can find the first, which simply establishes benchmarks for some basic cases, in #28.

The last PR, #30, moves a number of computations to compile time and speeds up stable_hash(x; alg) so that it is roughly 1-10x slower than alg(x) for the basic data types I'm benchmarking.
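(For reference, a ratio like that can be measured along these lines; the actual benchmark setup lives in #28 and may differ:)

using BenchmarkTools, SHA, StableHashTraits
data = rand(Int, 10_000)
bytes = collect(reinterpret(UInt8, data))  # same payload as raw bytes
t_raw = @belapsed sha256($bytes)
t_pkg = @belapsed stable_hash($data; alg=sha256)
t_pkg / t_raw  # the "x-times slower than the raw digest" ratio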
Note that none of this actually addresses the original issue raised by the time zone objects: caching turned out not to be very helpful for these benchmarks (but the benchmarks seemed higher priority to me). A future task is to implement the IdDict approach I proposed above and verify that it helps in cases where a large object is referenced in many places.
The final outstanding performance issue should be addressed by #55.
I noticed that one of my data tables was taking forever to hash. I did some profiling, and all the time was in stable_hash_helper and hash_foreach (I was using the VS Code profiler, though, so I don't have the specific methods that were being called). Here's the code that reproduces the issue:
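(A sketch of the kind of table that triggers it — hypothetical column names and types, borrowed from the per-column timing earlier in the thread:)

using DataFrames, TimeZones, UUIDs, StableHashTraits
n = 100
a = DataFrame(recording    = [uuid4() for _ in 1:n],
              inserted_at  = [ZonedDateTime(2000, tz"Europe/Paris") for _ in 1:n],
              display_name = [rand(Bool) ? "name" : missing for _ in 1:n])
@time stable_hash(a)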
On my machine this takes 0.2s for just 100 rows.
I ran it under Cthulhu.jl, and it looks like there's a type instability in hash_foreach. You can see the output here:
So when you call f_x = fn(x), the return type is only inferred as Pair{Symbol}, which is abstract. Honestly, I'm not sure what Pair{Symbol} means, given that Pair usually has 2 type parameters.

I wonder if the combination of calling a passed-in function on a symbol is too much for the compiler to follow. I think it would have to rely on constant propagation to do the field lookup in a type-stable way (it needs to know the symbol), and maybe it can only do that for simpler constructions like an explicit getfield.
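Here's the kind of difference I mean (an assumed toy example, not the actual hash_foreach code):

struct Foo
    a::Int
    b::String
end

f_literal(x) = getfield(x, :a)          # constant symbol: inferred as Int
f_runtime(x, name) = getfield(x, name)  # runtime symbol: abstract result

using InteractiveUtils
@code_warntype f_literal(Foo(1, "x"))      # return type Int64
@code_warntype f_runtime(Foo(1, "x"), :a)  # return type Union{Int64, String}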