JuliaAPlavin / FlexiJoins.jl

Performance vs DataFrames.jl #3

Open · opened by aplavin 1 year ago

aplavin commented 1 year ago

In GitLab by @bkamins on Oct 8, 2022, 13:35

I ran the following quick test:

using DataFrames, Random  # Random provides shuffle/shuffle!
import FlexiJoins
df1 = DataFrame(a = 1:1_000_000);
df2 = shuffle(df1);
shuffle!(df1);
df1.id1 = axes(df1, 1);
df2.id2 = axes(df2, 1);
@time innerjoin(df1, df2, on=:a); # after compilation, around 0.06 seconds
@time FlexiJoins.innerjoin((df1, df2), FlexiJoins.by_key(:a)); # around 0.4 seconds, with more allocations

(I also checked joining on string columns and the results are similar)

Therefore I have one thought: maybe the cost is that too much data is converted?

I looked at https://gitlab.com/aplavin/FlexiJoins.jl/-/blob/master/src/FlexiJoins.jl#L44 and my comments for your consideration are:

- DataFrame in the signature is too restrictive, I recommend AbstractDataFrame (so that views can also be passed for joins).
- I think it would be more efficient to only pass columns on which you join and some indicator column (like my id1 and id2 columns above) and later use this indicator column to compose the final data frame (this would probably avoid unnecessary movement of data).

I also have a question regarding output. Here is an MWE:

julia> df1 = DataFrame(a=1:2, id1=1:2)
2×2 DataFrame
 Row │ a      id1
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2

julia> df2 = DataFrame(a=[2, 1], id2=1:2)
2×2 DataFrame
 Row │ a      id2
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     1      2

julia> innerjoin(df1, df2, on=:a)
2×3 DataFrame
 Row │ a      id1    id2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      2      1
   2 │     1      1      2

julia> FlexiJoins.innerjoin((df1, df2), FlexiJoins.by_key(:a))
2×4 DataFrame
 Row │ a      id1    a_1    id2
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      1      1      2
   2 │     2      2      2      1

My question is: why do you repeat column a (as a_1) in the output? (I understand it might be needed in more flexible join models, but in an equality join the values should be the same, right?)

Finally: what is the guaranteed row order (if there is one) of the produced output (i.e. does it follow the left table?). Related to this, I noticed that there is a significant difference in timing between FlexiJoins.jl and DataFrames.jl when working with unbalanced tables:

julia> using BenchmarkTools

julia> df1 = DataFrame(a = 1:10);

julia> df2 = DataFrame(a=1:10^8);

julia> @benchmark innerjoin($df1, $df2, on=:a)
BenchmarkTools.Trial: 45 samples with 1 evaluation.
 Range (min … max):  110.431 ms … 123.599 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     113.138 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   113.592 ms ±   2.453 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▃   ▁█▁  ▁▁
  ▄▄▁█▄▄▄███▄▁██▁▁▄▇▄▇▇▄▄▄▄▄▁▄▁▁▁▁▁▁▁▁▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  110 ms           Histogram: frequency by time          124 ms <

 Memory estimate: 9.94 KiB, allocs estimate: 148.

julia> @benchmark innerjoin($df2, $df1, on=:a)
BenchmarkTools.Trial: 44 samples with 1 evaluation.
 Range (min … max):  110.546 ms … 138.237 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     113.484 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   114.483 ms ±   4.388 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁ ▃▆▁█▃▃▁
  ▄█▇███████▇▄▄▁▄▁▁▄▁▁▁▁▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  111 ms           Histogram: frequency by time          138 ms <

 Memory estimate: 9.94 KiB, allocs estimate: 148.

julia> @benchmark FlexiJoins.innerjoin(($df1, $df2), FlexiJoins.by_key(:a))
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  2.214 s …   2.245 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.217 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.225 s ± 16.902 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █    █                                                  █
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.21 s         Histogram: frequency by time        2.24 s <

 Memory estimate: 9.75 KiB, allocs estimate: 133.

julia> @benchmark FlexiJoins.innerjoin(($df2, $df1), FlexiJoins.by_key(:a))
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  2.257 s …   2.309 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.270 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.278 s ± 27.112 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █             █                                         █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.26 s         Histogram: frequency by time        2.31 s <

 Memory estimate: 9.75 KiB, allocs estimate: 133.

Sorry for the long list - I am reporting the things I noticed when comparing the packages. I hope it will be useful.

aplavin commented 1 year ago

In GitLab by @bkamins on Oct 8, 2022, 17:03

One more minor thing. At some point it would be also good to add metadata propagation for joins working on Tables.jl tables that support metadata (or at least for data frames). Again - if you want I can help you with this.

aplavin commented 1 year ago

Re performance: Yes, it looks like DataFrames.jl can join data frames by a column more efficiently than FlexiJoins. The difference seems smaller when the join key is drawn from the whole integer range (like rand(Int, 10^6)), or when the "groups" sharing the same join key contain more than a single row. This isn't related to the dataframe -> array conversion. Profiling shows that the majority of the time is spent in Dict operations related to hashing. By default, FlexiJoins uses a hash join for by_key: first a dict is built for one side (https://gitlab.com/aplavin/FlexiJoins.jl/-/blob/master/src/bykey.jl#L59-87), and then it is looked up for all elements from the other side (https://gitlab.com/aplavin/FlexiJoins.jl/-/blob/master/src/bykey.jl#L89-94). More efficient strategies are likely possible, but they are not implemented yet.
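
For illustration, here's a minimal standalone sketch of that strategy (hashjoin_indices is a hypothetical helper, not the actual FlexiJoins code linked above): group the indices of one side by key in a Dict, then probe it with each key from the other side.

function hashjoin_indices(keys_L, keys_R)
    groups = Dict{eltype(keys_R), Vector{Int}}()
    for (j, k) in enumerate(keys_R)
        push!(get!(groups, k, Int[]), j)   # group right-side indices by key
    end
    out = Tuple{Int, Int}[]
    for (i, k) in enumerate(keys_L)
        for j in get(groups, k, Int[])     # probe: all right-side matches for this key
            push!(out, (i, j))
        end
    end
    return out
end

hashjoin_indices([1, 2, 3], [3, 1, 1])  # [(1, 2), (1, 3), (3, 1)]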

aplavin commented 1 year ago

DataFrame in the signature is too restrictive, I recommend AbstractDataFrame (so that views can also be passed for joins).

Easy, will fix.

I think it would be more efficient to only pass columns on which you join and some indicator column (like my id1 and id2 columns above) and later use this indicator column to compose the final data frame (this would probably avoid unnecessary movement of data).

Not that easy... FlexiJoins doesn't really have any concept of "columns", neither internally nor in its user interface. by_key joins take two arbitrary functions f(object) = join_key, transforming objects from the left and right sides into the key. Convenience constructors exist to (i) provide a single key function when it's the same for L and R, and (ii) convert symbols (:a in your example) to property getters.

This dataframes join optimization would likely need special casing, like "if a dataframe is passed, and the by_key argument is a symbol, only pass these columns further".
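
For concreteness, a small usage sketch of the by_key forms mentioned above (both calls should be equivalent, since a symbol is converted to the corresponding property getter):

import FlexiJoins

L = [(a=1, id1=1), (a=2, id1=2)]
R = [(a=2, id2=1), (a=1, id2=2)]

FlexiJoins.innerjoin((L, R), FlexiJoins.by_key(:a))        # key spec as a column/property name
FlexiJoins.innerjoin((L, R), FlexiJoins.by_key(x -> x.a))  # equivalent explicit key function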

aplavin commented 1 year ago

My question is: why do you repeat column a (as a_1) in the output? (I understand it might be needed in more flexible join models, but in an equality join the values should be the same, right?)

This is only relevant for joining DataFrames, so it can easily be solved by removing duplicate columns in https://gitlab.com/aplavin/FlexiJoins.jl/-/blob/master/src/FlexiJoins.jl#L47. Right? No one asked about that before, and I'm not that familiar with dataframes and expectations of their users myself.

When joining datasets in general (typically, some arrays), FlexiJoins doesn't modify the original objects in any way. As the examples demonstrate, the result of joining A to B is a collection of namedtuples, each having an element of A as x.A, and the corresponding element of B as x.B. Clearly, here it doesn't make much sense to even talk about "repeating" or "dropping" a column.
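
For example, a hedged sketch of that general usage (the dataset names M and C, and the data itself, are made up for illustration):

import FlexiJoins

measurements = [(obj="A", flux=1.0), (obj="B", flux=2.5)]
catalog      = [(obj="B", z=0.1), (obj="A", z=0.5)]

J = FlexiJoins.innerjoin((M=measurements, C=catalog), FlexiJoins.by_key(:obj))
J[1].M  # an element of `measurements`, unchanged
J[1].C  # the matching element of `catalog`, so that J[1].C.obj == J[1].M.obj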

aplavin commented 1 year ago

Finally: what is the guaranteed row order (if there is one) of the produced output (i.e. does it follow the left table?).

There is no guarantee (yet?). More optimal join algorithms could potentially output a different row order... Is the join order really important in practice?

For now, the result in simple by_key joins indeed has the order of the first (left) dataset, with nonmatches (in rightjoin) following at the very end.

One more minor thing. At some point it would also be good to add metadata propagation for joins working on Tables.jl tables that support metadata (or at least for data frames). Again - if you want I can help you with this.

Yeah, I noticed there was a lot of metadata discussion recently... I think it can be added for DataFrames quite easily by repopulating metadata after assembling the resulting dataframe at https://gitlab.com/aplavin/FlexiJoins.jl/-/blob/master/src/FlexiJoins.jl#L47.
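
A rough sketch of that idea (propagate_metadata! is a hypothetical helper, not part of FlexiJoins, assuming the DataFrames.jl >= 1.4 metadata API):

using DataFrames

function propagate_metadata!(result::AbstractDataFrame, source::AbstractDataFrame)
    # copy table-level :note-style metadata from a source table onto the assembled result
    for key in metadatakeys(source)
        value, style = metadata(source, key; style=true)
        style === :note && metadata!(result, key, value; style=:note)
    end
    return result
end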

Not sure about Tables in general. The main FlexiJoins use case for me is joining general collections, like arrays, that may or may not be Tables.jl tables semantically. So my worry here is that potential metadata handling would interfere with plain collections even if the user doesn't explicitly opt into metadata anywhere.

Sorry for the long list - I am reporting the things I noticed when comparing the packages. I hope it will be useful.

Thanks, it's great to see a new perspective on things here!

aplavin commented 1 year ago

Re unbalanced join sides: Some performance can be gained by adding a fast path for the case when no matching indices are found in https://gitlab.com/aplavin/FlexiJoins.jl/-/blob/master/src/ix_compute.jl#L10. Cursory testing shows about a 1.5x improvement. But otherwise, profiling shows that more than half of the time is spent looking up a 10-element dict 10^8 times. Even this simple snippet takes much more time than the DataFrames.jl join:

julia> @time let
       d = Dict{Int, Int}()
       for i in 1:10
           get!(d, i, i)
       end
       for i in 1:10^8
           d[mod1(i, 10)]
       end
       end;
  0.759574 seconds (4 allocations: 544 bytes)

So, clearly more efficient algorithms are possible, at least in some cases.
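
For instance, a hedged sketch of one such special case (range_join_indices is a made-up helper, not FlexiJoins or DataFrames.jl code): when one side's keys form a unit range, matching indices can be computed arithmetically, with no hashing at all.

function range_join_indices(small_keys::AbstractVector{<:Integer}, big_range::AbstractUnitRange{<:Integer})
    out = Tuple{Int, Int}[]   # (index in small side, index in big side)
    for (i, k) in enumerate(small_keys)
        j = k - first(big_range) + 1               # position of k within the range, if any
        1 <= j <= length(big_range) && push!(out, (i, j))
    end
    return out
end

range_join_indices([3, 7, 10^9], 1:10^8)  # [(1, 3), (2, 7)]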

aplavin commented 1 year ago

In GitLab by @bkamins on Oct 9, 2022, 15:39

This dataframes join optimization would likely need special casing, like "if a dataframe is passed, and the by_key argument is a symbol, only pass these columns further".

I meant to do this optimization in the methods that already have AbstractDataFrame in the signature - where you have a special method anyway.

aplavin commented 1 year ago

In GitLab by @bkamins on Oct 9, 2022, 15:41

No one asked about that before, and I'm not that familiar with dataframes and expectations of their users myself.

This is natural in DataFrames.jl for equijoins, but as you allow flexijoins :smile:, it probably makes sense to keep both source columns even for keys (as they need not be equal).

aplavin commented 1 year ago

In GitLab by @bkamins on Oct 9, 2022, 15:44

Is the join order really important in practice?

In DataFrames.jl we currently do not guarantee row order - exactly like in FlexiJoins.jl - but users often ask about it, so I wanted to know what the situation is in FlexiJoins.jl.

I think it can be added for DataFrames quite easily by repopulating metadata after assembling the resulting dataframe

Yes, adding metadata for data frames should be relatively easy. Other tables can be ignored for now (most of them do not support metadata yet).

aplavin commented 1 year ago

In GitLab by @bkamins on Oct 9, 2022, 15:47

So, clearly more efficient algorithms are possible, at least in some cases.

This is what most of the work in DataFrames.jl was spent on (but this is a never-ending story, so as long as there are no huge differences I think this is not a super high priority). In particular, regarding your comment about hashing cost, DataFrames.jl can join on columns that are either integer or pooled (like PooledArrays.jl or CategoricalArrays.jl) without having to do hashing in many cases.

aplavin commented 1 year ago

I meant to do this optimization in the methods that already have AbstractDataFrame in the signature - where you have a special method anyway.

Sure, I just mean that extra special casing would be needed. This optimization could work with by_key(:prop), but not with by_key(x -> x.prop) or by_key(x -> x.prop // 10).
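
A rough sketch of that special-casing (narrow_for_join is a hypothetical helper, not FlexiJoins API): narrowing only applies when the key spec is a plain column name.

using DataFrames

function narrow_for_join(df::AbstractDataFrame, key::Symbol)
    slim = select(df, key)    # keep only the join column...
    slim._row = axes(df, 1)   # ...plus an indicator column, to reassemble full rows later
    return slim
end

narrow_for_join(df::AbstractDataFrame, key) = df   # arbitrary key functions need the whole table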

This is what most of the work in DataFrames.jl was spent on (but this is a never-ending story, so as long as there are no huge differences I think this is not a super high priority). In particular, regarding your comment about hashing cost, DataFrames.jl can join on columns that are either integer or pooled (like PooledArrays.jl or CategoricalArrays.jl) without having to do hashing in many cases.

Optimizing for specific types of join keys (e.g. integers) is definitely possible in FlexiJoins as well. Ideally, though, these optimizations would go into the dictionary implementation, so that downstream users don't need explicit specializations.

As for specific array types, I thought about utilizing AcceleratedArrays.jl somehow, but couldn't come up with a reasonably clean and general implementation. Meanwhile, FlexiJoins can cache the preparation result (Dict, sort permutation, ...) when repeatedly joining multiple datasets to a single one. This isn't documented, but it is very useful for Monte Carlo simulations.

bkamins commented 2 months ago

Thank you for reviewing this. In general, I think the easiest and most efficient way forward is for you to just decide what, and how, you want to incorporate into FlexiJoins.jl.

aplavin commented 2 months ago

Thanks for measuring and reporting! The above is our discussion from 1.5 years ago, just transferred from GitLab.

In particular, <...> DataFrames.jl can join on columns <...> without having to do hashing in many cases.

I didn't know before that DataFrames.jl implements lots of these performance optimizations - nice to know they are noticeably helpful. Such optimizations are definitely in scope for FlexiJoins, I just don't have any near-term plans for implementing them. For my usage, anything that does joins much faster than O(n^2) is enough :)

I think it would be more efficient to only pass columns on which you join

This, on the other hand, is something I have plans to implement reasonably soon! Thanks to Accessors.jl, it's possible not just with plain column specifications like by_key(:colname), but also with functions like by_key(@o year(_.datecol)).
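
A usage sketch of such a key (hedged: the optic here is just passed as an ordinary key function, as discussed earlier in this thread; the data is made up for illustration):

using Dates
using Accessors
import FlexiJoins

L = [(id=1, datecol=Date(2020, 3, 1)), (id=2, datecol=Date(2021, 7, 5))]
R = [(id=10, datecol=Date(2020, 12, 31)), (id=20, datecol=Date(2022, 1, 1))]

FlexiJoins.innerjoin((L, R), FlexiJoins.by_key(@o year(_.datecol)))  # matches elements whose dates fall in the same year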