JuliaData / SplitApplyCombine.jl

Split-apply-combine strategies for Julia
Other
144 stars 15 forks source link

Store first value in Dict directly in innerjoin #29

Open non-Jedi opened 3 years ago

non-Jedi commented 3 years ago

I saw some of the discussion in https://github.com/JuliaData/DataFrames.jl/issues/2340 and got curious about what was possible.

This avoids allocating a Vector for the case where l does not have multiple indices with the same value. For the smoke-test benchmark in https://github.com/JuliaData/DataFrames.jl/issues/2340#issuecomment-668566508, this reduces allocations by half and overall runtime by 10%.

Most of the allocations still come from this line which it's much less clear how to reduce allocations in. I'm not sure how much https://github.com/JuliaLang/julia/issues/24909 affects performance in this case. One option would be to heuristically estimate the size of out based on the size of l and r and call sizehint! on it; this didn't seem to help in my testing.

I realize I'm only optimizing a single method of innerjoin, but I'm not super familiar with this field nor with the inner workings of this package, so I leave it to you to decide if this is a worthwhile complication and if it's relevant elsewhere in the package.

andyferris commented 2 years ago

Ah thanks.

Another strategy I tried a while back was to first assume the join keys are distinct and then bail to a more general implementation when that's not the case. It would be awesome to compare the performance vesus this.

(Sorry @non-Jedi I haven't been keeping track of PRs...)