Open non-Jedi opened 3 years ago
Ah thanks.
Another strategy I tried a while back was to first assume the join keys are distinct and then bail to a more general implementation when that's not the case. It would be awesome to compare the performance vesus this.
(Sorry @non-Jedi I haven't been keeping track of PRs...)
I saw some of the discussion in https://github.com/JuliaData/DataFrames.jl/issues/2340 and got curious about what was possible.
This avoids allocating a Vector for the case where
l
does not have multiple indices with the same value. For the smoke-test benchmark in https://github.com/JuliaData/DataFrames.jl/issues/2340#issuecomment-668566508, this reduces allocations by half and overall runtime by 10%.Most of the allocations still come from this line which it's much less clear how to reduce allocations in. I'm not sure how much https://github.com/JuliaLang/julia/issues/24909 affects performance in this case. One option would be to heuristically estimate the size of
out
based on the size ofl
andr
and callsizehint!
on it; this didn't seem to help in my testing.I realize I'm only optimizing a single method of
innerjoin
, but I'm not super familiar with this field nor with the inner workings of this package, so I leave it to you to decide if this is a worthwhile complication and if it's relevant elsewhere in the package.