JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Performance of indexing into a GroupedDataFrame #3102

Open cmey opened 2 years ago

cmey commented 2 years ago

Hi, really enjoying DataFrames.jl so far!

I've noticed a slight performance issue, and I propose a quick workaround.

Assume we call groupby on a constant dataset and then have to repeatedly index into the resulting GroupedDataFrame. My expectation was that indexing with an integer would be very fast, as fast as array indexing, but in fact it is much slower. In a scenario where we do many more lookups than there are groups, it is more effective to precompute the indexing of the GroupedDataFrame once, store all the resulting SubDataFrames in an array, and then index into that array inside the hot spot.

MWE:

using DataFrames

function benchmark()
    df = DataFrame(rand(100000, 100), :auto)
    gdf = groupby(df, :x1)
    inds = rand(1:length(gdf), 10^6)  # Many more indexations than groups.

    @time [gdf[i] for i in inds]  #  Indexing into GroupedDataFrame

    agdf = [sdf for sdf in gdf]
    @time [agdf[i] for i in inds]  # Compare to pre-built index
end

benchmark();

After running benchmark() twice (the second run avoids timing compilation):

0.265354 seconds (7.00 M allocations: 215.912 MiB)
0.007911 seconds (2 allocations: 45.776 MiB)

It makes sense that this is faster, but I was surprised by how much work is done in indexing into a GroupedDataFrame!

bkamins commented 2 years ago

The question is what do you propose? Do you propose to cache the visited SubDataFrames?

The reason why it is currently not implemented this way is that, while we support this feature:

  1. it is easy to create a cache yourself if you need it (as you just did with the agdf = [sdf for sdf in gdf] one-liner);
  2. normally such indexing is not done in performance-critical code (so in your example, 0.26 seconds would not be considered a performance bottleneck).
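For concreteness, the manual cache from point 1 amounts to something like this sketch (the data and variable names here are illustrative):

```julia
using DataFrames

df = DataFrame(g=repeat(1:3, 4), v=1:12)
gdf = groupby(df, :g)

# Materialize every SubDataFrame once, up front; collect(gdf) does the same.
cache = [sdf for sdf in gdf]

# A hot loop can now index the plain Vector instead of the GroupedDataFrame,
# skipping the SubDataFrame construction on every lookup.
first_group_values = cache[1].v
```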

However, indeed we could benchmark changing:

Base.getindex(gd::GroupedDataFrame, idx::Integer) =
    view(gd.parent, gd.idx[gd.starts[idx]:gd.ends[idx]], :)

to

Base.getindex(gd::GroupedDataFrame, idx::Integer) =
    view(gd.parent, view(gd.idx, gd.starts[idx]:gd.ends[idx]), :)

which should speed things up a bit (but maybe it would have a performance hit in other places - @nalimilan what do you think?). This would not resolve your issue though. The biggest cost is the creation of the SubDataFrame (which is currently created each time you ask for it). We could cache it, but such caching would have a memory overhead, so I would not do it unless there is a good reason (i.e. if such indexing really is heavily used in performance-critical code and is the most expensive operation - I would normally assume that the operations you do with the result of indexing are much more expensive than the indexing itself). So could you please explain your use case? Thank you!
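To make the caching trade-off concrete, a lazy cache along these lines could be layered on top by the user (this wrapper is hypothetical, not part of DataFrames.jl; it trades the per-access construction cost for extra memory proportional to the number of groups):

```julia
using DataFrames

# Hypothetical lazy cache: each SubDataFrame is constructed on first access
# and reused afterwards. Memory grows with the number of visited groups.
struct CachedGroups
    gdf::GroupedDataFrame
    cache::Vector{Union{Nothing, SubDataFrame}}
end
CachedGroups(gdf::GroupedDataFrame) = CachedGroups(gdf, fill(nothing, length(gdf)))

function Base.getindex(cg::CachedGroups, i::Integer)
    sdf = cg.cache[i]
    sdf === nothing || return sdf
    return cg.cache[i] = cg.gdf[i]  # pay the construction cost only once
end

df = DataFrame(g=[1, 1, 2], v=[10, 20, 30])
cg = CachedGroups(groupby(df, :g))
```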

cmey commented 2 years ago

The biggest cost is creation of SubDataFrame (which currently is created each time you ask for it)

Aha! That is the slightly unexpected part for me - I would have thought this would somehow be precomputed as part of the groupby already, so that indexing afterwards is super fast.

The question is what do you propose? Do you propose to cache the visited SubDataFrames?

At a minimum, I propose that, for performance reasons, the user cache the SubDataFrames of a GroupedDataFrame, like I did. It would be awesome if DataFrames.jl did that automatically, but it is fine if not! Now there is this issue to refer to :slightly_smiling_face:

My use case is backtesting: I group by one of the columns and revisit all the groups repeatedly for every backtesting period. Indexing into the GroupedDataFrame was indeed the most expensive operation, more so than computing the logic of the trading strategies, and precomputing the SubDataFrames as above sped up the entire backtesting run by 3x.
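Under stated assumptions (hypothetical column names and a mean as a stand-in for the strategy logic), that access pattern looks roughly like:

```julia
using DataFrames, Statistics

df = DataFrame(asset=repeat(["A", "B"], 5), price=1.0:10.0)
groups = collect(groupby(df, :asset))  # precompute the SubDataFrames once

# Revisit every group in every backtesting period; only a cheap Vector
# lookup happens inside the hot loop.
function run_backtest(groups, nperiods)
    pnl = 0.0
    for _ in 1:nperiods, sdf in groups
        pnl += mean(sdf.price)  # stand-in for the trading-strategy logic
    end
    return pnl
end
```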

bkamins commented 2 years ago

be somehow precomputed

The indices are precomputed, but not the SubDataFrame itself. The reason is that most of the time it is not needed. If you e.g. run combine(groupby(df, :somecol), :someothercol => mean), even the indices are not computed, since this operation can run without them. GroupedDataFrame is optimized for this kind of operation, as it is the most common.
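As a minimal illustration of that fast path (the data here is made up):

```julia
using DataFrames, Statistics

df = DataFrame(somecol=[1, 1, 2], someothercol=[10.0, 20.0, 30.0])

# This aggregation never materializes per-group SubDataFrames, which is why
# GroupedDataFrame does not eagerly build them at groupby time.
res = combine(groupby(df, :somecol), :someothercol => mean)
```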

It would be awesome if DataFrames would do that automatically

@nalimilan - should we do caching, or just add to the documentation an explanation of the situation along with an example of how one can pre-compute the SubDataFrames?

nalimilan commented 2 years ago

It could indeed make sense to use view(gd.idx, gd.starts[idx]:gd.ends[idx]), as it would avoid copying indices without a significant performance penalty since they are contiguous. Though that would require having a SubDataFrame with a SubArray field, which would increase complexity. Benchmarking is needed to know whether it's worth it.

@bkamins Why do you say that creating the SubDataFrame is the most costly part? Shouldn't it be almost free if we avoid copying indices?

Anyway I agree caching isn't desirable here as it could use a lot of memory for many common use cases where it's not useful, while collecting the GroupedDataFrame manually is very easy. Docs could be improved to mention this.

bkamins commented 2 years ago

Why do you say that creating the SubDataFrame is the most costly part? Shouldn't it be almost free if we avoid copying indices?

This is what the benchmarking shows:

julia> df = DataFrame(x=1);

julia> gdf = groupby(df, :x);

julia> @btime $gdf[1];
  251.176 ns (2 allocations: 96 bytes)

julia> @btime $gdf.idx[$gdf.starts[1]:$gdf.ends[1]]
  164.372 ns (1 allocation: 64 bytes)
1-element Vector{Int64}:
 1

julia> @btime view($gdf.idx, $gdf.starts[1]:$gdf.ends[1])
  132.981 ns (0 allocations: 0 bytes)
1-element view(::Vector{Int64}, 1:1) with eltype Int64:
 1

so it seems that changing from a Vector to a view can give some benefit, but not much (and later, having the view might slow down other operations). Of course, if the groups were larger the difference would be bigger, but in that case indexing is probably not a performance bottleneck.

nalimilan commented 2 years ago

I wonder how much we can trust benchmarks below 1 µs. The checkindex call when creating the SubDataFrame appears to be somewhat costly when run in isolation, but adding @inbounds doesn't make a clear difference. We could do it anyway, as the indices are supposed to be correct. I don't see what other calls could explain the cost of creating a SubDataFrame.