cmey opened 2 years ago
The question is what do you propose? Do you propose to cache the visited `SubDataFrame`s?

The reason why it is currently not implemented is that we already support this feature:

```julia
agdf = [sdf for sdf in gdf]
```

(a one-liner). However, indeed we could benchmark changing:
```julia
Base.getindex(gd::GroupedDataFrame, idx::Integer) =
    view(gd.parent, gd.idx[gd.starts[idx]:gd.ends[idx]], :)
```

to

```julia
Base.getindex(gd::GroupedDataFrame, idx::Integer) =
    view(gd.parent, view(gd.idx, gd.starts[idx]:gd.ends[idx]), :)
```
which should speed things up a bit (but maybe it would have a performance hit in other places; @nalimilan, what do you think?). This would not resolve your issue, though. The biggest cost is the creation of the `SubDataFrame`, which is currently created each time you ask for it. We could cache it, but such caching would have a memory overhead, so I would not do it unless there is a good reason (i.e. if such indexing is really heavily used in performance-critical code and is the most expensive operation; I would normally assume that the operations you do with the result of indexing are much more expensive than the indexing itself). So could you please explain your use case? Thank you!
> The biggest cost is creation of `SubDataFrame` (which currently is created each time you ask for it)
Aha! That is the slightly unexpected performance part for me. I would have thought that would be somehow precomputed as part of the `groupby` already, so that afterwards indexing is super fast.
> The question is what do you propose? Do you propose to cache the visited `SubDataFrame`s?
At a minimum, I propose that, for performance reasons, the user caches the `SubDataFrame`s of a `GroupedDataFrame`, like I did. It would be awesome if DataFrames would do that automatically, but it is fine if not! Now there is this issue to refer to :slightly_smiling_face:
My use case is backtesting: I `groupby` one of the columns, then revisit all groups repeatedly for every backtesting period. Indexing the `GroupedDataFrame` was indeed the most expensive operation, more than computing the logic of the trading strategies, and precomputing the `SubDataFrame`s like I did sped up the entire backtesting run by 3x.
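A minimal sketch of this caching workaround (the data, the `hot_loop` helper, and the pass count are illustrative assumptions, not the original code):

```julia
using DataFrames

df = DataFrame(group = repeat(1:3, inner = 4), value = 1:12)
gdf = groupby(df, :group)

# Cache every SubDataFrame once; gdf[i] would otherwise rebuild the view
# on each access.
cached = [sdf for sdf in gdf]

# Inside the hot loop, index the plain Vector instead of the GroupedDataFrame.
function hot_loop(groups, npasses)
    total = 0.0
    for _ in 1:npasses          # e.g. one pass per backtesting period
        for sdf in groups
            total += sum(sdf.value)
        end
    end
    return total
end

total = hot_loop(cached, 1000)
```

Since `cached` is an ordinary `Vector{SubDataFrame}`, indexing it inside the loop is plain array indexing, with no per-access view construction.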
> be somehow precomputed
The indices are precomputed, but not the `SubDataFrame` itself. The reason is that most of the time it is not needed. If you e.g. run `combine(groupby(df, :somecol), :someothercol => mean)`, even the indices are not computed, since it is possible to run this operation without them. `GroupedDataFrame` is optimized for this kind of operation, as it is the most common.
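For illustration, a minimal runnable version of that aggregation path (column names taken from the snippet above; the data is invented):

```julia
using DataFrames
using Statistics  # for mean

df = DataFrame(somecol = [1, 1, 2, 2],
               someothercol = [10.0, 20.0, 30.0, 40.0])

# The split-apply-combine path aggregates per group directly;
# no per-group SubDataFrame is materialized along the way.
res = combine(groupby(df, :somecol), :someothercol => mean)
```

The result has one row per group, with the aggregate in a column named `someothercol_mean` by default.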
> It would be awesome if DataFrames would do that automatically
@nalimilan - we could either do caching, or just add to the documentation an explanation of the situation and an example of how one can pre-compute the `SubDataFrame`s?
It could indeed make sense to use `view(gd.idx, gd.starts[idx]:gd.ends[idx])`, as it would avoid copying indices without a significant performance penalty, since they are contiguous. Though that would require having a `SubDataFrame` with a `SubArray` field, which would increase complexity. Benchmarking is needed to know whether it's worth it.
@bkamins Why do you say that creating the `SubDataFrame` is the most costly part? Shouldn't it be almost free if we avoid copying indices?
Anyway, I agree caching isn't desirable here, as it could use a lot of memory in many common use cases where it's not useful, while collecting the `GroupedDataFrame` manually is very easy. Docs could be improved to mention this.
> Why do you say that creating the `SubDataFrame` is the most costly part? Shouldn't it be almost free if we avoid copying indices?
This is what the benchmarking shows:

```julia
julia> df = DataFrame(x=1);

julia> gdf = groupby(df, :x);

julia> @btime $gdf[1];
  251.176 ns (2 allocations: 96 bytes)

julia> @btime $gdf.idx[$gdf.starts[1]:$gdf.ends[1]]
  164.372 ns (1 allocation: 64 bytes)
1-element Vector{Int64}:
 1

julia> @btime view($gdf.idx, $gdf.starts[1]:$gdf.ends[1])
  132.981 ns (0 allocations: 0 bytes)
1-element view(::Vector{Int64}, 1:1) with eltype Int64:
 1
```
so it seems that changing from a `Vector` to a view can give some benefit, but not much (and later, having the view might slow down other operations). Of course, if groups were larger, the difference would be bigger, but in that case indexing is probably not the performance bottleneck.
I wonder how much we can trust benchmarks below 1 µs. The `checkindex` call when creating the `SubDataFrame` appears to be somewhat costly when run in isolation, but adding `@inbounds` doesn't make a clear difference. We could do it anyway, as the indices are supposed to be correct. I don't see what other calls could explain the cost of creating a `SubDataFrame`.
Hi, really enjoying DataFrames.jl so far!
I've noticed a slight performance issue, and I propose a quick workaround.
Assume we `groupby` a const dataset, and we then have to repeatedly index into the resulting `GroupedDataFrame`. My expectation was that an indexing operation using an integer would be very fast, as fast as array indexing, but in fact it is much slower. In a scenario where we do many more indexations than there are groups, it is more effective to precompute the indexing of the `GroupedDataFrame` and store all the resulting `SubDataFrame`s into an array, which we then use to do the indexing inside the hot spot.

MWE:
After running `benchmark()` twice:

It makes sense that this is faster, but I was surprised by how much work is done in indexing into a `GroupedDataFrame`!
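The MWE code itself is missing above; the following is a hedged sketch of what such a benchmark might look like (the `benchmark` function, data sizes, and pass counts are assumptions, not the original post's code):

```julia
using DataFrames

df = DataFrame(id = repeat(1:100, inner = 10), x = rand(1000))
gdf = groupby(df, :id)
cached_sdfs = [sdf for sdf in gdf]  # precompute the SubDataFrames once

function benchmark(gdf, cached_sdfs)
    # Revisit every group many times, as a backtest over many periods would.
    s1 = 0.0
    @time for _ in 1:100, i in 1:length(gdf)
        s1 += sum(gdf[i].x)          # builds a fresh SubDataFrame each access
    end
    s2 = 0.0
    @time for _ in 1:100, i in 1:length(cached_sdfs)
        s2 += sum(cached_sdfs[i].x)  # reuses the precomputed SubDataFrames
    end
    return s1, s2
end

s1, s2 = benchmark(gdf, cached_sdfs)
```

Both loops compute the same totals; the second avoids constructing a `SubDataFrame` on every access, which is where the reported speedup comes from.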