Open nalimilan opened 3 years ago
I think it is a benchmarking artifact.
A small thing is that we could replace DataAPI.refpool.(cols) with map(DataAPI.refpool, cols), as it should reduce the compilation time. But this is minor. Also, as @timholy suggests, this function can probably be split up into several smaller functions depending on the sort and skipmissing arguments.
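For a tuple of columns the two spellings give the same result; map over a tuple simply avoids compiling broadcast machinery. A minimal sketch (getpool is an illustrative stand-in for DataAPI.refpool, not a real function):

```julia
# Illustrative stand-in for DataAPI.refpool, used only for this sketch.
getpool(v::Vector) = unique(v)

cols = ([1, 1, 2], [3, 4, 4])

# Broadcasting over the tuple and mapping over it are equivalent here;
# `map` over a tuple typically compiles less code than a broadcast does.
getpool.(cols) == map(getpool, cols)  # true
```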
Finally - and I believe this is a major potential gain: if sort=false, skipmissing=false and a single column is passed where refpool <: AbstractVector, then we can just check allunique(only(refpools)) and, if all levels are used (most of the time they will be in practice), take a fast path.
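A rough sketch of the fast path proposed above (illustrative names only: refs stands for DataAPI.refarray(col), pool for DataAPI.refpool(col); the real logic would live inside row_group_slots):

```julia
# Sketch of the proposed single-column fast path for sort=false,
# skipmissing=false. Returns `nothing` when the generic path must be used.
function fast_groups(refs::AbstractVector{<:Integer}, pool::AbstractVector)
    allunique(pool) || return nothing   # pool levels must be unique
    seen = falses(length(pool))
    @inbounds for r in refs
        seen[r] = true
    end
    all(seen) || return nothing         # some level unused: fall back
    return collect(Int, refs)           # refs already are group indices
end
```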
Additionally, in DataAPI.jl we could add a function that, for a given array, returns whether refpool is guaranteed to be unique (so that we do not have to run the allunique(only(refpools)) check in that case).
Sorry for the long list of things, but in combination they could improve speed in the most common case (my benchmarks show that at least a ~10x speedup is to be gained in common scenarios), plus possibly also compile-time latency.
I think it is a benchmarking artifact.
You mean that allocating a vector really takes about 25ms?
A small thing is that we could replace DataAPI.refpool.(cols) with map(DataAPI.refpool, col) as it should reduce the compilation time. But this is minor. Also as @timholy suggests - this function can be probably split up into several smaller functions depending on sort and skipmissing arguments.
Sure we can do that.
Finally - and I believe this is a major potential gain: if sort=false, skipmissing=false and a single column is passed when refpool <: AbstractVector then we can just check allunique(only(refpools)) and if all levels are used (most of the time they will be used in practice) and then have a fast path function.
Unfortunately in my tests I couldn't find a faster implementation than the current one for the example above (i.e. with only 10 groups). This is what prompted me to file this issue. Even removing all the code handling multiple columns, skipmissing and sort, I get similar timings. And it doesn't feel right that it is slower than combine(gd, :y => sum). How and in which cases do you get 10× speedups?
Finally in DataAPI.jl we could add a function that for a given array returns information if refpool is guaranteed to be unique (so that we do not have to run a check if allunique(only(refpools)) in that case).
Yes that's https://github.com/JuliaData/DataFrames.jl/issues/2558 right?
You mean that allocating a vector really takes about 25ms?
No, I mean that I have checked several options by running @time many times (as this is essentially what matters) and they were similar. It is possible that in DataFrames.row_group_slots($((df.x,)), Val(false), $(Vector{Int}(undef, nrow(df))), false, false); something is optimized out that should not be (but I might be wrong).
How and in which cases do you get 10× speedups?
This is the speedup:
julia> df = DataFrame(x=PooledArray(rand('a':'k', 10^8)));
julia> @time gdf = groupby(df, :x);
0.326293 seconds (47 allocations: 762.943 MiB, 18.72% gc time)
julia> gdf.groups == collect(Int, DataAPI.refarray(df.x))
true
julia> @time collect(Int, DataAPI.refarray(df.x));
0.247577 seconds (2 allocations: 762.940 MiB)
julia> @time copy(DataAPI.refarray(df.x));
0.027378 seconds (2 allocations: 95.368 MiB)
So:
- with collect(Int, DataAPI.refarray(df.x)) the speedup is only around 30% (still it is a lot IMO and I would go for it);
- if we allow gdf.groups to be a vector of something else than Int then it could be much cheaper, as you see with copy(DataAPI.refarray(df.x)) (but I am not sure if it is worth it - maybe yes, as we can precompile things; we would have to check that we do not later get a slowdown in compile time if we allow gdf.groups to hold something else than Int).

Yes that's #2558 right?
Right - I just wanted to keep all things in one place.
Update! I have checked PooledArrays.jl#master (which has the refarray eltype set to UInt32) and we get:
julia> @benchmark collect(Int, DataAPI.refarray($df.x))
BenchmarkTools.Trial:
memory estimate: 762.94 MiB
allocs estimate: 2
--------------
minimum time: 231.792 ms (0.14% GC)
median time: 300.978 ms (21.56% GC)
mean time: 296.998 ms (20.61% GC)
maximum time: 367.297 ms (34.96% GC)
--------------
samples: 17
evals/sample: 1
julia> @benchmark collect(UInt32, DataAPI.refarray($df.x))
BenchmarkTools.Trial:
memory estimate: 381.47 MiB
allocs estimate: 2
--------------
minimum time: 109.974 ms (0.00% GC)
median time: 113.834 ms (0.00% GC)
mean time: 146.304 ms (20.27% GC)
maximum time: 265.318 ms (49.11% GC)
--------------
samples: 35
evals/sample: 1
julia> @benchmark groupby($df, :x)
BenchmarkTools.Trial:
memory estimate: 762.94 MiB
allocs estimate: 46
--------------
minimum time: 271.016 ms (0.00% GC)
median time: 340.696 ms (19.26% GC)
mean time: 335.788 ms (18.19% GC)
maximum time: 403.206 ms (32.10% GC)
--------------
samples: 15
evals/sample: 1
so indeed groupby has very little overhead. What we could do is to have two allowed element types for gdf.groups: UInt32 and Int, with UInt32 the default and Int as a fallback if we have extremely many groups (this fallback will not often be triggered in practice, I think, as it would require hundreds of GBs of RAM).
As you can see we can win ~50% of the time (just by saving space with UInt32 vs Int). This should allow us to move up significantly in https://h2oai.github.io/db-benchmark/ for the 50GB tests even without multithreading.
We could even go for:
mutable struct GroupedDataFrame{T<:AbstractDataFrame, S<:Union{UInt32, Int}}
parent::T
cols::Vector{Symbol} # column names used for grouping
groups::Vector{S} # group indices for each row in 0:ngroups, 0 skipped
idx::Union{Vector{S}, Nothing} # indexing vector sorting rows into groups
starts::Union{Vector{S}, Nothing} # starts of groups after permutation by idx
ends::Union{Vector{S}, Nothing} # ends of groups after permutation by idx
ngroups::Int # number of groups
keymap::Union{Dict{Any, S}, Nothing} # mapping of key tuples to group indices
lazy_lock::Threads.ReentrantLock # lock is needed to make lazy operations
# thread safe
end
and set S to UInt32 if nrow(parent) < typemax(UInt32) (which in practice is almost always the case). This will significantly reduce memory consumption and should give a speedup. The only challenge will be testing against Int (but this should be doable). What do you think?
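Choosing S could then be a one-liner (a sketch; group_index_type is an illustrative name, not an existing function):

```julia
# Sketch: pick the narrowest index type that can address every row.
# `Int` is only needed for data frames with typemax(UInt32) rows or more.
group_index_type(nrows::Integer) = nrows < typemax(UInt32) ? UInt32 : Int
```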
Note that collect(Int, DataAPI.refarray(df.x)) or collect(DataAPI.refarray(df.x)) give an overly optimistic estimate of the speed we can actually reach, as we need to fill the seen vector to keep track of which groups we have encountered. AFAIK that disables SIMD in the loop, so the gain due to using UInt32 will be lower than with collect. So we need to measure that first to make sure it would be worth it.
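For concreteness, the inner loop in question looks roughly like this (a sketch, not the actual DataFrames.jl code):

```julia
# Sketch of the grouping inner loop: the scattered write into `seen`
# is a second, data-dependent store that tends to prevent the compiler
# from vectorizing what would otherwise be a plain copy of `refs`.
function fill_groups!(groups::AbstractVector{<:Integer},
                      refs::AbstractVector{<:Integer},
                      seen::AbstractVector{Bool})
    @inbounds for i in eachindex(refs, groups)
        j = refs[i]
        groups[i] = j
        seen[j] = true
    end
    return groups
end
```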
We could imagine adding a field to PooledArray indicating whether all values are used, in which case we could avoid computing seen. It would be set to true on construction (when appropriate), and set to false when calling setindex! and other functions that can make a level unused. That would help when working e.g. with data imported from CSV, but not once you extract a subset.
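A hypothetical sketch of that idea (PooledArray has no such field today; TrackedPool and all_levels_used are made-up names for illustration):

```julia
# Hypothetical sketch: a pooled vector that tracks whether every pool
# level is still in use, invalidating the flag conservatively on writes.
mutable struct TrackedPool{T}
    refs::Vector{UInt32}
    pool::Vector{T}
    all_levels_used::Bool
end

function TrackedPool(values::Vector{T}) where T
    pool = unique(values)
    refs = UInt32[findfirst(==(v), pool) for v in values]
    TrackedPool{T}(refs, pool, true)   # every level used at construction
end

function Base.setindex!(p::TrackedPool, v, i::Integer)
    j = findfirst(==(v), p.pool)
    if j === nothing
        push!(p.pool, v)
        j = length(p.pool)
    end
    p.refs[i] = j
    # Overwriting may have left the previous level unused; without
    # rescanning we must conservatively drop the guarantee.
    p.all_levels_used = false
    return v
end
```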
One issue with using UInt32 is that for the dict-based fallback method, we don't know in advance how many groups there will be: so we would have to check that we don't reach the threshold, and if so copy everything to a new table. That should be pretty rare, but it makes the code more complex and as you noted it's hard to test (unless we also allow UInt8).
as we need to fill the seen vector to keep track of which groups we have encountered.
agreed, but I assumed it would be fast (maybe I am wrong).
We could imagine adding a field to PooledArray indicating whether all values are used
I thought about it, but this is problematic, also with views (see https://github.com/JuliaData/PooledArrays.jl/issues/44 as a related issue).
we don't know in advance how many groups there will be:
This is not a problem I think. We can check how many rows a data frame has. If it is less than typemax(UInt32), we are safe to assume there are at most that many groups.
Having said that - I think the current code is quite fast. However, I believe that even if we keep the current code but switch it to UInt32 it will be faster (less memory movement, and larger portions of the data will fit into the CPU cache). This will have an impact for cases with very many groups (which is the case we are currently bad at in the H2O benchmarks).
OK, it seems that using UInt32 indeed takes about 40% less time than Int for the following benchmark:
using Revise, DataFrames, BenchmarkTools, PooledArrays
N = 100_000_000;
df = DataFrame(x=PooledArray(rand(1:10, N), UInt32), y=rand(N));
@btime groupby(df, :x);
I've pushed changes to test UInt32 to the nl/UInt32 branch if you want to have a look (it's not clean, just to make a quick test).
This is fantastic - I will have a look and report back. I will also test how combine performance looks afterwards (to make sure that we do not degrade performance later). I assume you are using PooledArrays.jl#master (as it always uses UInt32 for refpool, which is relevant - that is why I pinged to tag a new release of PooledArrays.jl).
combine(gdf, :x => sum) is a tad faster on UInt32 (but what is important is that it is not slower). For your gdf we have:
julia> @time combine(gdf, :x => sum)
20.399369 seconds (276 allocations: 3.501 GiB, 2.46% gc time)
10×2 DataFrame
Row │ x x_sum
│ Int64 Int64
─────┼─────────────────
1 │ 2 20008186
2 │ 4 40003392
3 │ 1 9995654
4 │ 9 89992539
5 │ 8 79997288
6 │ 3 30012795
7 │ 6 59993004
8 │ 5 50011285
9 │ 10 99970470
10 │ 7 69987190
julia> @time combine(gdf, sdf -> sum(sdf.x))
1.487420 seconds (2.00 M allocations: 867.696 MiB, 19.10% gc time)
10×2 DataFrame
Row │ x x1
│ Int64 Int64
─────┼─────────────────
1 │ 2 20008186
2 │ 4 40003392
3 │ 1 9995654
4 │ 9 89992539
5 │ 8 79997288
6 │ 3 30012795
7 │ 6 59993004
8 │ 5 50011285
9 │ 10 99970470
10 │ 7 69987190
Can you please cross-check that? If this is not some bug on my side, it means that we have a significant problem in the fast reductions implementation.
EDIT: the performance problem is with sum and mean, but not with e.g. nrow or maximum.
@nalimilan - does this issue boil down to using UInt32 if we know gdf is short? If yes, then probably we should change the title of the issue and mark it in the 1.x milestone. Right?
Yes I think so. Indeed it's not breaking so it can happen after 1.0. Though it could be worth implementing now in order to look good in benchmarks.
I had missed your previous comment. Indeed I also see the very bad performance of combine(gdf, :x => sum) on my branch. I guess there's a type instability somewhere, which doesn't trigger allocations because it's inferred as Union{Int, UInt32}.
OK - thank you for looking into this (both the change to UInt32 and the type instability).
At some point same thing can be done for joins, see https://github.com/JuliaData/DataFrames.jl/issues/2688.
I resolved the slowdown part in https://github.com/JuliaData/DataFrames.jl/pull/2708. The UInt32 part is something I assume you are looking at - is this correct, @nalimilan? Thank you!
I have a mostly working implementation in the (updated) nl/UInt32 branch. But actually I'm not sure it's really worth the increased complexity. It's about 30% faster for groupby, but given how fast that part is, that doesn't make a big difference even if the only thing you do with the result is call the optimized combine to compute a sum.
using Revise, DataFrames, BenchmarkTools, PooledArrays
N = 100_000_000;
df = DataFrame(x=PooledArray(rand(1:10, N), UInt32), y=rand(N));
# main
julia> @btime groupby(df, :x);
29.111 ms (40 allocations: 76.30 MiB)
julia> @btime combine(groupby(df, :x), :y => sum);
61.132 ms (240 allocations: 76.31 MiB)
# branch:
julia> @btime groupby(df, :x);
21.233 ms (45 allocations: 38.15 MiB)
julia> @btime combine(groupby(df, :x), :y => sum);
49.567 ms (244 allocations: 38.16 MiB)
This was also my fear. Adding UInt32 support will increase code complexity and thus compilation latency, so maybe it is not worth it in practice, given that DataFrames.jl is mostly used interactively. So what do we do: close this for now along with #2688, or keep them open for a future decision (in which case I would rename this issue to clearly indicate that using UInt32 is what is to be done)?
It seems row_group_slots for arrays implementing DataAPI.refpool is much slower than it should be. Profiling reveals that most of the time is spent on the groups[i] = j line. Yet, calling row_group_slots directly is twice as fast as using groupby. The difference seems to be due to the allocation of the groups vector, but allocating the array separately is very fast. Not sure whether that's a benchmarking artifact. This is not new, as it happens both on master and on 0.21.8, and both with Julia 1.5 and 1.6.