JuliaData / DataTables.jl

(DEPRECATED) A rewrite of DataFrames.jl based on Nullable

WIP: Restore old grouping algorithm and improve it #76

Closed nalimilan closed 7 years ago

nalimilan commented 7 years ago

Follow the strategy used by Pandas. The new implementation is more efficient since it avoids creating a NullableCategoricalArray: the integer codes are combined on the fly with those computed from previous columns. Hashing only happens once by giving arbitrary codes to levels in the first pass; after that, only integer codes are used.
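As a rough illustration of that strategy (a simplified sketch with hypothetical helper names, not the PR's actual code), each column's values are hashed once into arbitrary integer codes, and the codes of successive columns are folded into a single group index via a running cartesian product:

```julia
# Hash each value once, assigning arbitrary integer codes 1:ncodes.
function column_codes(v::AbstractVector)
    seen = Dict{eltype(v), Int}()
    codes = similar(v, Int)
    for (i, x) in enumerate(v)
        codes[i] = get!(seen, x, length(seen) + 1)
    end
    return codes, length(seen)
end

# Combine per-column codes into a single group index; after the first pass
# over each column, only integer codes are manipulated.
function combine_codes(cols::AbstractVector...)
    groups = ones(Int, length(cols[1]))
    ngroups = 1
    for col in cols
        codes, n = column_codes(col)
        groups .= (groups .- 1) .* n .+ codes
        ngroups *= n
    end
    return groups, ngroups
end
```

For example, `combine_codes(["a", "a", "b"], [1, 2, 1])` gives `([1, 2, 3], 4)`: three distinct combinations observed out of a cartesian product of four.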

Move the per-column operations to separate functions which can be specialized by the compiler for each column type. This also allows using a more efficient method for CategoricalArray.
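A hedged sketch of what such a function barrier could look like (the `group_codes` name and the reliance on the internal `refs` field are assumptions for illustration): the generic method hashes values, while the CategoricalArray method reuses the integer reference codes the array already stores, so no hashing is needed at all.

```julia
using CategoricalArrays

# Generic fallback: hash each value once to assign arbitrary integer codes.
function group_codes(v::AbstractVector)
    seen = Dict{eltype(v), Int}()
    codes = [get!(seen, x, length(seen) + 1) for x in v]
    return codes, length(seen)
end

# Specialized method: a CategoricalArray already stores integer codes into
# its pool (the internal `refs` field), so hashing can be skipped entirely.
function group_codes(v::CategoricalArray)
    return Int.(v.refs), length(levels(v))
end
```

Because each call dispatches on the concrete column type, the compiler can specialize the hot loop for that type instead of iterating over an abstractly typed column.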

Fix ordering of CategoricalArray levels when levels have been reordered, and sort null values last for consistency with other nullable arrays. Enable sorting by default since its cost is relatively small compared with the rest.

Avoid some allocations by using in-place operations and Base.unique!().


This is still WIP, in particular because we need to implement the code to compress the group codes when the cartesian product would overflow (see this code in Pandas). The tests for sorted and unsorted grouping also need to be adapted.

This branch cuts down the time to run this example from 10.7 seconds (similar to DataFrames) to 5.7 seconds:

```julia
using DataTables
N = 100_000_000
A = rand(N);
B = rand(1:100, N);
dt = DataTable(A = A, B = B);
@time by(dt, :B, d -> mean(d[:A]));
```

nalimilan commented 7 years ago

@alyst I've just added a mechanism to compress group indices when they would overflow. It was actually very simple: we just need to take the computed indices as if they were a grouping column. So every time overflow would happen, we have to pay a cost equivalent to grouping on an integer column.
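A minimal sketch of that compression step (illustrative name, not the PR's code): the combined indices are re-coded exactly as if they were a single integer grouping column, mapping the sparse cartesian-product values onto dense codes.

```julia
# Re-code the (possibly sparse) combined group indices as if they were one
# more integer grouping column, yielding dense codes 1:ngroups.
function compress_group_indices(groups::AbstractVector{<:Integer})
    remap = Dict{eltype(groups), Int}()
    compressed = similar(groups, Int)
    for (i, g) in enumerate(groups)
        compressed[i] = get!(remap, g, length(remap) + 1)
    end
    return compressed, length(remap)
end
```

This is one extra pass over the rows, which matches the "cost equivalent to grouping on an integer column" mentioned above.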

The code now also always compresses the group indices at the end when there are multiple grouping columns, since we allocate a vector of a length equal to the number of groups. Without this, a memory error can happen in the indexer even if we avoided overflow in the step before. This seems to be what Pandas does (though I'm not completely sure), but in some cases it's clearly not useful, i.e. when there are few empty combinations. So I wonder whether a kind of threshold wouldn't be in order (e.g. compress only if there are more than half as many groups as rows).
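Written out, the threshold suggested above might look like this (the name and cutoff are purely illustrative):

```julia
# Re-compress only when the potential number of groups is large relative to
# the number of rows, i.e. when many combinations are likely to be empty.
should_compress(ngroups::Integer, nrows::Integer) = ngroups > nrows ÷ 2
```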

I also realized that by moving to a UInt index rather than UInt32, it's quite hard to have enough groups to trigger an overflow. What was the use case which triggered it?

cjprybol commented 7 years ago

Thanks for working on this, Milan! I read through the Discourse discussion and the comments on #17, and the reasons for switching back to the old approach make sense to me.

I've started working on a benchmark suite for DataTables again. Given how often people ask about performance, and that we missed the regressions on join in #17 (too excited about the improvements in groupby!), I think the sooner we get some good, library-wide performance coverage, the better. I'll try to get the tests for split-apply-combine-related functionality done ASAP so we can compare these changes to the current approach from #17, and also so I can check https://github.com/JuliaData/DataTables.jl/pull/65.
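One possible shape for a split-apply-combine entry in such a suite, as a hedged sketch with BenchmarkTools (the suite layout, names, and sizes are assumptions, not the actual benchmarks):

```julia
using DataTables, BenchmarkTools

const N = 10_000_000
const dt = DataTable(A = rand(N), B = rand(1:100, N))

SUITE = BenchmarkGroup()
SUITE["groupby"] = @benchmarkable groupby($dt, :B)
SUITE["by-mean"] = @benchmarkable by($dt, :B, d -> mean(d[:A]))

results = run(SUITE, verbose = true)
```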

nalimilan commented 7 years ago

It would really be great to have a series of benchmarks; without them it's hard to make systematic performance improvements. You can probably steal examples from Pandas and data.table, which show benchmarks on their websites.

davidanthoff commented 7 years ago

https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping also has some nice benchmarks.

nalimilan commented 7 years ago

I have found an alternative approach at https://github.com/JuliaData/DataTables.jl/pull/79 which seems to give similar or even better performance. Unfortunately, the issues with join are quite different so we need to find something else.

nalimilan commented 7 years ago

Closing in favor of #79.