Performance Tips for Categorical Data

jonas-schulze commented 5 years ago

In your performance tips at In [8] you mentioned:

Allowing missing as well as categorical slows down computations

I might have an idea on how to fix this (at least partially). I did not do any research on whether there has already been done some work on this. If we compute the count map on the references into the pool, we only operate on UInts:

julia> function StatsBase.countmap(x::CategoricalArray; alg = :auto)
           cm = countmap(x.refs, alg=alg)
           Dict(x.pool[ref] => count for (ref, count) in cm)
       end

julia> x = rand(1:10, 10^6);

julia> y = categorical(x);

julia> z = compress(y);

julia> @btime countmap(x);
  2.927 ms (8 allocations: 7.63 MiB)

julia> @btime countmap(y);
  2.141 ms (14 allocations: 3.82 MiB)

julia> @btime countmap(z);
  541.943 μs (11 allocations: 3.27 KiB)

On top of that, if we guard x.pool[ref] for ref == 0, which corresponds to a missing value, we also fix the performance loss for missing values.

My question is, where should I propose and contribute a change like the above, CategoricalArrays.jl or StatsBase.jl or somewhere else? Neither of the two has the other as a dependency, but they need to be added to allow for type dispatch (I think; I'm still a Julia novice), so maybe it should go to DataFrames.jl?

bkamins commented 5 years ago

countmap in this tutorial is just an example.

Regarding your proposal it should rather go to CategoricalArrays.jl as otherwise it would be type piracy (also because you use direct field access which should not be done outside CategoricalArrays.jl).

@nalimilan - do you think it is worth to add this specific method (maybe a review of other standard methods would be in place?).

jonas-schulze commented 5 years ago

countmap in this tutorial is just an example.

I know, but from the top of my head, it should also be trivial to optimize map, sum and prod.

it should rather go to CategoricalArrays.jl as otherwise it would be type piracy

Right. But how would one cope with the fact, that CategoricalArrays.jl otherwise does not rely on or need StatsBase.jl in any way. Every user of CategoricalArrays.jl would inherit that "blind" dependency.

If you give me some pointers on where to put these patches, I'd be happy to help. 🙂

nalimilan commented 5 years ago

FreqTables also has an optimized method, but I couldn't add it to StatsBase's countmap to avoid having it depend on CategoricalArrays. Maybe we could use @require, though. The same issue applies to PooledArrays.

bkamins commented 5 years ago

Having thought about it my conclusion is that it should be acceptable even to add StatsBase.jl as a dependency of CategoricalArrays.jl. The reason is that CategoricalArray is a statistics-oriented type.

This opposed to PooledArrays.jl which are general and should not depend on StatsBase.jl I think.

But in both cases we could consider using @require. My only question with this would be wouldn't then the behavior depend on code loading sequence?

nalimilan commented 5 years ago

But in both cases we could consider using @require. My only question with this would be wouldn't then the behavior depend on code loading sequence?

Not AFAIK, that's the strength of Requires.jl.

One thing to keep in mind is that StatsBase is supposed to be merged into Statistics at some point, and then it won't be possible to use @require I guess. But at that point it won't be a problem for CategoricalArrays and PooledArrays to depend on Statistics.

xiaodaigh commented 5 years ago

Does the invention of DataAPI.jl change anything here.

nalimilan commented 5 years ago

Yes, it should be possible to add an optimized code path using refarray.

bkamins / Julia-DataFrames-Tutorial

Performance Tips for Categorical Data #11