JuliaData / CategoricalArrays.jl

Arrays for working with categorical data (both nominal and ordinal)

Sparse categorical arrays? #374

Open · gdalle opened this issue 3 years ago

gdalle commented 3 years ago

Hey there, thanks for this really nice package! To further decrease the memory footprint of my dataframes, I think a sparse version of CategoricalArray might be useful. I'm not really sure how to implement it yet, but I'm willing to take a look. Would you welcome a PR in this direction? Do you have any tips?

sprmnt21 commented 3 years ago

I have no particular contribution to offer, but I'm interested in learning something about the subject and would like to ask a few questions. For example, I'd like a more precise idea of the quantities involved. Can you show, with examples, how large your original database is now and how much you would like to reduce it by?

gdalle commented 3 years ago

Essentially I have thousands of columns with categorical entries and lots (> 90%) of missing values, so I thought it might be useful. However, it turns out that sparse columns in a dataframe are not a good idea (thanks @pdeffebach for this minimal not-working example):

julia> using DataFramesMeta, SparseArrays;

julia> df = DataFrame(x = sparse([1, 0, 0, 0, 1]))
5×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     0
   3 │     0
   4 │     0
   5 │     1

julia> @subset! df :x .== 1
ERROR: MethodError: no method matching deleteat!(::SparseVector{Int64, Int64}, ::Vector{Int64})
Closest candidates are:
  deleteat!(::Vector{T} where T, ::AbstractVector{T} where T) at array.jl:1377
  deleteat!(::Vector{T} where T, ::Any) at array.jl:1376
  deleteat!(::BitVector, ::Any) at bitarray.jl:981
  ...
Stacktrace:
 [1] _delete!_helper(df::DataFrame, drop::Vector{Int64})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:999
 [2] delete!(df::DataFrame, inds::Vector{Int64})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:976
 [3] subset!(df::DataFrame, args::Any; skipmissing::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/subset.jl:292
 [4] top-level scope
   @ ~/.julia/packages/DataFramesMeta/lbAjC/src/macros.jl:681

bkamins commented 3 years ago

it turns out that sparse columns in a dataframe are not a good idea

This is unrelated to DataFrames.jl. The problem is that your column type does not define deleteat!. If you add a method for this function for your column type, everything will work.

gdalle commented 3 years ago

But I guess there must be a good reason why deleteat! is not implemented for sparse vectors? Would there be a performance penalty?

pdeffebach commented 3 years ago

CategoricalArrays are already "sparse" in the sense that they don't take up the full memory. So I don't think this is needed.

julia> using CategoricalArrays

julia> x = rand(1:5, 10_000);

julia> y = categorical(x);

julia> sizeof(y)
16

julia> sizeof(x)
80000

bkamins commented 3 years ago

Would there be a performance penalty?

deleteat! could be implemented in general; I think it's just that no one has done it yet. The cost of this operation would not be that big (though of course bigger than for Vectors). You would need to do deleteat! on the nzind and nzval fields, which are Vectors, and then update the indices in nzind for indices larger than the last deleted value.
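
A rough sketch of this approach (the helper name `deleteat_sparse` is illustrative, not an actual SparseArrays method, and it returns a new vector instead of mutating in place):

```julia
using SparseArrays

# Remove the given indices from a SparseVector by filtering its stored
# (index, value) pairs and shifting each surviving index down by the
# number of deleted positions below it.
function deleteat_sparse(v::SparseVector, inds::AbstractVector{<:Integer})
    inds = sort(unique(inds))
    nzind, nzval = SparseArrays.nonzeroinds(v), SparseArrays.nonzeros(v)
    newind = Int[]
    newval = eltype(nzval)[]
    for (i, x) in zip(nzind, nzval)
        d = searchsortedlast(inds, i)        # number of deleted indices ≤ i
        (d > 0 && inds[d] == i) && continue  # this stored entry is deleted
        push!(newind, i - d)
        push!(newval, x)
    end
    return SparseVector(length(v) - length(inds), newind, newval)
end

v = sparse([1, 0, 0, 0, 1])
w = deleteat_sparse(v, [2, 3, 4])   # stored index 5 shifts down to 2
```

A true in-place deleteat! would do the same filtering and shifting directly on the nzind and nzval fields.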

don't take up the full memory. So I don't think this is needed.

But they do take memory linear in their size, while for sparse vectors the memory taken is proportional to the number of non-zero entries. If the data is very sparse, there is a very big difference between these two scenarios.

Still, for the reason @pdeffebach raises, this is probably not a top priority. However, your measurement of the memory footprint is incorrect. You should do:

julia> using CategoricalArrays

julia> x = rand(1:5, 10_000);

julia> y = categorical(x);

julia> z = categorical(x, compress=true);

julia> Base.summarysize(x)
80040

julia> Base.summarysize(y)
40584

julia> Base.summarysize(z)
10536

gdalle commented 3 years ago

Yes, I agree with both of you. I just thought it would be a nice-to-have in my case, since categorical arrays still store one integer per value no matter what (the category rank). With compress=true this integer can be small, but maybe we can save even more space by only remembering the indices where values are not missing. Of course this isn't a top priority. However, I imagine it wouldn't be too hard to implement, which is why I wouldn't mind having a go myself.
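
The saving being discussed can be illustrated without CategoricalArrays at all, just by comparing a dense refs vector (one byte per row, as with compress=true) against a sparse one that stores only the non-missing entries (0 already encodes missing in the refs):

```julia
using SparseArrays

# A hypothetical refs column for 100_000 rows where only ~1% carry a value.
n = 100_000
dense_refs = zeros(UInt8, n)
dense_refs[1:100:end] .= 0x01            # ~1000 non-missing entries

# Sparse version: store only the indices and values of non-missing entries.
nz = findall(!iszero, dense_refs)
sparse_refs = sparsevec(nz, dense_refs[nz], n)

Base.summarysize(dense_refs)   # linear in n (≈ 100 kB here)
Base.summarysize(sparse_refs)  # roughly proportional to the ~1000 stored entries
```

Note that each stored entry in the sparse version costs an Int index plus the value itself, so the break-even point depends on how sparse the column really is.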

pdeffebach commented 3 years ago

I would file an issue in Base as well about resizing sparse arrays, and see what they think.

bkamins commented 3 years ago

I imagine it wouldn't be too hard to implement, which is why I wouldn't mind having a go myself

I think it would not be that hard. You would probably need to specify what value you want to be treated as the "zero" of the sparse array.

However, I would propose to wait for @nalimilan to comment how he sees adding it to CategoricalArrays.jl from a general perspective of package maintenance.

gdalle commented 3 years ago

Yes, let's start with @pdeffebach's suggestion and shake things up in the stdlib.

nalimilan commented 3 years ago

In theory it shouldn't be too hard to support sparse array-backed CategoricalArrays. This would mainly require adding another type parameter corresponding to the type of the refs field. Maybe some places would have to be adapted to use similar instead of allocating an Array (need to check). Since 0 is already used to represent #undef or missing values, a sparse array of integer values should work automatically.

You're welcome to experiment with this if it would be useful for you, though I can't promise I will accept the change if it turns out to increase the package's complexity too much. As you can see, the number of type parameters is already quite large... (One possible advantage of supporting any array type for refs is that CatArrOrSub could be dropped in favor of having view return a CategoricalArray wrapping a SubArray, rather than the other way around, as we do now.)
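
A minimal sketch of the design described above, with the refs container as a type parameter; the names here are illustrative only, not CategoricalArrays' actual types:

```julia
using SparseArrays

# Hypothetical categorical wrapper generic over the refs container, so a
# SparseVector of integer refs can back it; 0 encodes missing, as in the
# refs convention CategoricalArrays already uses.
struct SparseRefCategorical{T, R <: AbstractVector{<:Integer}} <: AbstractVector{Union{T, Missing}}
    refs::R            # 0 means missing; i > 0 points at pool[i]
    pool::Vector{T}    # the levels
end

Base.size(a::SparseRefCategorical) = size(a.refs)
Base.getindex(a::SparseRefCategorical, i::Int) =
    iszero(a.refs[i]) ? missing : a.pool[a.refs[i]]

# Only rows 1 and 5 carry a value; the other refs are implicit zeros.
a = SparseRefCategorical(sparsevec([1, 5], UInt8[1, 2], 5), ["low", "high"])
a[1]   # "low"
a[2]   # missing
```

The real implementation would need much more (levels handling, setindex!, similar, etc.), but this shows why a sparse refs vector composes naturally with the existing 0-means-missing convention.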

Out of curiosity, could you develop in what kind of situation you have this kind of data?

gdalle commented 3 years ago

My data points were generated by many semi-independent sub-systems, and not every sub-system logs the same columns, which means most columns are very sparse.