Open gdalle opened 3 years ago
I have no particular contributions to offer, but I am interested in learning about the subject and would like to ask a few questions. For example, I would like to get a more precise idea of the quantities involved. Can you show with examples how large your original database is now and how much you would like to reduce it?
Essentially I have thousands of columns with categorical entries and lots (> 90%) of missing
values, so I thought it might be useful. However, it turns out that sparse columns in a dataframe are not a good idea (thanks @pdeffebach for this minimal not-working example):
```julia
julia> using DataFramesMeta, SparseArrays;

julia> df = DataFrame(x = sparse([1, 0, 0, 0, 1]))
5×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     0
   3 │     0
   4 │     0
   5 │     1

julia> @subset! df :x .== 1
ERROR: MethodError: no method matching deleteat!(::SparseVector{Int64, Int64}, ::Vector{Int64})
Closest candidates are:
  deleteat!(::Vector{T} where T, ::AbstractVector{T} where T) at array.jl:1377
  deleteat!(::Vector{T} where T, ::Any) at array.jl:1376
  deleteat!(::BitVector, ::Any) at bitarray.jl:981
  ...
Stacktrace:
 [1] _delete!_helper(df::DataFrame, drop::Vector{Int64})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:999
 [2] delete!(df::DataFrame, inds::Vector{Int64})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:976
 [3] subset!(df::DataFrame, args::Any; skipmissing::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/subset.jl:292
 [4] top-level scope
   @ ~/.julia/packages/DataFramesMeta/lbAjC/src/macros.jl:681
```
> it turns out that sparse columns in a dataframe are not a good idea

This is unrelated to DataFrames.jl. The problem is that your column type does not define `deleteat!`. If you add a method for this function for your column type, everything will work.
But I guess there must be a good reason why `deleteat!` is not implemented for sparse vectors? Would there be a performance penalty?
`CategoricalArray`s are already "sparse" in the sense that they don't take up the full memory. So I don't think this is needed.
```julia
julia> using CategoricalArrays

julia> x = rand(1:5, 10_000);

julia> y = categorical(x);

julia> sizeof(y)
16

julia> sizeof(x)
80000
```
> Would there be a performance penalty?

`deleteat!` could be implemented in general; I think it is just that no one has done it yet. The cost of this operation would not be that big (though of course bigger than for ordinary vectors). You would need to call `deleteat!` on the `nzind` and `nzval` fields, which are `Vector`s, and then decrement the indices in `nzind` that are larger than each deleted position.
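That description could be sketched as follows. This is a hypothetical `sparse_deleteat` helper, not an existing SparseArrays method, and since the length of a `SparseVector` cannot be changed in place, this sketch returns a new vector rather than mutating its argument:

```julia
using SparseArrays

# Hypothetical helper (not part of SparseArrays): remove the entries at
# positions `inds` (assumed sorted and unique). As described above, it deletes
# the matching entries from the stored index/value vectors, then shifts down
# the stored indices that come after each deleted position.
function sparse_deleteat(x::SparseVector, inds::AbstractVector{<:Integer})
    nzind = copy(SparseArrays.nonzeroinds(x))
    nzval = copy(nonzeros(x))
    for i in Iterators.reverse(inds)  # back to front so positions stay valid
        pos = searchsortedfirst(nzind, i)
        if pos <= length(nzind) && nzind[pos] == i
            deleteat!(nzind, pos)
            deleteat!(nzval, pos)
        end
        for j in eachindex(nzind)     # shift stored indices above i down by one
            nzind[j] > i && (nzind[j] -= 1)
        end
    end
    return SparseVector(length(x) - length(inds), nzind, nzval)
end

x = sparse([1, 0, 0, 0, 1])
sparse_deleteat(x, [2, 3, 4])  # sparse vector equivalent to [1, 1]
```

A real implementation would update the indices in a single pass instead of one pass per deleted position, but the cost is inherently linear in the number of stored entries either way.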
> don't take up the full memory. So I don't think this is needed.

But they do take memory "linear" in their size, while for sparse vectors the memory taken is proportional to the number of non-zero entries. If the data is very sparse, there is a very big difference between these two scenarios.
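The difference is easy to check with `Base.summarysize` (illustrative numbers; the exact sizes depend on the element and index types):

```julia
using SparseArrays

x = zeros(Int, 1_000_000);
x[1:1_000] .= 1;        # only 0.1% non-zero entries
s = sparsevec(x);

Base.summarysize(x)     # about 8 MB: linear in the length
Base.summarysize(s)     # about 16 kB: proportional to the stored entries
```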
Still, for the reason that @pdeffebach raises, this is probably not a top priority. However, your measurement of the memory footprint is incorrect. You should use `Base.summarysize`:
```julia
julia> using CategoricalArrays

julia> x = rand(1:5, 10_000);

julia> y = categorical(x);

julia> z = categorical(x, compress=true);

julia> Base.summarysize(x)
80040

julia> Base.summarysize(y)
40584

julia> Base.summarysize(z)
10536
```
Yes, I agree with both of you. I just thought it would be a nice-to-have in my case, since categorical arrays still store one integer per value no matter what (the category rank). With `compress=true` this integer can be small, but maybe we can gain even more space by only remembering the indices where values are not missing.
Of course this isn't a top priority. However, I imagine it wouldn't be too hard to implement, which is why I wouldn't mind having a go myself.
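A minimal sketch of that idea (hypothetical names, not part of CategoricalArrays.jl): keep the logical length, the sorted positions of the non-missing entries, and one small ref per stored position, so that memory scales with the number of non-missing values:

```julia
# Hypothetical storage scheme for a mostly-missing categorical column:
# only the positions of non-missing entries and their category refs are kept.
struct SparseCatRefs
    n::Int               # logical length of the column
    inds::Vector{Int}    # sorted positions of non-missing entries
    refs::Vector{UInt8}  # category ref per stored position (small, as with compress=true)
end

# Return the ref at position i, with 0 meaning missing.
function getref(s::SparseCatRefs, i::Int)
    pos = searchsortedfirst(s.inds, i)
    pos <= length(s.inds) && s.inds[pos] == i ? s.refs[pos] : 0x00
end

s = SparseCatRefs(1_000, [3, 17], UInt8[1, 2])  # 1000 values, only two non-missing
getref(s, 3)   # 0x01
getref(s, 4)   # 0x00, i.e. missing
```

Lookup is O(log k) in the number of stored entries via binary search, versus O(1) for a dense refs vector, which is part of the trade-off.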
I would file an issue in Base, as well, about re-sizing sparse arrays. See what they think.
> I also imagined it wouldn't be too hard to implement, which is why I considered having a go myself
I think it would not be that hard. You would probably need to specify which value you want to be treated as the "zero" of the sparse array.
However, I would propose to wait for @nalimilan to comment how he sees adding it to CategoricalArrays.jl from a general perspective of package maintenance.
Yes, let's start with @pdeffebach's suggestion to shake things up in the stdlib.
In theory it shouldn't be too hard to support sparse array-backed `CategoricalArray`s. This would mainly require adding another type parameter corresponding to the type of the `refs` field. Maybe some places would have to be adapted to use `similar` instead of allocating an `Array` (need to check). Since `0` is already used to represent `#undef` or `missing` values, a sparse array of integer values should work automatically.

You're welcome to experiment with this if that would be useful for you. Though I cannot promise I will accept the change if it turns out it increases the package's complexity too much. As you can see, the number of type parameters is already quite large... (One possible advantage of supporting any array type for `refs` could be that `CatArrOrSub` could be dropped in favor of having `view` return a `CategoricalArray` wrapping a `SubArray`, rather than the contrary like we do now.)
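The parameterization described above could look roughly like this. This is a simplified, hypothetical mock-up, not the actual CategoricalArrays.jl definition: because a `SparseVector` of unsigned integers reads `0` for its non-stored entries, and `0` already encodes `missing`, missing values would come for free:

```julia
using SparseArrays

# Simplified, hypothetical mock-up of the proposed parameterization:
# the refs field may be any integer array, e.g. a SparseVector.
struct MockCategoricalArray{T, R<:Integer, A<:AbstractArray{R}}
    refs::A            # 0 = missing, otherwise an index into levels
    levels::Vector{T}
end

getvalue(a::MockCategoricalArray, i) =
    iszero(a.refs[i]) ? missing : a.levels[a.refs[i]]

# Dense refs, as CategoricalArray uses today:
dense = MockCategoricalArray(UInt8[1, 0, 2], ["low", "high"])

# Sparse refs: non-stored entries read as 0, i.e. missing:
sp = MockCategoricalArray(sparsevec([1, 5], UInt8[1, 2], 1_000), ["low", "high"])
getvalue(sp, 1)   # "low"
getvalue(sp, 2)   # missing
```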
Out of curiosity, could you explain in what kind of situation you encounter this kind of data?
My data points were generated by many semi-independent sub-systems, and not every sub-system logs the same columns, which means most columns are very sparse.
Hey there, thanks for this really nice package! In order to further decrease the memory footprint of my dataframes, I think a sparse version of `CategoricalArray` might be useful. I'm not really sure how to implement this yet, but I'm willing to take a look. Would you welcome a PR in this direction? Do you have any tips?