JuliaData / CategoricalArrays.jl

Arrays for working with categorical data (both nominal and ordinal)

Problem with unstack from DataFrames.jl on CategoricalVector #380

Closed bkamins closed 2 years ago

bkamins commented 2 years ago

@nalimilan - do you have an idea how we could fix this:

julia> df = DataFrame(row=[1,1,2,2], col=["a","b","a","b"],val=categorical('a':'d'))
4×3 DataFrame
 Row │ row    col     val  
     │ Int64  String  Cat… 
─────┼─────────────────────
   1 │     1  a       a
   2 │     1  b       b
   3 │     2  a       c
   4 │     2  b       d

julia> unstack(df,:row,:col,:val, fill=1)
2×3 DataFrame
 Row │ row    a    b   
     │ Int64  Any  Any 
─────┼─────────────────
   1 │     1  a    b
   2 │     2  c    d

julia> unstack(df,:row,:col,:val, fill=1).a
2-element Vector{Any}:
 CategoricalValue{Char, UInt32} 'a'
 CategoricalValue{Char, UInt32} 'c'

Or should we decide that this is the intended behavior?

This decision also affects https://github.com/JuliaData/DataFrames.jl/pull/3012 where we have the same issue.

nalimilan commented 2 years ago

One solution would be to ignore the type of fill when choosing the element type of the column to allocate. That would be problematic, in particular for missing, but also, e.g., for fill=1.5 if the original columns are integers (less likely). We could special-case missing, but that's not ideal.

Is there any reason to think that people may do this kind of thing? Even without a CategoricalArray, you'd get a column with element type Any, which is usually not what one wants.
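
To make the trade-off concrete, here is a minimal sketch using only Base's promote_type. It assumes, and this is an assumption rather than something confirmed in the thread, that unstack picks the output element type by combining the value column's element type with typeof(fill) in a way that behaves like promote_type. Under that assumption the three cases mentioned above look like this:

```julia
# Base-only illustration; no packages are required.
# Assumption: unstack combines eltype(value column) and typeof(fill)
# in a way that behaves like promote_type.

# Char values with an integer fill: no common concrete type exists.
@show promote_type(Char, Int)      # Any

# Integer values with fill=1.5: the fill type has to be honored,
# otherwise 1.5 could not be stored in the column.
@show promote_type(Int, Float64)   # Float64

# Integer values with fill=missing: the column must allow missing.
@show promote_type(Int, Missing)   # Union{Missing, Int64}
```

Ignoring the type of fill would give a tighter element type in the first case but would break the second and third, which is why special-casing comes up.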

bkamins commented 2 years ago

Maybe there is no good solution to this, and we should instead add a note to the docstring explaining what to do with categorical columns as a special case (you have to pass a CategoricalValue as fill to keep the column categorical)? Also note that:

julia> df = DataFrame(row=[1,1,2,2], col=["a","b","a","b"],val=categorical('a':'d'))
4×3 DataFrame
 Row │ row    col     val
     │ Int64  String  Cat…
─────┼─────────────────────
   1 │     1  a       a
   2 │     1  b       b
   3 │     2  a       c
   4 │     2  b       d

julia> unstack(df,:row,:col,:val, fill='e')
2×3 DataFrame
 Row │ row    a     b
     │ Int64  Char  Char
─────┼───────────────────
   1 │     1  a     b
   2 │     2  c     d

julia> unstack(df,:row,:col,:val, fill='e').a
2-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

So if you pass a fill of the correct base type, the values get unwrapped.
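
For completeness, a hedged sketch of the other direction, passing a CategoricalValue as fill. It takes the fill from the column itself by indexing, since indexing a CategoricalVector returns a CategoricalValue attached to the same pool. Choosing df.val[1] as the fill is purely illustrative, and the sketch prints the resulting column type instead of asserting what it will be:

```julia
using DataFrames, CategoricalArrays

df = DataFrame(row=[1, 1, 2, 2], col=["a", "b", "a", "b"],
               val=categorical('a':'d'))

# Indexing a CategoricalVector yields a CategoricalValue tied to its pool,
# so this fill has the same wrapped type as the entries of df.val.
# Using the first entry as the fill is an arbitrary, illustrative choice.
fill_value = df.val[1]

wide = unstack(df, :row, :col, :val, fill=fill_value)

# Inspect the resulting column type rather than assuming it.
@show typeof(wide.a)
```

This example has no missing row/column combinations, so the fill is never actually written into the result; it only influences the element type chosen for the new columns.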

nalimilan commented 2 years ago

This can be closed, right?

bkamins commented 2 years ago

OK