FluxML / OneHotArrays.jl

Memory efficient one-hot array encodings
https://fluxml.ai/OneHotArrays.jl/dev/
MIT License

Support Categorical Values directly #45

Open schlichtanders opened 6 months ago

schlichtanders commented 6 months ago

Motivation and description

In data science, CategoricalArrays.CategoricalValue, CategoricalArrays.CategoricalVector, and similar types appear often (RDatasets loads DataFrames with columns of that type by default).

It would be great if onehotbatch could be applied directly to these types.

I am new to this package and still figuring out how to transform such a categorical value or vector into a one-hot vector or matrix. It is quite possible that I missed something obvious.
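Until direct support exists, one possible workaround seems to need only functions that both packages already export. This is a minimal sketch, not a confirmed recommendation from either package: it assumes the vector contains no missing values, and it calls unwrap on the levels as well so that the labels are plain values regardless of what levels returns in a given CategoricalArrays release.

```julia
using CategoricalArrays, OneHotArrays

# Example data (made up for illustration)
v = categorical(["low", "high", "mid", "low"])

# Plain label values, in level order
labels = unwrap.(levels(v))

# unwrap strips the CategoricalValue wrapper from each element,
# so onehotbatch only ever sees plain strings
m = onehotbatch(unwrap.(v), labels)

size(m)  # (3, 4): one row per level, one column per element
```

Because the labels come from levels(v), the matrix also gets a row for any level that is declared but does not occur in the data.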

Possible Implementation

No response

mcabbott commented 6 months ago

Attempting to construct the minimal object:

julia> using CategoricalArrays, OneHotArrays

julia> cv = CategoricalArrays.CategoricalValue('b', CategoricalArray('a':'z'))
CategoricalValue{Char, UInt32} 'b'

julia> dump(cv)
CategoricalValue{Char, UInt32}
  pool: CategoricalPool{Char, UInt32, CategoricalValue{Char, UInt32}}
    levels: Array{Char}((26,))
      1: Char 'a'
      2: Char 'b'
      3: Char 'c'
      4: Char 'd'
      5: Char 'e'
      ...
      22: Char 'v'
      23: Char 'w'
      24: Char 'x'
      25: Char 'y'
      26: Char 'z'
    invindex: Dict{Char, UInt32}
      slots: Memory{UInt8}
        length: Int64 64
        ptr: Ptr{Nothing} @0x0000000160607020
    ...

julia> cv.pool.levels
26-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
...

julia> Int(cv.ref), length(cv.pool.levels)
(2, 26)

julia> OneHotArrays.onehot(cv::CategoricalValue) = OneHotVector(cv.ref, length(cv.pool.levels))

julia> onehot(cv)
26-element OneHotVector(::UInt32) with eltype Bool:
 ⋅
 1
 ⋅
 ⋅
 ⋅
 ⋅
...

julia> dump(onehot(cv))
OneHotVector{UInt32}
  indices: UInt32 0x00000002
  nlabels: Int64 26

Are these two integers all that's required, or are there more complicated examples?

schlichtanders commented 6 months ago

I think this is all, but I am not an expert on CategoricalArrays.
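The definition above reads the internal fields ref and pool. A variant built on the exported accessors levelcode and levels, and extended to vectors, might look like the sketch below. The helper names cat_onehot and cat_onehotbatch are hypothetical and not part of either package. The sketch assumes that levels accepts a single CategoricalValue and that the input contains no missing values (levelcode returns missing for those, which OneHotMatrix would not accept).

```julia
using CategoricalArrays, OneHotArrays

# Hypothetical helpers, defined locally rather than added to OneHotArrays.

# Single value: the level code is the hot index, the number of levels is the length
cat_onehot(cv::CategoricalValue) =
    OneHotVector(levelcode(cv), length(levels(cv)))

# Vector: one column per element, reusing the integer codes directly
cat_onehotbatch(v::CategoricalVector) =
    OneHotMatrix(levelcode.(v), length(levels(v)))

# Example data (made up for illustration)
v = categorical(['b', 'a', 'c', 'a'])

m = cat_onehotbatch(v)   # 3×4 one-hot matrix
x = cat_onehot(v[1])     # 3-element one-hot vector with the second entry hot
```

Working from the integer codes avoids looking up each element in the label list, so the cost should not depend on the number of levels.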