JuliaData / CategoricalArrays.jl

Arrays for working with categorical data (both nominal and ordinal)

CategoricalArray type not closed under `unique` method #129

Open ablaom opened 6 years ago

ablaom commented 6 years ago

When one applies the unique function to a categorical array, I would expect a categorical array of the same type to be returned, but this is not the case. I'm using Julia 0.6:

julia> CategoricalArray(["a","b","c", "a"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"
 "a"

julia> unique(ans)
3-element Array{String,1}:
 "a"
 "b"
 "c"

julia> VERSION
v"0.6.2"

nalimilan commented 6 years ago

It's not terribly useful to return a CategoricalArray in that case, since with unique values this type will take more memory than a standard Array. But indeed that also means you don't get CategoricalValue/CategoricalString objects which include ordering information.

What's your use case?

ablaom commented 6 years ago

Hi Milan,

Thanks for your message.

I have code that relabels vectors of arbitrary type into integer vectors, based on "training" the code on one particular vector, which is presumed to take on all values likely to be encountered in new vectors. My vectors are usually columns of a DataFrame. I want to switch between the two representations of the vectors. I use unique to determine what the values are but my code won't work as expected if unique changes the element type.

I do have a workaround. It appears that v -> collect(Set(v)) preserves the element type (at least for columns of DataFrames) and otherwise does the same thing.

Anthony

nalimilan commented 6 years ago

> I use unique to determine what the values are but my code won't work as expected if unique changes the element type.

More specifically, could you explain briefly why the code doesn't work if the types differ? I'm trying to evaluate whether this pattern can be common.

ablaom commented 6 years ago

Thanks for your message.

I am developing a Julia machine learning environment and am wrapping a learning algorithm that expects categorical features to have Int type. In my environment, data is initially, by default, in DataFrame form; I must therefore transform the categorical columns into integer vectors. The initial eltype of the column is unknown; we just know it represents a categorical. Note that I must record the actual labelling used, so that I can transform new instances of data (test data) later on.

I realise that your CategoricalArrays is already doing something like this under the hood, but as a user I don't want to bother looking inside :-) . Also, I don't know ahead of time if my column is indeed a CategoricalArray or something else.

Here is a simplified version of my code for relabelling a vector (a column of some DataFrame) with integers. Some obvious checks are missing.

# the data structure for storing the relabelling dictionaries:

struct ToIntScheme{T}
    int_given_T::Dict{T, Int}
    T_given_int::Dict{Int, T}
end

function fit(v::AbstractVector{T}) where T
    int_given_T = Dict{T, Int}()
    T_given_int = Dict{Int, T}()
    vals = collect(Set(v)) # <-- natural to use unique(v) here but then eltype(vals) != T

    i = 1
    for c in vals
        int_given_T[c] = i # <-- `c` must be type `T` here or I get an error
        T_given_int[i] = c
        i = i + 1
    end

    return ToIntScheme{T}(int_given_T, T_given_int)
end

# transform a scalar according to given scheme:

transform(scheme::ToIntScheme{T}, x::T) where T = scheme.int_given_T[x]

# demonstration:

using CategoricalArrays
v = [Char(rand(UInt8)) for i in 1:10^4];
v[1:10]
vcat = CategoricalVector(v);
typeof(vcat)
typeof(unique(vcat))
scheme = fit(vcat[1:end-1]); # fit to all but last element of vcat
y = transform(scheme, vcat[end]) # transform last element according to scheme

ablaom commented 5 years ago

Update: My code has moved on and the use-case above no longer exists. On reflection, I'm not sure there is a compelling reason to favour different behaviour. Feel free to close.

alyst commented 3 years ago

I have another case: code that is agnostic of the array representation and breaks if unique!() returns levels.

Suppose there is a function

nodes(edges::AbstractDataFrame) = DataFrame(id = sort!(unique(vcat(edges.source, edges.target))))

that works with the data frame representation of a graph (a data frame edges with columns source and target) and returns the data frame of the graph nodes. The type of the resulting id column is expected to be the same as that of edges.source and edges.target, but with the current behavior of unique(::CategoricalVector) it would be a Vector if source and target are categorical vectors. So user code expecting nodes.id to be categorical (e.g. levels(nodes.id)) would fail.

But there are also more annoying, subtle bugs. E.g. sort!(unique(...)) would sort by value, not by level index. So if the levels of source and target are not sorted, the order in nodes.id would differ from the order in levels(edges.source) etc.
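
To make the ordering point concrete, here is a minimal sketch in plain Julia (no CategoricalArrays involved). The names level_order, source, target and rank are made up for illustration: level_order stands in for what levels(edges.source) might return when the levels are not in sorted order, and the rank dictionary only mimics sorting by level index. It is not how the package implements sorting.

```julia
# Hypothetical level order, standing in for levels(edges.source);
# deliberately not alphabetical.
level_order = ["c", "a", "b"]

# Hypothetical edge endpoints, held as plain strings
# (i.e. after the categorical wrapper has been lost).
source = ["b", "a"]
target = ["c", "a"]

ids = unique(vcat(source, target))   # ["b", "a", "c"]

# Sorting plain values ignores the level order entirely.
by_value = sort(ids)                 # ["a", "b", "c"]

# Sorting by position in level_order imitates sorting by level index.
rank = Dict(l => i for (i, l) in enumerate(level_order))
by_level = sort(ids; by = x -> rank[x])   # ["c", "a", "b"]
```

The two results differ whenever the level order is not the natural sort order of the values, which is the mismatch described above.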