Open ablaom opened 6 years ago
It's not terribly useful to return a `CategoricalArray` in that case, since with unique values this type will take more memory than a standard `Array`. But indeed that also means you don't get `CategoricalValue`/`CategoricalString` objects which include ordering information.
What's your use case?
Hi Milan,
Thanks for your message.
I have code that relabels vectors of arbitrary type into integer vectors, based on "training" the code on one particular vector, which is presumed to take on all values likely to be encountered in new vectors. My vectors are usually columns of a DataFrame. I want to switch between the two representations of the vectors. I use `unique` to determine what the values are, but my code won't work as expected if `unique` changes the element type.
I do have a workaround. It appears that `v -> collect(Set(v))` is closed under type (at least for columns of DataFrames) and otherwise does the same thing.
Anthony
> I use `unique` to determine what the values are but my code won't work as expected if `unique` changes the element type.
More specifically, could you explain briefly why the code doesn't work if the types differ? I'm trying to evaluate whether this pattern can be common.
Thanks for your message.
I am developing a Julia machine learning environment and am wrapping a learning algorithm that expects categorical features to have Int type. In my environment, data is initially, by default, in DataFrame form; I must therefore transform the categorical columns into integer vectors. The initial eltype of the column is unknown; we just know it represents a categorical. Note that I must record the actual labelling used, so that I can transform new instances of data (test data) later on.
I realise that your CategoricalArrays is already doing something like this under the hood, but as a user I don't want to bother looking inside :-) . Also, I don't know ahead of time if my column is indeed a CategoricalArray or something else.
Here is a simplified version of my code for relabelling a vector (the columns of some DataFrame) with integers. Some obvious checks are missing.
    struct ToIntScheme{T}
        int_given_T::Dict{T, Int}
        T_given_int::Dict{Int, T}
    end

    function fit(v::AbstractVector{T}) where T
        int_given_T = Dict{T, Int}()
        T_given_int = Dict{Int, T}()
        vals = collect(Set(v)) # <-- natural to use unique(v) here but then eltype(vals) != T
        i = 1
        for c in vals
            int_given_T[c] = i # <-- `c` must be type `T` here or I get an error
            T_given_int[i] = c
            i = i + 1
        end
        return ToIntScheme{T}(int_given_T, T_given_int)
    end

    transform(scheme::ToIntScheme{T}, x::T) where T = scheme.int_given_T[x]

    using CategoricalArrays
    v = [Char(rand(UInt8)) for i in 1:10^4]; v[1:10]
    vcat = CategoricalVector(v); typeof(vcat)
    typeof(unique(vcat))
    scheme = fit(vcat[1:end-1]); # fit to all but last element of vcat
    y = transform(scheme, vcat[end]) # transform last element according to scheme
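As a point of comparison, here is a minimal Base-only sketch of the same relabelling idea, assuming a plain `Vector` as input. The names `LabelScheme`, `fit_labels`, `to_int` and `from_int` are hypothetical and belong to no package. The sketch takes the dictionary key type from whatever `unique` returns rather than from the input's element type, so building the scheme does not rely on `unique` preserving `T`. Whether lookups with wrapped categorical values then succeed is not tested here and should be treated as an open assumption.

```julia
# Hypothetical names throughout; a sketch, not the code from this thread.
struct LabelScheme{T}
    int_given_value::Dict{T, Int}   # value -> integer label
    value_given_int::Vector{T}      # integer label -> value (position i holds label i)
end

function fit_labels(v::AbstractVector)
    vals = collect(unique(v))       # unique keeps first-appearance order
    K = eltype(vals)                # key type comes from the result of unique, not from v
    int_given_value = Dict{K, Int}(c => i for (i, c) in enumerate(vals))
    return LabelScheme{K}(int_given_value, vals)
end

to_int(s::LabelScheme, x) = s.int_given_value[x]
from_int(s::LabelScheme, i::Integer) = s.value_given_int[i]

v = ['b', 'a', 'b', 'c']
scheme = fit_labels(v)
codes = [to_int(scheme, x) for x in v]
decoded = [from_int(scheme, i) for i in codes]
```

Because `unique` returns values in order of first appearance, the labels here are reproducible for a given input, whereas the iteration order of a `Set` is unspecified.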
Update: My code has moved on and the use-case above no longer exists. On reflection, I'm not sure there is a compelling reason to favour different behaviour. Feel free to close.
I have another case of the code that is agnostic of the array representation and breaks if `unique!()` returns levels.
Suppose there is a function
nodes(edges::AbstractDataFrame) = DataFrame(id = sort!(unique(vcat(edges.source, edges.target))))
that works with the dataframe representation of a graph (dataframe `edges` with columns `source` and `target`) and returns the data frame of the graph nodes.
It's expected that the type of the resulting `id` column is the same as `edges.source` and `edges.target`, but with the current behavior of `unique(::CategoricalVector)` it would be `Vector` if `source` and `target` are categorical vectors.
So the user code expecting `nodes.id` to be categorical (e.g. `levels(nodes.id)`) would fail.
But there are more annoying subtle bugs. E.g. `sort!(unique(...))` would sort by value, not by the level index.
So if the levels of `source` and `target` are not sorted, the order in `nodes.id` would be different than in `levels(edges.source)` etc.
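The difference between the two orderings can be sketched with Base alone. In this illustration `lvls` is a hypothetical stand-in for an unsorted `levels(edges.source)`, and the level-index lookup is emulated with a `Dict`; it is not how CategoricalArrays implements sorting.

```julia
# Hypothetical stand-in for an explicit, unsorted level order.
lvls = ["medium", "low", "high"]
vals = ["high", "low", "medium", "low"]

u = unique(vals)

# Sorting the raw values orders them lexicographically.
by_value = sort(u)

# Sorting by level index follows the declared level order instead.
level_index = Dict(l => i for (i, l) in enumerate(lvls))
by_level = sort(u, by = x -> level_index[x])
```

Here `by_value` is `["high", "low", "medium"]` while `by_level` is `["medium", "low", "high"]`, which is the kind of silent reordering described above.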
When one applies the `unique` function to a categorical array, I would expect a categorical array of the same type to be returned but this is not the case. I'm using Julia 0.6: