Open aplavin opened 2 years ago
is only correct for collections
The problem is different. The function is correct, but it is defined for all collections, e.g.:
julia> levels("abc")
3-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
The problem is, that many collections, like "abc"
or 1
are perceived by users as individual values not as collections.
The same problem is with unique
.
What we maybe could do is changing that levels
is defined for arrays by default (the problem is that it would be breaking).
What we maybe could do is changing that levels is defined for arrays by default
There are also arrays commonly perceived as individual values - most notably, SVector
s and FieldArray
s from StaticArrays
.
My point is that "levels of a collection" and "levels of a value" are fundamentally different functions, if talking about their generic versions, not specific levels(::CategoricalArray)
and levels(::CategoricalValue)
. So, maybe the fallback shouldn't be defined at all? Is there any actual usage of it?
Let me summarize the discussion, so that I am clear what problem we want to resolve (I think you raised it in Slack but I do not remember the details, and soon it will be lost on Slack). Current state is the following:
levels
recovers levels of a collection;CategoricalValue
as a special case levels
recovers levels
from a source array of this categorical value.In what cases the current behavior is problematic (I mean in practice; I understand your conceptual concerns so there is no need to comment on this).
In what cases the current behavior is problematic
First, I saw the levels(::CategoricalValue)
function. It returns the set of levels the value is chosen from, exactly what I needed.
Then, I wanted to support both categorical and non-categorical values, and needed a way to retrieve levels only if they are defined for the value. So, I tried levels(any other value)
such as levels("abc")
. I expected nothing
as the result, or at worst ("abc",)
, but definitely not ['a', 'b', 'c']
.
So, currently it looks like levels
is hardly useable with generic values - one needs to know in advance that the input is a CategoricalValue, otherwise levels(val)
don't make much sense.
one needs to know in advance that the input is a
CategoricalValue
, otherwiselevels(val)
don't make much sense.
I understand the current design the opposite way (I do not want to say that it should not be changed, but I want to express my current understanding of the design):
levels
normally should be only called on a collection; you should not call it on a value that is not a collectionCategoricalValue
you can call it to get levels
of a parent of this valueSo, you should not call levels
on any value that is not a collection, unless you know beforehand that it is a CategoricalValue
.
Therefore, what you essentially ask for is how to check that a non-collection is CategoricalValue
without having CategoricalArrays.jl as a dependency. This could be doable (e.g. we could add some DataAPI.jl method to check for it). However, a normal way to do it would be to add dependency on CategoricalArrays.jl and just check if some value is a CategoricalValue
.
An alternative would be to add something like parentlevels
function, so that parentlevels
would be defined for CategoricalValue
as the same as levels
, and would be nothing
by default as you want. @nalimilan - what do you think?
Yes, I do understand the levels-of-collection PoV, maybe it's more common indeed. I'm talking from the perspective of someone who only needed "parent levels" of a single value, and found that the levels
method does exactly what's needed for CategoricalValues (but only for them).
Agreed. That is why maybe adding parentlevels
to be explicit makes sense. Let us wait for @nalimilan to respond what he things, as he designed these features.
Yes, I think the best solution would be to add an API that allows checking whether a value or array is categorical, whose only implementations (currently) would be CategoricalArray
and CategoricalValue
.
It's true that the fact that the levels
fallback works on any iterable value and can return weird results for strings and numbers is a bit confusing, but introducing parentlevels
would be overkill IMO.
And for the mean time having CategoricalArrays.jl as a dependency is a temporary solution.
Looking at the concrete implementation of
levels
inCategoricalArrays
, I see that this function conveniently extracts possible levels from both arrays and individual values:However, the fallback implemented in
DataAPI
itself is only correct for collections, not individual values:Maybe, the fallback shouldn't exist at all, and something like
haslevels(x)::Bool
added instead?