JuliaData / DataAPI.jl

A data-focused namespace for packages to share functions
Other
33 stars 13 forks source link

Confusing `levels` fallback #54

Open aplavin opened 1 year ago

aplavin commented 1 year ago

Looking at the concrete implementation of levels in CategoricalArrays, I see that this function conveniently extracts possible levels from both arrays and individual values:

julia> a = CategoricalArray(["abc", "def", "abc"])
3-element CategoricalArray{String,1,UInt32}:
 "abc"
 "def"
 "abc"

julia> x = a[1]
CategoricalValue{String, UInt32} "abc"

julia> levels(a)
2-element Vector{String}:
 "abc"
 "def"

julia> levels(x)
2-element Vector{String}:
 "abc"
 "def"

However, the fallback implemented in DataAPI itself is only correct for collections, not individual values:

# like unique(x), makes sense?
julia> levels(["abc", "def", "abc"])
2-element Vector{String}:
 "abc"
 "def"

# makes no sense:
julia> levels("abc")
3-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

# especially confusing given that:
julia> x == "abc"
true

Maybe, the fallback shouldn't exist at all, and something like haslevels(x)::Bool added instead?

bkamins commented 1 year ago

is only correct for collections

The problem is different. The function is correct, but it is defined for all collections, e.g.:

julia> levels("abc")
3-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

The problem is, that many collections, like "abc" or 1 are perceived by users as individual values not as collections.

The same problem is with unique.

What we maybe could do is changing that levels is defined for arrays by default (the problem is that it would be breaking).

aplavin commented 1 year ago

What we maybe could do is changing that levels is defined for arrays by default

There are also arrays commonly perceived as individual values - most notably, SVectors and FieldArrays from StaticArrays.

My point is that "levels of a collection" and "levels of a value" are fundamentally different functions, if talking about their generic versions, not specific levels(::CategoricalArray) and levels(::CategoricalValue). So, maybe the fallback shouldn't be defined at all? Is there any actual usage of it?

bkamins commented 1 year ago

Let me summarize the discussion, so that I am clear what problem we want to resolve (I think you raised it in Slack but I do not remember the details, and soon it will be lost on Slack). Current state is the following:

In what cases the current behavior is problematic (I mean in practice; I understand your conceptual concerns so there is no need to comment on this).

aplavin commented 1 year ago

In what cases the current behavior is problematic

First, I saw the levels(::CategoricalValue) function. It returns the set of levels the value is chosen from, exactly what I needed. Then, I wanted to support both categorical and non-categorical values, and needed a way to retrieve levels only if they are defined for the value. So, I tried levels(any other value) such as levels("abc"). I expected nothing as the result, or at worst ("abc",), but definitely not ['a', 'b', 'c'].

So, currently it looks like levels is hardly useable with generic values - one needs to know in advance that the input is a CategoricalValue, otherwise levels(val) don't make much sense.

bkamins commented 1 year ago

one needs to know in advance that the input is a CategoricalValue, otherwise levels(val) don't make much sense.

I understand the current design the opposite way (I do not want to say that it should not be changed, but I want to express my current understanding of the design):

So, you should not call levels on any value that is not a collection, unless you know beforehand that it is a CategoricalValue.

Therefore, what you essentially ask for is how to check that a non-collection is CategoricalValue without having CategoricalArrays.jl as a dependency. This could be doable (e.g. we could add some DataAPI.jl method to check for it). However, a normal way to do it would be to add dependency on CategoricalArrays.jl and just check if some value is a CategoricalValue.

An alternative would be to add something like parentlevels function, so that parentlevels would be defined for CategoricalValue as the same as levels, and would be nothing by default as you want. @nalimilan - what do you think?

aplavin commented 1 year ago

Yes, I do understand the levels-of-collection PoV, maybe it's more common indeed. I'm talking from the perspective of someone who only needed "parent levels" of a single value, and found that the levels method does exactly what's needed for CategoricalValues (but only for them).

bkamins commented 1 year ago

Agreed. That is why maybe adding parentlevels to be explicit makes sense. Let us wait for @nalimilan to respond what he things, as he designed these features.

nalimilan commented 1 year ago

Yes, I think the best solution would be to add an API that allows checking whether a value or array is categorical, whose only implementations (currently) would be CategoricalArray and CategoricalValue.

It's true that the fact that the levels fallback works on any iterable value and can return weird results for strings and numbers is a bit confusing, but introducing parentlevels would be overkill IMO.

bkamins commented 1 year ago

And for the mean time having CategoricalArrays.jl as a dependency is a temporary solution.