JuliaData / CategoricalArrays.jl

Arrays for working with categorical data (both nominal and ordinal)
Other
125 stars 33 forks source link

Allow strings and numbers as labels in cut #393

Closed skleinbo closed 2 years ago

skleinbo commented 2 years ago

Referring to https://discourse.julialang.org/t/creating-data-bins-with-numeric-labels-with-cut/81281/10?u=skleinbo

Inference seems to be fine. Tests pass on Julia 1.7.

Three things:

  1. There is a test error on 1.6, but that is due to Base.return_types not inferring correctly. The actual return type is as expected.
    That's not ideal. Should the test be taken out, somehow altered to pass on older Julia versions (don't know how tbh), or executed conditionally?
  2. Technically, mixed labels are possible when passed as a AbstractVector{<:SupportedTypes}. E.g. just passing [1,"2",3]::Vector{Any} fails, while Union{Int, String}[1,"2",3] is fine. Should an effort be made to automatically do a conversion? Given the IMHO obscurity of mixed labels, should this even be mentioned in the docs? I say no on both points.
  3. A label formatter must always return the same concrete type. That's a disparity to 2. and one could jump through the same hoops to allow for mixed labels, but is it worth it?
  4. The current cut implicitly casts labels of any AbstractString to String. While not promised, someone downstream might rely on that conversion.

Thanks!

bkamins commented 2 years ago

Could you please add to the docstring (or comment here) what contract change you exactly propose, i.e. the exact rules what inputs are accepted and what outputs they should produce. Thank you!

skleinbo commented 2 years ago

I'm sorry! Guess I jumped the gun there a bit.

Let SupportedTypes = Union{AbstractString, AbstractChar, Number}.

Currently: cut takes a labels::Union[AbstractString, Function::AbstractString} kwarg and returns CategoricalArray with levels of type String or Union{String, Missing} depending on whether the input array has missing values or extend=missing is given.

Proposed: Allow cut to take more general labels::Union{AbstractVector{L}, Function::L} where L<:SupportedTypes and return CategoricalArray with levels of type L or Union{L, Missing} depending on whether the input array has missing values or extend=missing is given.

It appears the limitation to String levels was introduced due to stability issues in the return type. As far as I can tell, those do not exist in more recent versions of Julia.

The change would mean closer feature parity with pandas.cut

In light of that, my above comments hopefully make more sense.

There is one thing that could potentially break something downstream. I've amended the original comment to keep bullet points together.

bkamins commented 2 years ago

OK - so I understand the use-case is to e.g. assign a floating middle of the interval as a level, so that later user can work with it programmatically. Right?

skleinbo commented 2 years ago

Exactly. I saw the (linked above) topic on discourse and wondered why that wasn't supported. Maybe you know of a good reason why not.

bkamins commented 2 years ago

@nalimilan designed it, but I guess the reason was that cut values are naturally considered to be labels. However, I agree that sometimes it is useful to be able to work with them programmatically later.

bkamins commented 2 years ago

Could you please add tests for typically problematic cases like: duplicate numeric labels, or having both -0.0 and 0.0 as labels? Thank you!

skleinbo commented 2 years ago

@nalimilan Thank you for the thorough review. I will push changes soon. I think I have tracked down an inference issue and it may just work on Julia 1.6. It had trouble inferring breaks with input arrays that had Missing in their type. I'll have to write more tests first though.

@bkamins Certainly!

skleinbo commented 2 years ago

Concerning -0.0, 0.0. This is valid independent of cut

julia> CategoricalArray([-0.0, 0.0, 1.0]; levels=[-0.0, 0.0, 1.0])
3-element CategoricalArray{Float64,1,UInt32}:
 -0.0
 0.0
 1.0

Do you want cut to check for duplicates with == rather than isequal regardless?

bkamins commented 2 years ago

Yes, it is valid in CategoricalArrays.jl and expected. I think the same should be respected in this PR (i.e. labels should be unique w.r.t. isequal). I just want it tested.

The point is that cat uses == for values:

julia> cut([-0.0, 0.0], 2)
ERROR: ArgumentError: could not extend breaks as all values are equal: please specify at least two breaks manually

but of course this is an orthogonal issue (it just prompted me to cover this case in tests)

skleinbo commented 2 years ago

The point is that cat uses == for values:

But only when used as cut(x, n), because then values pass through quantile I guess?` Anyway, orthogonal issue like you say.

@nalimilan Tests on Julia 1.0 failed, because of a begin in indexing, and on x86 because of an Int64 instead of Int in a test. Regarding the former, I find it difficult to remember what syntax was valid at which point. Wasn't there a tool to check for backward compatible syntax?

nalimilan commented 2 years ago

@nalimilan Tests on Julia 1.0 failed, because of a begin in indexing, and on x86 because of an Int64 instead of Int in a test. Regarding the former, I find it difficult to remember what syntax was valid at which point. Wasn't there a tool to check for backward compatible syntax?

I'm not aware of any automated tool. Compat.jl is often useful but it doesn't support all new syntaxes.

nalimilan commented 2 years ago

Thanks!

nalimilan commented 2 years ago

https://github.com/JuliaRegistries/General/pull/60890

skleinbo commented 2 years ago

Thank you for patient reviews :)