Closed skleinbo closed 2 years ago
Could you please add to the docstring (or comment here) what contract change you exactly propose, i.e. the exact rules what inputs are accepted and what outputs they should produce. Thank you!
I'm sorry! Guess I jumped the gun there a bit.
Let SupportedTypes = Union{AbstractString, AbstractChar, Number}
.
Currently: cut
takes a labels::Union[AbstractString, Function::AbstractString}
kwarg and returns CategoricalArray
with levels of type String
or Union{String, Missing}
depending on whether the input array has missing values or extend=missing
is given.
Proposed: Allow cut
to take more general labels::Union{AbstractVector{L}, Function::L} where L<:SupportedTypes
and return CategoricalArray
with levels of type L
or Union{L, Missing}
depending on whether the input array has missing values or extend=missing
is given.
It appears the limitation to String levels was introduced due to stability issues in the return type. As far as I can tell, those do not exist in more recent versions of Julia.
The change would mean closer feature parity with pandas.cut
In light of that, my above comments hopefully make more sense.
There is one thing that could potentially break something downstream. I've amended the original comment to keep bullet points together.
OK - so I understand the use-case is to e.g. assign a floating middle of the interval as a level, so that later user can work with it programmatically. Right?
Exactly. I saw the (linked above) topic on discourse and wondered why that wasn't supported. Maybe you know of a good reason why not.
@nalimilan designed it, but I guess the reason was that cut values are naturally considered to be labels. However, I agree that sometimes it is useful to be able to work with them programmatically later.
Could you please add tests for typically problematic cases like: duplicate numeric labels, or having both -0.0
and 0.0
as labels? Thank you!
@nalimilan Thank you for the thorough review. I will push changes soon. I think I have tracked down an inference issue and it may just work on Julia 1.6. It had trouble inferring breaks
with input arrays that had Missing
in their type. I'll have to write more tests first though.
@bkamins Certainly!
Concerning -0.0, 0.0
. This is valid independent of cut
julia> CategoricalArray([-0.0, 0.0, 1.0]; levels=[-0.0, 0.0, 1.0])
3-element CategoricalArray{Float64,1,UInt32}:
-0.0
0.0
1.0
Do you want cut
to check for duplicates with ==
rather than isequal
regardless?
Yes, it is valid in CategoricalArrays.jl and expected. I think the same should be respected in this PR (i.e. labels should be unique w.r.t. isequal
). I just want it tested.
The point is that cat
uses ==
for values:
julia> cut([-0.0, 0.0], 2)
ERROR: ArgumentError: could not extend breaks as all values are equal: please specify at least two breaks manually
but of course this is an orthogonal issue (it just prompted me to cover this case in tests)
The point is that
cat
uses==
for values:
But only when used as cut(x, n)
, because then values pass through quantile
I guess?`
Anyway, orthogonal issue like you say.
@nalimilan Tests on Julia 1.0 failed, because of a begin
in indexing, and on x86
because of an Int64
instead of Int
in a test. Regarding the former, I find it difficult to remember what syntax was valid at which point. Wasn't there a tool to check for backward compatible syntax?
@nalimilan Tests on Julia 1.0 failed, because of a
begin
in indexing, and onx86
because of anInt64
instead ofInt
in a test. Regarding the former, I find it difficult to remember what syntax was valid at which point. Wasn't there a tool to check for backward compatible syntax?
I'm not aware of any automated tool. Compat.jl is often useful but it doesn't support all new syntaxes.
Thanks!
Thank you for patient reviews :)
Referring to https://discourse.julialang.org/t/creating-data-bins-with-numeric-labels-with-cut/81281/10?u=skleinbo
Inference seems to be fine. Tests pass on
Julia 1.7
.Three things:
1.6
, but that is due toBase.return_types
not inferring correctly. The actual return type is as expected.That's not ideal. Should the test be taken out, somehow altered to pass on older Julia versions (don't know how tbh), or executed conditionally?
AbstractVector{<:SupportedTypes}
. E.g. just passing[1,"2",3]::Vector{Any}
fails, whileUnion{Int, String}[1,"2",3]
is fine. Should an effort be made to automatically do a conversion? Given the IMHO obscurity of mixed labels, should this even be mentioned in the docs? I say no on both points.cut
implicitly casts labels of anyAbstractString
toString
. While not promised, someone downstream might rely on that conversion.Thanks!