Closed tlienart closed 4 years ago
The issue here is y
is essentially corrupted. The elements are indeed Multiclass
, because that's what the pool of each element says, but the wrapper type (CategoricalArray) - which has its own pool, is saying the type is OrderedFactor
. The source of this bug is most likely the new "magic" introduced in CategoricalArrays to automatically construct a CategoricalArray wrapper when a broadcasting operation generates a vector of categorical elements.
I will investigate further.
julia> y[1].pool.ordered
false
julia> y.pool.ordered
true
This was never an issue before the performance improvements because we never looked at the wrapper type before then.
Okay, I would regard this as a bug in CategoricalArrays:
julia> v=categorical([:x, :y, :x])
3-element CategoricalArray{Symbol,1,UInt32}:
:x
:y
:x
julia> vordered = categorical(v, ordered=true)
3-element CategoricalArray{Symbol,1,UInt32}:
:x
:y
:x
julia> vordered[1].pool.ordered
false
Ok no so the merge does NOT fix the issue, example
using RDatasets
smarket = dataset("ISLR", "Smarket")
y = smarket.Direction
y = coerce(y, OrderedFactor)
then
julia> scitype(y)
AbstractArray{OrderedFactor{2},1}
julia> scitype_union(y)
Multiclass{2}
I should have tested this more extensively.
It looks like my request for elscitype
still holds then...
julia> y = smarket.Direction;
julia> yc = coerce(y, OrderedFactor);
julia> [scitype(el) for el in yc]
1250-element Array{DataType,1}:
Multiclass{2}
Multiclass{2}
Multiclass{2}
Right... so it's an issue with scitype
of an element...
julia> scitype(yc[1])
Multiclass{2}
julia> scitype(yc)
AbstractArray{OrderedFactor{2},1}
yeah....
Ok so it seems that it's the ordered!
function that craps itself as noted in https://github.com/JuliaData/CategoricalArrays.jl/issues/226
I've added an extra hack that forces the pool to be marked as ordered... let's hope for something better from CA soon.
With:
julia> y1 = smarket.Direction[1:10];
julia> y2 = coerce(y1, OrderedFactor);
julia> scitype_union(y2)
OrderedFactor{2}
As a side note, I still think a elscitype
would be better than scitype_union
which goes over all elements, while this may be warranted in some cases, in general we don't care, we just want the type that appears in scitype
so the:
get_elscitype(st::Type{AbstractArray{T,N}}) where {T,N} = T
elscitype(X::AbstractArray) = scitype(X) |> _get_elscitype
that I suggested earlier, I'm opening another PR with that as I'd prefer to use that in confusion_matrix
for instance, it's bound to be faster...
The "new" issue is resolved by CategoricalArrays 0.7.3:
using RDatasets
y = smarket.Direction
y = coerce(y, OrderedFactor)
julia> scitype(y)
AbstractArray{OrderedFactor{2},1}
julia> scitype_union(y)
OrderedFactor{2}
Closing as resolved.
Will comment on elscitype
suggestion at #58.
Context: https://github.com/alan-turing-institute/MLJBase.jl/issues/111
Desired
Existing
(is the second one a bug?)
Possible implementation
(can be extended easily to tables etc)
Result