ExpandingMan closed this issue 2 years ago
Please have a look at https://github.com/bensadeghi/DecisionTree.jl/issues/92 (specifically https://github.com/bensadeghi/DecisionTree.jl/issues/92#issuecomment-555737788)
Long story short, DT.jl does not actually support unordered categorical inputs.
Oh man, :facepalm: .
I just thought about this more carefully and realized that yeah, one should not use categorical inputs here as the splits assume a certain ordering. The only real solution to this problem only makes sense for binary classification, so for regression (my case) either use OHE or lose a lot of splits.
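To make the "lose a lot of splits" point concrete, here is a minimal sketch in plain Python (not code from DecisionTree.jl or MLJ). It assumes the categories are coded as integers in a fixed order and that the tree only tries threshold splits on those codes; the helper names `threshold_splits` and `all_binary_splits` are hypothetical.

```python
from itertools import combinations

def threshold_splits(levels):
    """Binary splits reachable by thresholding the levels in their given order."""
    return [(frozenset(levels[:i]), frozenset(levels[i:]))
            for i in range(1, len(levels))]

def all_binary_splits(levels):
    """Every way to divide the levels into two non-empty groups, ignoring order."""
    first, rest = levels[0], levels[1:]
    splits = []
    # Pin the first level to the left group so mirror-image splits are listed once.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = frozenset((first,) + extra)
            right = frozenset(levels) - left
            splits.append((left, right))
    return splits

levels = ["a", "b", "c"]
reachable = set(threshold_splits(levels))
# Splits that an order-based search can never propose.
missed = [s for s in all_binary_splits(levels) if s not in reachable]
```

With three levels, the threshold search can propose {a} vs {b, c} and {a, b} vs {c}, while `missed` holds the remaining grouping, {a, c} vs {b}, which no threshold on the codes 1, 2, 3 can express.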
Btw, might still be nice to have more documentation on this, though perhaps we can chalk opening this issue up to my stupidity.
You're definitely not the first nor the last one to stumble upon this. It might make sense to add a comment to the DT issue though (and, ideally, propose a PR there). MLJ can help flag such issues in the dedicated model package, but I don't think it's reasonable to have it keep track of all the issues 😅
It would be nice to support non-binary splits. There's a huge class of cases in which you only have 3 or 4 categories, and it isn't necessarily prohibitive to have n-way splits. I haven't looked into how complicated that would be; my guess is that this package was written assuming binary splits pretty much everywhere, so it may not be so simple.
I don't disagree but this conversation would be best had in the DT package and not here which is just an interface (which, actually, should be incorporated into DT eventually if someone ever picks that up...)
I believe the BetaML version of this model (which has an MLJ interface) supports Multiclass and considers all possible splits, not just those consistent with an order. @sylvaticus
Yes it does but, of course, you pay a huge (computational) price, as all columns are treated as unordered. Maybe we could change the algorithm so that the user can specify the ordinality condition on a per-column basis...
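A rough sense of that price, as a sketch in plain Python: a feature with k ordered levels admits k - 1 threshold splits, while k unordered levels admit 2^(k-1) - 1 distinct binary groupings. Whether BetaML actually enumerates all of them exhaustively is an assumption here, and the function names are hypothetical.

```python
def ordered_split_count(k):
    """Number of threshold splits for a feature with k ordered levels."""
    return max(k - 1, 0)

def unordered_split_count(k):
    """Number of ways to divide k unordered levels into two non-empty groups."""
    return 2 ** (k - 1) - 1 if k >= 1 else 0

# Compare how the two counts grow with the number of levels.
for k in (2, 4, 8, 16, 32):
    print(k, ordered_split_count(k), unordered_split_count(k))
```

The gap is negligible for 3 or 4 levels but grows exponentially, which is why a per-column choice between the two treatments would matter.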
Yes, I was not suggesting that non-binary splits be the default, just that there would be an option to do this for Multiclass inputs. Even then, in most cases it would only be practical for features with just a few classes, but it seems like a common enough use case to be worth it.
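As a hedged illustration of what an n-way split on a categorical feature could look like for regression, here is a small sketch in plain Python. It is not how DecisionTree.jl or BetaML implement splitting; the function names and the toy data are made up for the example, and the score is simply the reduction in sum of squared errors when each category gets its own child node.

```python
from collections import defaultdict

def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def nway_split(feature, target):
    """Group the targets by category: one child node per category."""
    children = defaultdict(list)
    for level, y in zip(feature, target):
        children[level].append(y)
    return dict(children)

def sse_reduction(feature, target):
    """Parent SSE minus the total SSE of the per-category children."""
    children = nway_split(feature, target)
    return sse(target) - sum(sse(ys) for ys in children.values())

# Toy data (hypothetical): category "b" behaves differently from "a" and "c".
feature = ["a", "b", "c", "a", "b", "c"]
target = [1.0, 5.0, 1.0, 1.0, 5.0, 1.0]
gain = sse_reduction(feature, target)
```

Each distinct category becomes a child, so the cost of evaluating the split is linear in the number of rows, but every split thins the data across as many children as there are levels. That is one reason this would mostly be practical for features with only a few classes.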
"Maybe we could change the algorithm so that the user can specify the ordinality condition on a per-column basis..."
You will recall that MLJ users indicate this by the type of column they pass. A column is to be interpreted as an OrderedFactor if and only if it is a CategoricalVector v for which isordered(v) === true; it is intended to be interpreted as an unordered factor (Multiclass) if and only if it is a CategoricalVector and isordered(v) === false. You can determine which is the case by inspecting MLJModelInterface.schema(X) (assuming MLJBase is loaded).
When training with Multiclass inputs, I get warnings like:
Is this intended? I don't see why Multiclass would be included here. In fact, if I do models(matching(Xtrain, ytrain)) on my inputs, the models I'm attempting to use indeed show up: