JuliaAI / MLJDecisionTreeInterface.jl

MIT License
9 stars 4 forks source link

why do input scitypes not include `Multiclass`? #10

Closed ExpandingMan closed 2 years ago

ExpandingMan commented 2 years ago

When training with Multiclass inputs, I get warnings like

┌ Warning: The scitype of `X`, in `machine(model, X, ...)` is incompatible with `model=DeterministicTunedModel{Grid,…}`:
│ scitype(X) = Table{Union{AbstractVector{Continuous}, AbstractVector{Count}, AbstractVector{Multiclass{12}}, AbstractVector{Multiclass{1}}, AbstractVector{Multiclass{6}}}}
│ input_scitype(model) = Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Count}, AbstractVector{<:OrderedFactor}}}.
└ @ MLJBase ~/.julia/packages/MLJBase/QXObv/src/machines.jl:133

Is this intended? I don't see why Multiclass would be included here, in fact, if I do models(matching(Xtrain, ytrain)) on my inputs, the models I'm attempting to use indeed show up:

◖◗ models(matching(Xtrain, ytrain))
4-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
 (name = ConstantRegressor, package_name = MLJModels, ... )
 (name = DecisionTreeRegressor, package_name = BetaML, ... )
 (name = DeterministicConstantRegressor, package_name = MLJModels, ... )
 (name = RandomForestRegressor, package_name = BetaML, ... )
tlienart commented 2 years ago

Please have a look at https://github.com/bensadeghi/DecisionTree.jl/issues/92 (specifically https://github.com/bensadeghi/DecisionTree.jl/issues/92#issuecomment-555737788)

Long story short DT.jl does not actually support un-ordered categorical inputs.

ExpandingMan commented 2 years ago

Oh man, :facepalm: .

I just thought about this more carefully and realized that yeah, one should not use categorical inputs here as the splits assume a certain ordering. The only real solution to this problem only makes sense for binary classification, so for regression (my case) either use OHE or lose a lot of splits.

Btw, might still be nice to have more documentation on this, though perhaps we can chalk opening this issue up to my stupidity.

tlienart commented 2 years ago

You're definitely not the first nor the last nor the last one to stumble upon this; it might make sense to add a comment to the DT issue though (and propose a PR there, ideally); MLJ can help flag such issues in dedicated package model but I don't think it's reasonable to have it keep track of all the issues 😅

ExpandingMan commented 2 years ago

It would be nice to support non-binary splits, there's a huge class of cases in which you only have 3 or 4 categories and it isn't necessarily prohibitive to have n-way splits. I haven't looked into how complicated that would be, my guess is that this package was written making the assumption of binary splits pretty much everywhere so it may not be so simple.

tlienart commented 2 years ago

I don't disagree but this conversation would be best had in the DT package and not here which is just an interface (which, actually, should be incorporated into DT eventually if someone ever picks that up...)

ablaom commented 2 years ago

I believe the BetaML version of this model (which has an MLJ interface) supports Multiclass and considers all possible splits, not just those consistent with an order. @sylvaticus

sylvaticus commented 2 years ago

Yes it does but, of course.. you pay a huge (computational) price, as all columns are treated as unordered. Maybe we could change the algorithm so that the user can specify the ordinality condition on a column base...

ExpandingMan commented 2 years ago

Yes, I was not suggesting that non-binary splits be the default, just that there would be an option to do this for Multiclass inputs. Even then, in most cases it would only be practical for features with just a few classes, but it seems like a common enough use case to be worth it.

ablaom commented 2 years ago

Maybe we could change the algorithm so that the user can specify the ordinality condition on a column base...

You will recall that MLJ users indicate this by the type of column they pass. A column is to be interpreted as an OrderedFactor if and only if it is a CategoricalVector v for which isordered(v) === true; it is intended to be interpreted as an unordered factor (Multiclass) if and only if it is a CategoricalVector and isordered(v) === false. You can determine which is the case by inspecting MLJModelInterface.schema(X) (assuming MLJBase is loaded).