Closed juliohm closed 5 years ago
Ok the error I'm seeing when reproducing this is:
julia> MLJBase.predict(m, θ, X)
ERROR: classes must have CategoricalValue or CategoricalString type.
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] UnivariateFinite(::Array{Int64,1}, ::Array{Float64,1}) at /Users/tlienart/.julia/dev/MLJBase/src/distributions.jl:111
[3] (::MLJModels.DecisionTree_.var"##4#5"{DecisionTreeClassifier,Array{Int64,1},Array{Float64,2}})(::Int64) at ./none:0
[4] iterate at ./generator.jl:47 [inlined]
[5] collect at ./array.jl:620 [inlined]
[6] predict(::DecisionTreeClassifier, ::Tuple{DecisionTree.Node{Float64,Int64},Array{Int64,1}}, ::Array{Float64,2}) at /Users/tlienart/.julia/dev/MLJModels/src/DecisionTree.jl:173
[7] top-level scope at REPL[7]:1
adding a categorical
to y
fixes the issue
X = rand(100, 2)
y = categorical(rand(Int, 100))
...
will work
Thank you. Is the fit/predict API described formally somewhere? I would like to be able to handle all possible cases as I expand the framework. Right now some decisions like MLJBase.table
and MLJBase.categorical
are not very clear. I don't understand why we need to explicitly call categorical
, isn't the Int type a Finite
scitype? Can't we proceed without converting to categorical manually?
Right now, having to deal with the classification case manually doesn't seem ideal. As you can see, the general fit/predict I was using didn't need to know if I was in a classification versus regression task. Now I need to insert a new "if" statement in the code to handle this special case:
Can we leverage multiple-dispatch to provide a higher-level abstraction? What am I missing?
@juliohm
No, in MLJ you cannot represent categorical variables with integers. See the bottom of Getting Started and the table at the end of the ScientificTypes.jl README for details.
You can coerce the data in a vector or column so that it has the type expected by MLJ using the coerce
method, after inspecting using schema
. See the ScientificTypes.jl README for details. These methods are re-exported to MLJ and MLJBase. There is an open issue to improve this workflow further here
The decision to not allow integer representation of categorical is deliberate. A typical problem encountered with integers for categorical is this: If there are lots of levels (eg, supermarket items, medical diagnoses, words in a language) one trains on a subsample only to discover that the evaluation set contains levels not seen in the train, and your evaluation crashes. In some frameworks (Scitlearn, for example) this issue is completely ignored, which causes no end of anguish to newcomers. To avoid such problems, one has either to pass around around metadata on the number of levels, or use a suitable wrapper (like CategoricalVectors, from CategoricalArrays) or use a dedicated categorical value type (such as CategoricalValue from the same package) where every element points to a pool of all the levels (and their labels). We have chosen the last option, but as calling getindex
on a CategoricalVector element returns a CategorialValue, one can use also present the wrapped objects to MLJ. BTW, the R frameworks all use dedicated categorical element types.
Another reason for using the CategoricalVectors is to do with interpretability/usability. It's very convenient to keep the original handle on the levels, and these handles propagate when one one-hot encodes and so forth. In the physical sciences, this usually is not such an issue, but elsewhere where you have lots of levels, it is more important.
Note that most libraries for ingesting tabular data (eg, CSV) allow you to automatically convert String type data to CategoricalVectors, as a matter of course.
Thank you. Is the fit/predict API described formally somewhere?
The definitive document is https://alan-turing-institute.github.io/MLJ.jl/stable/adding_models_for_general_use/.
Regarding this code:
function g()
X = rand(100,2)
y = rand(Int, 100)
m = DecisionTreeClassifier()
θ, _, __ = MLJBase.fit(m, 0, X, y)
MLJBase.predict(m, θ, X)
end
@tlienart has correctly identified the cause of the crash but you should keep in mind that you use a matrix for X
at your own risk. A table is required:
julia> info("DecisionTreeClassifier")
OrderedCollections.LittleDict{Symbol,Any,Array{Symbol,1},Array{Any,1}} with 13 entries:
:name => "DecisionTreeClassifier"
:package_name => "DecisionTree"
:package_url => "https://github.com/bensadeghi/DecisionTree.jl"
:package_license => "unkown"
:load_path => "MLJModels.DecisionTree_.DecisionTreeClassifier"
:is_wrapper => false
:is_pure_julia => true
:package_uuid => "7806a523-6efd-50cb-b5f6-3fa6f1930dbb"
:supports_weights => false
:is_supervised => true
:is_probabilistic => true
:input_scitype => Table{#s13} where #s13<:(AbstractArray{#s12,1} where #s12<:C…
:target_scitype => AbstractArray{#s24,1} where #s24<:Finite
Closing as no MLJ bug identified. Feel free to re-open if one is identified, or open a relevant design discussion instead.
Just sharing that I am really appreciating this brainstorming process. Today I fixed a long-standing issue in GeoStats.jl. Now every spatial data object satisfies the Tables.jl API. I've updated my learn/perform to use this API. This means that I can pass my spatial data directly and it is interpreted as if it was a table in the MLJ.jl methods. 💯
Describe the bug
stopped working.
To Reproduce
Expected behavior The code was working fine in MLJ < 0.4
Additional context Julia v1.2