JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/
Other
1.79k stars 158 forks source link

Need help in creating a MLJModelInterface.Model interface of a complex model #744

Closed sylvaticus closed 3 years ago

sylvaticus commented 3 years ago

From this discourse thread:

Hi there, I am trying to build a MLJ interface for some ML algorithms in the BetaML package.

I am starting from the Decision Trees (I know decision trees are already available in MLJ, but I thought it was the easiest to start with), but I have a few questions.

The function creating (and fitting) the tree is:

buildTree(x, y::Array{Ty,1}; maxDepth = size(x,1), minGain=0.0, minRecords=2, maxFeatures=size(x,2), forceClassification=false, splittingCriterion = (Ty <: Number && !forceClassification) ? variance : gini, mCols=nothing) where {Ty}
  1. As you can see some parameters depend by default by the data, like maxFeatures depends on the dimensionality of the explanation variables. I understood that model parameters should be part of the model struct, but how do I set defaults without seeing the data ? - [SOLVED: I did set some default of the default]
  2. Even more hard, the algorithm that I am trying to wrap automatically performs a regression or a classification task (and, in the later case, it returns a probability distribution) depending on the type of the label, with the option to override the task with forceClassification. As in ML there are different type of models, probabilistic and deterministic, which one do I choose ? Or should I wrap it as two separate MLJ models ? [SOLVED: I created different MLJ models]
  3. Most of my models support Missing data in the input. I read that Missing is a scientific type per se. Should I declare an Union of supported types then, including the Missing ? [UNSOLVED, but later step]
  4. I have a case where my model doesn't fit the fit/predict workflow, that is a model that (using GMM/EM) predicts the missing values in a matrix, based on the degree of similarities of the other elements of the columns to the other rows. How to I wrap it with MLJ ? [UNSOLVED, but later step]
  5. Where can I find real-case examples ? For example, DecisionTrees.jl seems to be available through MLJ, but there is no code in the GitHub repo concerning MLJ.. [PARTIALLY SOLVED, it is in a specific interface repo, but would be nice so see an example embedded in the same model repository]
  6. My models live on submodules of the package. Will this be a problem for automatic discovery of the package/algorithm conditional to the user data ? [UNSOLVED, but later step]
  7. While I did somehow managed to write the MLJ interface for a deterministic model, I am trying to write the interface for a probabilistic model whose predict(model,X) method returns a vector of dictionary of label => prob. I normally use arrays of T for the Y, but I saw that it works also with Y being a CategoricalArray. However I am stuck here now, and don't know hot to return the prediction in the format wanted by MLJ. [SOLVED, but it was a pain]
  8. Should I name my Decision Trees DecisionTree, so that the user then select the desired one with the pkg keyword (there are already two available in MLJ) or would be preferable a more specific name, like BetaMLDecisionTree ?

Thank you!

(PS: the "Simple User Defined Models" still refers to MLJBase rather than MLJModelInterface. )

ablaom commented 3 years ago

Thank you for taking the time to wrap your head around the MLJ API. I appreciate this requires a substantial effort. Very happy to provide continuing guidance and am excited about the possibility of adding more models to the MLJ library.

Also, happy to take a call at some point, if you think that would be more efficient.

I think implementing a single end-to-end model first is a good idea, and DecisionTree is a good start (as there are two already).

Side question: The existing DecisionTree.jl models handle categorical features, but only if they represent ordered factors (scitype OrderedFactor in our parlance). It would be awfully nice if your tree model supported general categorical features (scitype Finite) (treated using Breiman's original algorithm). Is this the case?

  1. As you can see some parameters depend by default by the data, like maxFeatures depends on the dimensionality of the explanation variables. I understood that model parameters should be part of the model struct, but how do I set defaults without seeing the data ? - [SOLVED: I did set some default of the default]

Yes, hyper-parameters are viewed as data-independent concepts. If a default value depends on the data, you can do maxFeatures=nothing or maxFeatures=0 (better) and your MLJModelInterface.fit method will then respond appropriately. (Union{Nothing, _ } types are fine for your struct but avoid them if it's easy to do so.)

Per-sample weights or class-weights w are passed along with training data to MLJModelInterface.fit, as in MLJModelInterface(model, verbosity, X, y, w).

  1. Even more hard, the algorithm that I am trying to wrap automatically performs a regression or a classification task (and, in the later case, it returns a probability distribution) depending on the type of the label, with the option to override the task with forceClassification. As in ML there are different type of models, probabilistic and deterministic, which one do I choose ? Or should I wrap it as two separate MLJ models ? [SOLVED: I created different MLJ models]

Yes! More models is the MLJ way (so XGBoost is wrapped as three or four separate MLJ models).

  1. Most of my models support Missing data in the input. I read that Missing is a scientific type per se. Should I declare an Union of supported types then, including the Missing ? [UNSOLVED, but later step]

Yes. For example, if for an ordinary regression target with missing values, the target_scitype is AbstractArray{<:Union{Missing, Continuous}}. But yes, you can worry about this at the end.

  1. I have a case where my model doesn't fit the fit/predict workflow, that is a model that (using GMM/EM) predicts the missing values in a matrix, based on the degree of similarities of the other elements of the columns to the other rows. How to I wrap it with MLJ ? [UNSOLVED, but later step]

This sounds like an unsupervised model (aka "transformer") yes? These are handled much the same way: docs for user docs for implementing. examples

  1. Where can I find real-case examples ? For example, DecisionTrees.jl seems to be available through MLJ, but there is no code in the GitHub repo concerning MLJ.. [PARTIALLY SOLVED, it is in a specific interface repo, but would be nice so see an example embedded in the same model repository]

Examples of "native" implementations of the MLJ model API: NearestNeighborModels, EvoTrees, MLJLinearModels, MLJFlux, LightGBM (the IQVIA one), PartialLeastSquaresRegressor, ParallelKMeans (unsupervised)

  1. My models live on submodules of the package. Will this be a problem for automatic discovery of the package/algorithm conditional to the user data ? [UNSOLVED, but later step]

No. They can live anywhere. The load_path trait you define will allow MLJ to find it. Get me to double check this when you define it as it is critical a bit tedious to correct.

  1. While I did somehow managed to write the MLJ interface for a deterministic model, I am trying to write the interface for a probabilistic model whose predict(model,X) method returns a vector of dictionary of label => prob. I normally use arrays of T for the Y, but I saw that it works also with Y being a CategoricalArray. However I am stuck here now, and don't know hot to return the prediction in the format wanted by MLJ. [SOLVED, but it was a pain]

The key is make sure that the predictions of classfiers (probabilistic or deterministic) "know" about all classes, and not just those seen in training. This is possible because training data is a categorical array which tracks all the classes, even after subsampling. But this does involve some gymnastics which might have been better designed.

  1. Should I name my Decision Trees DecisionTree, so that the user then select the desired one with the pkg keyword (there are already two available in MLJ) or would be preferable a more specific name, like BetaMLDecisionTree ?

I suggest that you use DecisionTree unless you are implementing something different from CART.

Thank you!

Your welcome.

(PS: the "Simple User Defined Models" still refers to MLJBase rather than MLJModelInterface. )

sylvaticus commented 3 years ago

Thank you :-) I'll work on your comments.

I am not sure I understood your side question, as DecisionTree.jl models seems to me that they work with any categorical feature (like BetaML trees/forests), not necessarily ordered ones:

data = Union{Float64,String}[
    0.9 0.6 "black" "monitor"
    0.3 0.2 "white" "paper sheet"
    4.0 2.2 "grey" "car"
    0.6 0.5 "white" "monitor"
    3.8 2.1 "red" "car"
    0.3 0.2 "red" "paper sheet"
    0.1 0.1 "white" "paper sheet"
    0.3 0.2 "black" "monitor"
    0.1 0.2 "black" "monitor"
    0.31 0.2 "white" "paper sheet"
    3.2 1.9 "grey" "car"
    0.4 0.25 "white" "paper"
    0.9 0.4 "black" "monitor"
]

X = data[:,[1,2,3]]
y = convert(Vector{String},  data[:,4])
((xtrain,xtest),(ytrain,ytest)) = Utils.partition([X,y],[0.7,0.3])
ytrain = dropdims(ytrain,dims=2)
ytest = dropdims(ytest,dims=2)

n_subfeatures=-1; n_trees=30; partial_sampling=1; max_depth=26
min_samples_leaf=2; min_samples_split=2; min_purity_increase=0.0; seed=3
model = build_forest(ytrain, xtrain,
                     n_subfeatures,
                     n_trees,
                     partial_sampling,
                     max_depth,
                     min_samples_leaf,
                     min_samples_split,
                     min_purity_increase;
                     rng = seed)

modelβ = buildForest(xtrain,ytrain,5)

ŷtest  = apply_forest(model,xtest)
ŷtestβ = Trees.predict(modelβ,xtest)

acc  = accuracy(ŷtest,ytest)
accβ = accuracy(ŷtestβ,ytest)
sylvaticus commented 3 years ago

Hello, I did add the metadata.

While I do local testing using the fit/predict and the machine/evaluate workflow, how do I test the MLJ interface for the automatic model discovery and that the @load macro work with my models ?

sylvaticus commented 3 years ago

It seems that adding the metadata_pkg in the way I did does create a CI test error in Julia > 1.3:

ERROR: LoadError: Evaluation into the closed module `Trees` breaks incremental compilation because the side effects will not be permanent. This is likely due to some other module mutating `Trees` with `eval` during precompilation - don't do this.
Stacktrace:
 [1] eval at ./boot.jl:331 [inlined]
 [2] eval(::Expr) at /home/runner/work/BetaML.jl/BetaML.jl/src/Trees.jl:44
 [3] metadata_pkg(::Type{T} where T; name::String, uuid::String, url::String, julia::Bool, license::String, is_wrapper::Bool, package_name::String, package_uuid::String, package_url::String, is_pure_julia::Bool, package_license::String) at /home/runner/.julia/packages/MLJModelInterface/McKyQ/src/metadata_utils.jl:73
 [4] #31 at ./broadcast.jl:1245 [inlined]
 [5] _broadcast_getindex_evalf at ./broadcast.jl:648 [inlined]
 [6] _broadcast_getindex at ./broadcast.jl:621 [inlined]
 [7] #19 at ./broadcast.jl:1046 [inlined]
 [8] macro expansion at ./ntuple.jl:50 [inlined]
 [9] ntuple at ./ntuple.jl:45 [inlined]
 [10] copy at ./broadcast.jl:1046 [inlined]
 [11] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.Style{Tuple},Nothing,Base.Broadcast.var"#31#32"{Base.Iterators.Pairs{Symbol,Any,NTuple{6,Symbol},NamedTuple{(:name, :uuid, :url, :julia, :license, :is_wrapper),Tuple{String,String,String,Bool,String,Bool}}},typeof(MLJModelInterface.metadata_pkg)},Tuple{NTuple{4,DataType}}}) at ./broadcast.jl:837
 [12] top-level scope at /home/runner/work/BetaML.jl/BetaML.jl/src/BetaML.jl:53
 [13] include(::Function, ::Module, ::String) at ./Base.jl:380
 [14] include(::Module, ::String) at ./Base.jl:368
 [15] top-level scope at none:2
 [16] eval at ./boot.jl:331 [inlined]
 [17] eval(::Expr) at ./client.jl:467
 [18] top-level scope at ./none:3
in expression starting at /home/runner/work/BetaML.jl/BetaML.jl/src/BetaML.jl:53
ERROR: LoadError: LoadError: Failed to precompile BetaML [024491cd-cc6b-443e-8034-08ea7eb7db2b] to /home/runner/.julia/compiled/v1.5/BetaML/8OYl3_wetbZ.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1305
 [3] _require(::Base.PkgId) at ./loading.jl:1030
 [4] require(::Base.PkgId) at ./loading.jl:928
 [5] require(::Module, ::Symbol) at ./loading.jl:923
 [6] include(::String) at ./client.jl:457
 [7] top-level scope at /home/runner/work/BetaML.jl/BetaML.jl/test/runtests.jl:14
 [8] include(::String) at ./client.jl:457
 [9] top-level scope at none:6
in expression starting at /home/runner/work/BetaML.jl/BetaML.jl/test/Utils_test.jl:4
in expression starting at /home/runner/work/BetaML.jl/BetaML.jl/test/runtests.jl:13
ERROR: Package BetaML errored during testing
Stacktrace:
 [1] pkgerror(::String) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/Types.jl:52
 [2] test(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}; coverage::Bool, julia_args::Cmd, test_args::Cmd, test_fn::Nothing) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/Operations.jl:1578
 [3] test(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}; coverage::Bool, test_fn::Nothing, julia_args::Cmd, test_args::Cmd, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/API.jl:327
 [4] #test#61 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/API.jl:67 [inlined]
 [5] test(; name::Nothing, uuid::Nothing, version::Nothing, url::Nothing, rev::Nothing, path::Nothing, mode::Pkg.Types.PackageMode, subdir::Nothing, kwargs::Base.Iterators.Pairs{Symbol,Bool,Tuple{Symbol},NamedTuple{(:coverage,),Tuple{Bool}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Pkg/src/API.jl:80
 [6] top-level scope at none:1
Error: Process completed with exit code 1.

I don't have this error when testing locally, and if I remove the MLJ interface it works: https://github.com/sylvaticus/BetaML.jl/actions/runs/603243482

It seems related to an eval call on metadata_utils.jl, a lillte bit like in this issue.

How (where) should I then add the MLJ metadata for the interface ?

ablaom commented 3 years ago

It seems related to an eval call on metadata_utils.jl, a lillte bit like in this issue. How (where) should I then add the MLJ metadata for the interface ?

I think this is essentially a bug in the metadata_utils.jl macro which turns up when your interface code is wrapped in a module. Is that what you are doing? We should raise an issue. I've raised an issue. In the meantime, if you could define all the metadata without the macro, as in this example. See also these docs.

edit Or, see workaround discussed in next comment.

How do I test the MLJ interface for the automatic model discovery and that the @load macro work with my models ?

This is a bit tricky to do yourself. Point me to your code, I'll check the load_path looks right and test discovery by adding your package to the model registry (locally until tested). It's easiest if you tag a new release for this, but in principle I can use a branch to test first.

Re: side question

as DecisionTree.jl models seems to me that they work with any categorical feature (like BetaML trees/forests), not necessarily ordered ones:

Yes it will "work" but an ordering is implicitly used, which means you don't get the optimal splitting at each node, only splittings based on a separation of the classes consistent with the ordering; see this issue. It is for this reason that we restricted the input_scitype to tables with Continuous or OrderedFactor columns in our MLJ wrap. This discussion is similarly relevant to your choice of input_scitype.

You can get the optimal split using an ordering depending on the node; this ordering is defined (in case of regression) by the mean value of the target on each class, for training data that arrives at the given node. Breiman says somewhere that this works (finds the optimal split), but I can't remember if I found a proof. I implemented this here but this is very old poorly tested code 😢

ablaom commented 3 years ago

Okay, see also this issue which states: "A quirk of submodules and evaluation scopes makes it necessary to load this submodule in the package init function."

sylvaticus commented 3 years ago

Hello, yep, wrapping the metadata_pkg function within __init__ did work. The MLJ interface is on https://github.com/sylvaticus/BetaML.jl/blob/master/src/BetaML.jl for metadata_pkg and on https://github.com/sylvaticus/BetaML.jl/blob/master/src/Trees_MLJ.jl for the specific Trees models (the only one wrapped for now).

Concerning the nature of categorical features allowed, you was correct. DecisionTrees.jl models work only with features that are sortable. BetaML.Trees model was also using sort for categorical values, but only for an optimisation trick: the elements are searched individually, but at time of then making the split of the node, it was faster to sort the feature once and for each possible value use searchsort() that to loop over all records for all the possible values. I now added a check that if the feature is sortable, I still employ the trick (it shoudn't have any influence on the output), if it is not, I loop over all the values. It will be rather slow, bot honestly features that are not sortable are pretty rare (if one would really optimise this, could maybe add a random number associated to each record to make even multiclass entries sortable). A test concerning the new functionality is here.

So, yes, please let me know if the interface registration is working. I will now work to the hyper-parameters range definition, but before moving to wrap the other models, I would like to have a feedback from you :-)

EDIT: What hyperparameter_ranges is for ? Is it really needed ? On the documentation I see that it is the user (as it seems more natural, again depending on problem/data...) to choose the model (hyper) parameters ranges to look over..

ablaom commented 3 years ago

Your metadata looks good to me. Are you willing to tag a new patch release? That will make testing the discoverability easier. I could hold off releasing the updated registry until your say-so, if you like.

So, yes, please let me know if the interface registration is working. I will now work to the hyper-parameters range definition, but before moving to wrap the other models, I would like to have a feedback from you :-)

EDIT: What hyperparameter_ranges is for ? Is it really needed ? On the documentation I see that it is the user (as it seems more natural, again depending on problem/data...) to choose the model (hyper) parameters ranges to look over..

You can just ignore this one for now. The idea is you specify default ranges over which to optimise each hyperparamter. Caret has this for all their models, which I understand is dearly loved, so we added this trait for future use.

You could implement it, but it would mean adding as a MLJBase as a dependency, because that's where the ParamRange objects currently live. But I don't really recommend this.

sylvaticus commented 3 years ago

Hello, I just issued a new release. On top of the Decision Tree / Random Forests models (DecisionTreeClassifier, DecisionTreeRegressor, RandomForestClassifier and RandomForestRegressor) I added the following perceptron-like classifiers: PerceptronClassifier, KernelPerceptronClassifier, PegasosClassifier (with multi-class support).

I am well aware that they are well obsolete now, but they could be interesting for historical reasons or to allow comparisons with more performant/new models.

ablaom commented 3 years ago

Congratulations on the new release 🎉 . And many thanks for implementing those extra models.

I will add your package to the registry and test discoverability of the models shortly.

ablaom commented 3 years ago

Worked first time. I don't think that ever happened before!

julia> using MLJ

julia> models("BetaML")
7-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_class_weights, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = DecisionTreeClassifier, package_name = BetaML, ... )    
 (name = DecisionTreeRegressor, package_name = BetaML, ... )     
 (name = KernelPerceptronClassifier, package_name = BetaML, ... )
 (name = PegasosClassifier, package_name = BetaML, ... )         
 (name = PerceptronClassifier, package_name = BetaML, ... )      
 (name = RandomForestClassifier, package_name = BetaML, ... )    
 (name = RandomForestRegressor, package_name = BetaML, ... )   

This goes live after I tag a new MLJModels release.

ablaom commented 3 years ago

https://github.com/alan-turing-institute/MLJModels.jl/pull/364

ablaom commented 3 years ago

New models become live when https://github.com/JuliaRegistries/General/pull/31325 merges.

Thanks again for this contribution.

Am closing but feel free to add API-related queries to this thread.