Curated list of models

juliohm commented 3 years ago

I am opening this issue to discuss the possibility of a curated list of models.

Right now end-users are forced to rely on a non-trivial macro @load that fails depending on the scope (local vs. global) and can be considered advanced for newcomers.

My opinion is that a curated list should be the recommended workflow where users don't need to bother installing dependencies manually:

using MLJ

# well-tested models available
m1 = DecisionTreeClassifier()
m2 = KNeighborsClassifier()
...

This curated list could be made a dependency of the umbrella package. I don't think users would complain about too many dependencies given that any modern ML pipeline nowadays runs dozens of models at least.

cc: @DilumAluthge

DilumAluthge commented 3 years ago

> This curated list could be made a dependency of the umbrella package. I don't think users would complain about too many dependencies given that any modern ML pipeline nowadays runs dozens of models at least.

Can you clarify what the "umbrella package" is?

If the "umbrella package" is MLJ.jl, then I would definitely complain. I don't want ] add MLJ to install the entire kitchen sink.

What's wrong with asking users to install MLJCuratedModels.jl if they want the curated list?

DilumAluthge commented 3 years ago

> If the "umbrella package" is MLJ.jl, then I would definitely complain. I don't want ] add MLJ to install the entire kitchen sink.

For example, the ensemble functionality lives inside MLJ.jl. I would be quite annoyed if I had to install a whole bunch of unrelated packages just so I could use MLJ's ensemble functionality.

DilumAluthge commented 3 years ago

Now, on the other hand, if we first moved ALL of the functionality out of MLJ.jl into other repos, then I would have no problem adding a whole bunch of dependencies to MLJ.jl.

But as long as there is functionality in MLJ.jl that is not available in another package (MLJBase.jl, etc.), then I am opposed to adding lots of dependencies to MLJ.jl.

DilumAluthge commented 3 years ago

So I guess the two options are:

1. Keep MLJ.jl the way it is, and put the curated list in a separate MLJCuratedModels.jl package.
2. Move ALL of the actual features/functionality out of MLJ.jl into separate packages. Once this process is done, we can add MLJCuratedModels.jl as a dependency of MLJ.jl.

ablaom commented 3 years ago

For the record, MLJ is not intended to load any code, but still has the ensemble.jl stuff. The plan has always been to remove this. Maybe there are a few other small things too, I forget.

Also, @load has been recently improved to eliminate some possible strange behaviour. And - after https://github.com/alan-turing-institute/MLJModels.jl/issues/244 is complete (almost there!) - @load should work from within packages for any model (only KNN models still use Requires.jl).

I very much like @DilumAluthge's proposal https://github.com/alan-turing-institute/MLJModels.jl/pull/346 to address the beginner's problem.

@juliohm What do you think?

ablaom commented 3 years ago

Also, if you want to load a model directly (no macros), you can call load_path to find out where it lives:

julia> load_path("PCA")
"MLJMultivariateStatsInterface.PCA"

julia> load_path("RandomForestRegressor")
ERROR: ArgumentError: Ambiguous model name. Use pkg=... .
The model RandomForestRegressor is provided by these packages:
 ["DecisionTree", "ScikitLearn"].

Stacktrace:
 [1] info(::String; pkg::Nothing) at /Users/anthony/.julia/packages/MLJModels/GyILf/src/model_search.jl:80
 [2] load_path(::String; pkg::Nothing) at /Users/anthony/.julia/packages/MLJModels/GyILf/src/loading.jl:32
 [3] load_path(::String) at /Users/anthony/.julia/packages/MLJModels/GyILf/src/loading.jl:32
 [4] top-level scope at REPL[16]:1

julia> load_path("RandomForestRegressor", pkg="ScikitLearn")
"MLJScikitLearnInterface.RandomForestRegressor"

julia> using MLJScikitLearnInterface

julia> import MLJScikitLearnInterface.RandomForestRegressor

julia> RandomForestRegressor()
RandomForestRegressor(
    n_estimators = 100,
    criterion = "mse",
    max_depth = nothing,
    min_samples_split = 2,
    min_samples_leaf = 1,
    min_weight_fraction_leaf = 0.0,
    max_features = "auto",
    max_leaf_nodes = nothing,
    min_impurity_decrease = 0.0,
    bootstrap = true,
    oob_score = false,
    n_jobs = nothing,
    random_state = nothing,
    verbose = 0,
    warm_start = false,
    ccp_alpha = 0.0,
    max_samples = nothing) @245

juliohm commented 3 years ago

My concern is twofold:

(1) We still need manual intervention to get a new model into an existing session. This could be addressed by having @load offer a yes/no installation prompt whenever a package is missing, so that the user can just press ENTER.

(2) We have too many implementations of the same model, and users don't know which one to use. This could be solved with a curated list of the "best" well-maintained, pure-Julia implementations. For example, DecisionTree.jl is quite mature now, so it doesn't make much sense to load sklearn trees or tree implementations from other languages. I guess we can find similar examples where a single best pure-Julia implementation could be promoted to new Julia users.

Keep in mind that a beginner just wants to load a decision tree, no matter where it comes from or how it is implemented internally. They just want something well-tested that works.
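To make the two points concrete, here is a rough sketch in plain Julia with no MLJ dependency. Everything in it is hypothetical and only for illustration: the names CURATED, PROVIDERS, resolve_pkg and confirm are not part of MLJ, and the table entries are assumptions rather than an official list (the RandomForestRegressor providers are the two reported by load_path earlier in this thread). It shows how a curated default could break a tie between providers, and how a prompt could treat ENTER as "yes".

```julia
# Hypothetical stand-in for the model registry: model name => providing packages.
const PROVIDERS = Dict(
    "DecisionTreeClassifier" => ["DecisionTree"],
    "RandomForestRegressor"  => ["DecisionTree", "ScikitLearn"],
)

# Hypothetical curated defaults: the preferred provider for each model name.
const CURATED = Dict(
    "DecisionTreeClassifier" => "DecisionTree",
    "RandomForestRegressor"  => "DecisionTree",
)

# Pick the package for a model name. An explicit pkg wins, then a unique
# provider, then the curated default. Otherwise the name stays ambiguous.
function resolve_pkg(name::AbstractString; pkg::Union{Nothing,AbstractString}=nothing)
    providers = get(PROVIDERS, name, String[])
    isempty(providers) && throw(ArgumentError("Unknown model name: $name"))
    if pkg !== nothing
        pkg in providers || throw(ArgumentError("$pkg does not provide $name"))
        return String(pkg)
    end
    length(providers) == 1 && return first(providers)
    haskey(CURATED, name) && return CURATED[name]
    throw(ArgumentError("Ambiguous model name. Use pkg=... ."))
end

# Ask a yes/no question. An empty answer (just pressing ENTER) counts as yes.
function confirm(input::IO=stdin, output::IO=stdout; question="Install missing package?")
    print(output, question, " [Y/n] ")
    answer = lowercase(strip(readline(input)))
    return answer in ("", "y", "yes")
end
```

With these definitions, resolve_pkg("RandomForestRegressor") falls back to the curated entry instead of raising the ambiguity error, while resolve_pkg("RandomForestRegressor", pkg="ScikitLearn") still honours an explicit choice. The actual installation step (for example via Pkg.add) is left out on purpose; a real @load would have to decide where and how to install.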

juliohm commented 3 years ago

> For the record, MLJ is not intended to load any code, but still has the ensemble.jl stuff. The plan has always been to remove this. Maybe there are a few other small things too, I forget.

I fully support this idea. MLJ.jl would then provide a more user-friendly installation for users who are not writing packages, but are writing ML pipelines to solve their problems with various models from a curated list. Advanced users seeking a more lightweight dependency for their own packages could use subpackages of the MLJ.jl stack such as MLJBase.jl or MLJModelInterface.jl, and possibly an MLJEnsemble.jl.

In summary, one must always keep in mind two types of users:

1. Users who want to write ML pipelines with well-tested and readily available models, who don't care about a long list of dependencies in their final application or Pluto notebook.
2. Package writers who want to interface with the MLJ stack and use a subset of the functionality found in subpackages like MLJBase.jl, but cannot afford a dependency on model packages like DecisionTree.jl.

JuliaAI / MLJ.jl

Curated list of models #716