JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Method to match models to learning tasks, without formal task objects #236

Closed ablaom closed 4 years ago

ablaom commented 5 years ago

After some lengthy reflection, I would like to provide an alternative to abstract tasks, which so far are used to match models to machine learning tasks. It is not that I object to a formal task interface per se, but that I feel:

  1. The current functionality provided by tasks can be obtained in a more transparent and flexible way with no greater method footprint (see proposal below).

  2. If a formal task interface offers advantages elsewhere in the interface (for example, in benchmarking), these are not currently obvious to me.

Perhaps formal tasks might have made more sense if integrated with the composition interface. However, the current composition interface neither requires such a notion nor is likely to benefit from one, as far as I can see. The MLR3 composition interface is task-based. However, last time I checked, this interface has no extra functionality and is even lacking in key areas (for example, no target transformations).

There have also been some objections to tasks wrapping data (#166).

Proposal

The current task API can stay, but would ultimately be deprecated, or perhaps replaced by something better, built on what is proposed below.

Currently one searches for models by constructing a formal task with the supervised or unsupervised constructors, and calling models(task) to get all compatible models. The constructors horizontally split the data into source and target, with optional coercion of the data to the right scitypes.
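Roughly, the existing pattern looks like this (the constructor keyword names below are approximate, for illustration only):

task = supervised(data=channing, target=:Exit)   # keyword names approximate
models(task)                                     # all models compatible with the task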

In my proposal, the models method is retained but re-purposed: models(test) returns all model metadata entries model for which test(model) is true. Such an entry includes all the model's traits, so quite arbitrary searches are possible.
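For example, one could filter on any combination of traits (the traits named below, is_supervised and is_pure_julia, illustrate the kind of fields an entry exposes):

models(model -> model.is_supervised && model.is_pure_julia)   # supervised models implemented in pure Julia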

Secondly, to streamline the common search case, I would introduce the matching method.
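Here is a minimal sketch of the intended semantics, expressed with standard scitype traits (illustrative rather than a final implementation):

using ScientificTypes   # provides scitype

# matching(model, X, y): true exactly when the supervised model's declared
# scitype requirements are met by the data (X, y)
matching(model, X, y) =
    model.is_supervised &&
    scitype(X) <: model.input_scitype &&
    scitype(y) <: model.target_scitype

# curried form, suitable for passing directly to models:
matching(X, y) = model -> matching(model, X, y)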

The method has curried versions, e.g., matching(X, y), so that one can do models(matching(X, y)) to get all supervised models compatible with the data (X, y). To get all such models that are additionally probabilistic, one does models(matching(X, y), model -> model.prediction_type == :probabilistic) or similar. For sample workflows, see the notebook referenced in the comment immediately following.

Notes:

The horizontal-split and type-coercion functionality of the existing MLJTask constructors (which are table-specific) can instead be achieved with the unpack method, which is already available in the latest tagged release of MLJ. Here's a usage example:

y, X = unpack(channing,
              ==(:Exit),            # y is the :Exit column
              !=(:Time);            # X is the rest, except :Time
              :Exit=>Continuous,    # coerce :Exit to Continuous
              :Entry=>Continuous,   # coerce :Entry to Continuous
              :Cens=>Multiclass)    # coerce :Cens to Multiclass

Here channing is a DataFrame with fields :Exit, :Time, :Entry and :Cens. Unlike with the task constructors, the table can be split into any number of pieces, one for each test provided as an argument.
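For instance, a hypothetical three-way split of the same table:

y, w, X = unpack(channing,
                 ==(:Exit),     # y gets the :Exit column
                 ==(:Cens),     # w gets the :Cens column
                 !=(:Time))     # X gets everything else, except :Time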

General feedback is welcome, but I am particularly interested in feedback on the matching method. I am planning a PR for merge next week.

ablaom commented 5 years ago

MLJ Workflows document referenced above: https://github.com/alan-turing-institute/MLJ.jl/blob/matching/docs/src/common_mlj_workflows.ipynb

juliohm commented 5 years ago

The constructors horizontally split the data into source and target, with optional coercion of data to get the right scitypes.

The constructors of the task? Does the newly proposed interface also require data?

I would like to contrast the proposal with what I am currently using in GeoStats.jl, to bring more options to the table. The idea of matching models to tasks is, in our case, implemented via a trait called iscompatible(task, model), as shown here: https://github.com/juliohm/GeoStatsBase.jl/blob/master/src/learning/traits.jl#L11-L16 Notice also the granularity of the tasks there. I have in mind very specific tasks like regression, classification, clustering, and other less common tasks that make sense in specific application domains. It seems like the tasks proposed above are at the level of supervised and unsupervised, which is too general a level.
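Schematically, the pattern is a Boolean trait with a conservative fallback (the task and model types below are illustrative, not the exact source):

# fallback: a model is not compatible with a task unless declared so
iscompatible(task::AbstractLearningTask, model) = false

# hypothetical fine-grained declarations:
iscompatible(task::RegressionTask, model::MyRegressor) = true
iscompatible(task::ClusteringTask, model::MyClusterer) = true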

I think having tasks as first-class citizens in the framework is quite important for being able to fully specify a learning problem without talking about models and solution strategies. See for example the spatial learning problem defined here: https://github.com/juliohm/GeoStatsBase.jl/blob/master/src/problems/learning_problem.jl I can define this problem once with a given task type (which can be a DAG of tasks), and solve the same problem with multiple models that are compatible with the task. For example, I can define a DAG of tasks and solve it in parallel with learning networks from MLJ, provided the learning network is compatible with the DAG task.

My opinion is that we will not be able to exploit the full capabilities of the Julia language without a clear separation between ML problems and ML solution strategies. This separation requires a means to fully specify the ML problem, and that means in turn relies on the notion of learning tasks as described above. Can you please share your thoughts?

I am trying to understand how the discussion in #166 influenced this new design. At first glance, the new design doesn't seem to solve the issues raised there.

juliohm commented 5 years ago

Continuing this discussion: notice that having tasks as first-class citizens allows an interface built on two general functions, learn (a generalization of fit) and perform (a generalization of predict). These two functions are used as follows:

learned_model = learn(task, data, model)

and

perform(task, data, learned_model)

This is implemented here and works fine: https://github.com/juliohm/GeoStatsBase.jl/blob/master/src/learning/models.jl

Now we can generalize applications and consider cases where someone learns a task with some data and a model, and then performs a different task with the learned model.
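Schematically, with hypothetical task types and variable names:

# learn a task on a source dataset...
lmodel = learn(ClassificationTask(:features, :label), sourcedata, model)

# ...then perform a (possibly different, but compatible) task on a target dataset
perform(ClassificationTask(:features, :label), targetdata, lmodel)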

These are just simple examples of the composability that will be missed without tasks as first-class citizens.

ablaom commented 4 years ago

The constructors of the task? Does the newly proposed interface also require data?

Sorry for any confusion. The proposal in this issue is not a task-interface proposal, although I discuss tasks for context. The current issue presents two low-level methods (which might or might not be used as part of a task interface): The method matching is just a Bool-valued function on model/data pairs, to assist in checking whether a model's data type requirements are met by the specified data. The unpack method (already implemented) is just a tabular data manipulation tool.

I am responding to your other points in #166, which I have re-opened.