Closed: ablaom closed this issue 4 years ago.
MLJ Workflows document referenced above: https://github.com/alan-turing-institute/MLJ.jl/blob/matching/docs/src/common_mlj_workflows.ipynb
> The constructors horizontally split the data into source and target, with optional coercion of data to get the right scitypes.

The constructors of the task? The newly proposed interface also requires data?
I would like to contrast the proposal with what I am currently using in GeoStats.jl, to bring more options to the table. The idea of matching models to tasks is implemented in our case via a trait called `iscompatible(task, model)`, as shown here: https://github.com/juliohm/GeoStatsBase.jl/blob/master/src/learning/traits.jl#L11-L16. Notice also the granularity of the tasks there. I have in mind very specific tasks like `regression`, `classification`, `clustering`, and other less common tasks that make sense in specific domains of application. The tasks proposed above seem to be at the level of `supervised` and `unsupervised`, which is too general a level.
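The trait-based matching described here can be sketched in a few lines (all type names below are hypothetical; the real definitions live in the linked `traits.jl`):

```julia
# Sketch of a compatibility trait in the spirit of GeoStatsBase's
# iscompatible(task, model). All names here are hypothetical.
abstract type AbstractTask end
struct RegressionTask <: AbstractTask end
struct ClusteringTask <: AbstractTask end

abstract type AbstractModel end
struct MyRegressor <: AbstractModel end

# Fallback: a model is incompatible with a task unless it opts in.
iscompatible(::AbstractTask, ::AbstractModel) = false

# Each model declares the fine-grained tasks it supports.
iscompatible(::RegressionTask, ::MyRegressor) = true

iscompatible(RegressionTask(), MyRegressor())   # true
iscompatible(ClusteringTask(), MyRegressor())   # false
```

The fallback-plus-opt-in pattern keeps the default safe: a model author adds one method per supported task, and everything else is rejected automatically.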
I think having tasks as first class in the framework is quite important, to be able to fully specify a learning problem without talking about models and solution strategies. See for example the spatial learning problem defined here: https://github.com/juliohm/GeoStatsBase.jl/blob/master/src/problems/learning_problem.jl I can define this problem once with a given task type (which can be a DAG of tasks), and solve the same problem with multiple models that are compatible with the task. For example, I can define a DAG of tasks and solve it in parallel with learning networks from MLJ, provided the learning network is compatible with the DAG task.
My opinion is that we will not be able to exploit the full capabilities of the Julia language without a clear separation between ML problems and ML solution strategies. The separation requires a method to fully specify the ML problem, and in turn this method relies on the notion of learning tasks as described above. Can you please share your thoughts?
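The problem/strategy separation argued for here can be sketched as follows (hypothetical names, loosely following the linked `learning_problem.jl`): the problem object carries only data and a task; models enter the picture only at solve time, gated by the compatibility trait.

```julia
# Hypothetical sketch: a learning problem is data plus a task, with no model.
struct RegressionTask end
struct MyRegressor end

struct LearningProblem{D,T}
    data::D
    task::T
end

iscompatible(task, model) = false                 # fallback trait
iscompatible(::RegressionTask, ::MyRegressor) = true

# A solution strategy (model) is supplied separately, and is checked
# against the task before the problem is solved.
function solve(problem::LearningProblem, model)
    iscompatible(problem.task, model) ||
        error("model is not compatible with the problem's task")
    # ... fit `model` to `problem.data` here ...
    return model
end

problem = LearningProblem([1.0, 2.0, 3.0], RegressionTask())
solve(problem, MyRegressor())   # succeeds; an incompatible model would error
```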
I am trying to understand how the discussion in #166 influenced this new design. At first glance, the new design doesn't seem to solve the issues raised there.
Continuing on this discussion. Notice that having tasks as first class allows an interface like the following, with two general functions called `learn` (a generalization of `fit`) and `perform` (a generalization of `predict`):

```julia
learned_model = learn(task, data, model)
perform(task, data, learned_model)
```
This is implemented here and works fine: https://github.com/juliohm/GeoStatsBase.jl/blob/master/src/learning/models.jl
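A minimal runnable sketch of this pattern (toy model and hypothetical task type; the real implementations are in the linked `models.jl`):

```julia
# Toy sketch: `learn`/`perform` dispatch on a task object, generalizing
# `fit`/`predict`. All names are hypothetical.
struct RegressionTask
    target::Symbol
end

struct MeanModel end   # toy model: always predicts the training-target mean

# learn: generalization of fit; the task identifies the target column
function learn(task::RegressionTask, data, ::MeanModel)
    y = data[task.target]
    return sum(y) / length(y)    # the learned state is just the mean
end

# perform: generalization of predict; reuses the learned state on new data
function perform(::RegressionTask, data, learned)
    n = length(first(values(data)))
    return fill(learned, n)
end

train = Dict(:x => [1.0, 2.0, 3.0], :y => [2.0, 4.0, 6.0])
task = RegressionTask(:y)
learned = learn(task, train, MeanModel())
perform(task, Dict(:x => [10.0, 20.0]), learned)   # [4.0, 4.0]
```

Because the task is an argument rather than baked into the model, the same `learned` state can in principle be reused with a different task object, which is the composability point made above.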
Now we can generalize applications and consider cases where someone learns a task with some data and model, and then performs a different task with the learned model.
These are just simple examples of the composability that will be missed without tasks as first class.
> The constructors of the task? The newly proposed interface also requires data?
Sorry for any confusion. The proposal in this issue is not a task interface proposal, although I discuss tasks for context. The current issue presents two low-level methods (that might or might not be used as part of a task interface): the `matching` method is just a `Bool`-valued function on model/data pairs, to assist in identifying whether a model's data type requirements are met by the specified data, and the `unpack` method (already implemented) is just a tabular data manipulation tool.
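The intended shape of `matching` can be illustrated with a small sketch (hypothetical names throughout; ordinary Julia types stand in for scientific types, and the registry is hand-rolled):

```julia
# Sketch: `matching` as a Bool-valued test on model/data pairs, with a
# curried form usable as models(matching(X, y)).
struct ModelEntry
    name::String
    is_supervised::Bool
    input_type::Type
    target_type::Type
end

matching(m::ModelEntry, X) = !m.is_supervised && X isa m.input_type
matching(m::ModelEntry, X, y) =
    m.is_supervised && X isa m.input_type && y isa m.target_type

matching(X, y) = m -> matching(m, X, y)   # curried version

# models(test): every registry entry for which test(entry) is true
const REGISTRY = [
    ModelEntry("RidgeRegressor", true,  AbstractMatrix, AbstractVector{<:Real}),
    ModelEntry("KMeans",         false, AbstractMatrix, Nothing),
]
models(test) = filter(test, REGISTRY)

X, y = rand(5, 2), rand(5)
[m.name for m in models(matching(X, y))]   # ["RidgeRegressor"]
```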
I am responding to your other points in #166, which I have re-opened.
After some lengthy reflection, I would like to provide an alternative to abstract tasks, which so far are used to match models to machine learning tasks. It is not that I object to a formal task interface per se, but that I feel:
The current functionality provided by tasks can be obtained in a more transparent and flexible way with no greater method footprint (see proposal below).
If a formal task interface offers advantages elsewhere in the interface (for example, in benchmarking), these are not currently obvious to me.
Perhaps formal tasks might have made more sense if integrated with the composition interface. However, the current composition interface neither requires such a notion nor is likely to benefit from one, as far as I can see. The mlr3 composition interface is task-based; however, last time I checked, it offered no extra functionality and was even lacking in key areas (for example, no target transformations).
There have also been some objections to tasks wrapping data (#166).
Proposal
The current task API can stay, but would ultimately be deprecated, or perhaps replaced by something better, built on what is proposed below.
Currently one searches for models by constructing a formal task with the `supervised` or `unsupervised` constructors, and doing `models(task)` to get all the models. The constructors horizontally split the data into source and target, with optional coercion of data to get the right scitypes.

In my proposal, one retains the `models` method, but it is re-purposed: `models(test)` returns all the metadata model entries `model` for which `test(model)` is true. Such an entry includes all model traits, and so very arbitrary searches are possible.

Secondly, to streamline the common search case, I would introduce the `matching` method, defined as follows:
- `matching(model, X) == true` exactly when `model` is unsupervised and admits inputs with the scientific types of `X`.
- `matching(model, X, y) == true` exactly when `model` is supervised and admits inputs and targets with the scientific types of `X` and `y`, respectively.
- `matching(model, X, y, w) == true` exactly when `model` is supervised and supports sample weights, and furthermore admits inputs and targets with the scientific types of `X` and `y`, respectively.

The method has curried versions, e.g., `matching(X, y)`, so that one can do `models(matching(X, y))` to get all supervised models compatible with the data `(X, y)`. To get all such models that are additionally probabilistic, one does `models(matching(X, y), model -> model.probabilistic_type == :probabilistic)` or similar. For sample workflows, see the notebook referenced in the comment immediately following.

Notes:
- None of this design is tied to any particular (e.g., tabular) data format.
- `models(task)` can still be allowed to work with the abstract tasks as currently implemented.
- The horizontal split and type coercion functions of the existing `MLJTask` constructors (which are table-specific) can be achieved instead with the `unpack` method, which is already available in the latest tagged release of MLJ. For example, given a DataFrame `channing` with fields `:Exit`, `:Time`, `:Entry`, `:Cens`, the table can be split into any number of pieces, one for each test provided as an argument (unlike the task constructors).

General feedback is welcome, but I am particularly interested in feedback on the `matching` method. I am planning a PR for merge next week.
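As a sketch of how such an `unpack` call might behave (hypothetical toy implementation operating on a Dict of columns; the real method ships with MLJ and works on general tables, with optional scitype coercions):

```julia
# Toy sketch of `unpack` semantics: split a table into one piece per test,
# each test selecting, in order, the remaining columns it matches.
function unpack(table::AbstractDict, tests...)
    remaining = collect(keys(table))
    pieces = []
    for test in tests
        cols = filter(test, remaining)
        remaining = setdiff(remaining, cols)
        push!(pieces, Dict(c => table[c] for c in cols))
    end
    return Tuple(pieces)
end

# A channing-like table with fields :Exit, :Time, :Entry, :Cens:
channing = Dict(:Exit  => [1.0, 2.0], :Time => [3.0, 4.0],
                :Entry => [5.0, 6.0], :Cens => [0.0, 1.0])

y, X = unpack(channing, ==(:Exit), !=(:Time))
# y holds only :Exit; X holds :Entry and :Cens, while :Time is dropped
```

Providing three tests instead of two would yield three pieces, which is the "any number of pieces" flexibility noted above.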