JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Task design discussion #166

Closed juliohm closed 4 years ago

juliohm commented 5 years ago

Dear all,

I've started reading MLJBase in an attempt to develop spatial models using the concept of tasks. Is it correct to say that the current implementation of tasks requires the existence of data?

I would like to specify tasks in a more general context without data. This is useful for example to define problems where the data are not just "tables", but have other interesting properties.

I would appreciate it if you could comment on how to split tasks from data, and how I can help with this split.

fkiraly commented 5 years ago

One more thing that may be helpful is the comparison to learning strategies: there are very strong arguments for modularizing these and abstracting them from the data in the first instance.

You may want to apply many strategies to the same data - that's the case in benchmarking or model building. You may want to apply the same strategy to many data sets - that's one of the key reasons to study statistical models.

This also holds when applying compositors and wrappers - if you apply grid search etc., a common design does not tie the data to them.

Though of course the "machine" design does do that a little bit.

jpsamaroo commented 5 years ago

Forgive me for jumping in without reading the thread in its entirety; it's a bit too large for me to wade through in a reasonable amount of time.

Let me first say that I agree with @juliohm that defining tasks shouldn't necessarily require attaching concrete data to them. Consider the following example use case: I wish to train an ML model that can be fed images from a video camera, and do things like differentiate between my face and the face of someone who isn't me, identify facial expressions, etc. (suffice it to say, a variety of different tasks). This camera is attached to an embedded Linux computer which has a very small amount of storage space, on the order of 512MB total, with most of that taken up by system libraries and other OS files. This combined device is designed to be user-trainable in real time for ease of use. Given the limited amount of space available on the device, it's not possible to store anything more than a few frames of "training" data. The data that will be "attached" to the task will be attached at runtime, as frames come in from the camera.

Now, let's not pick apart this specific example, as I'm sure someone can suggest some clever method to store or access lots of training data for this specific use case; the point is that we have a constrained environment where storing any amount of data is exceedingly expensive, and data is really only available in real-time, at runtime.

Here's my solution: do we need to store the actual data in the task struct when the task is created? As an alternative, why don't we simply store a placeholder object which can be replaced or modified at runtime to point to the data that the user has provided, when it is most convenient for them? This object could contain all the necessary qualities and dispatches to satisfy the MLJ task API, allowing things to work as normal, while taking up practically no physical space. Such a task would then be very cheap to construct and pass around, while only needing a single step to attach data to it before being used as normal (and it could throw an error if the user tries to use it for training before said data has been attached).
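A rough sketch of the kind of placeholder being proposed (all names here are hypothetical, not part of MLJ/MLJBase):

# Hypothetical sketch (not MLJ API): a task that carries a data slot which
# starts out empty and is filled at runtime, e.g. as camera frames arrive.
mutable struct LazyTask{T}
    target::Symbol             # variable to predict
    features::Vector{Symbol}   # variables to learn from
    data::Union{Nothing,T}     # placeholder until real data is attached
end

LazyTask{T}(target, features) where {T} = LazyTask{T}(target, features, nothing)

# Attach data when it becomes available (e.g. frame by frame at runtime).
attach!(task::LazyTask, data) = (task.data = data; task)

# Consumers error out if no data has been attached yet.
function getdata(task::LazyTask)
    task.data === nothing && error("no data attached to this task yet")
    return task.data
end

# Usage: construct cheaply now, attach data later.
task = LazyTask{NamedTuple}(:is_me, [:frame])
attach!(task, (frame = rand(Float32, 64, 64), is_me = true))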

What do you all think about this proposal? Would there be difficulties in implementing such a placeholder and all the required methods? And would this sort of approach suit your needs @juliohm?

juliohm commented 5 years ago

@jpsamaroo thank you for the follow-up. Your example with limited resources is one more case showing that storing the data in the task compromises the entire application of the framework.

I still hold the view that tasks are completely independent of data. From that point of view, I don't think a placeholder would make sense.

jpsamaroo commented 5 years ago

@juliohm by "placeholder", I only mean to say "some container with enough information about what the data will be so that various MLJ/MLJBase dispatches work properly".

That said, I can certainly see the case where it's not possible to even specify that much; in such a case, it's my opinion that such a structure shouldn't be a part of MLJ/MLJBase, because you're missing a vital piece of information for specifying a full-fat task. Of course, I'm not saying that such a structure/capability could never be a part of MLJ/MLJBase; instead, I think creating your own data structure (and APIs) for a "reduced" task specification will give you more flexibility, and you can always upgrade it to an MLJ task once you have an appropriate amount of information on hand to fill out the struct.

ablaom commented 4 years ago

Closed in favour of #236

ablaom commented 4 years ago

Quoting @juliohm from #236

I would like to contrast the proposal with what I am currently using in GeoStats.jl to bring more options to the table. The idea of matching models to tasks in our case is implemented via a trait called iscompatible(task, model), as shown here: https://github.com/juliohm/GeoStatsBase.jl/blob/master/src/learning/traits.jl#L11-L16 Notice also the granularity of tasks there. I have in mind very specific tasks like regression, classification, clustering and other less common tasks that make sense in specific domains of application. It seems like the tasks proposed above are at the level of supervised and unsupervised, which is too general a level.
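For reference, a minimal illustrative sketch of the trait-matching pattern being described (made-up types, not the actual GeoStatsBase.jl code):

# Fine-grained task types plus a Bool-valued compatibility trait, in the
# spirit of iscompatible(task, model). Purely illustrative.
abstract type LearningTask end

struct RegressionTask     <: LearningTask end
struct ClassificationTask <: LearningTask end
struct ClusteringTask     <: LearningTask end

abstract type LearningModel end

# Conservative default: a model is incompatible unless it opts in.
iscompatible(task::LearningTask, model::LearningModel) = false

# A concrete model declares the tasks it can solve:
struct DecisionTreeModel <: LearningModel end
iscompatible(::ClassificationTask, ::DecisionTreeModel) = true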

I think having tasks as first-class citizens in the framework is quite important, to be able to fully specify a learning problem without talking about models and solution strategies. See for example the spatial learning problem defined here: https://github.com/juliohm/GeoStatsBase.jl/blob/master/src/problems/learning_problem.jl I can define this problem once with a given task type (which can be a DAG of tasks), and solve the same problem with multiple models that are compatible with the task. For example, I can define a DAG of tasks and solve it in parallel with learning networks from MLJ, provided the learning network is compatible with the composite task.

My opinion is that we will not be able to exploit the full capabilities of the Julia language without a clear separation between ML problems and ML solution strategies. The separation requires a method to fully specify the ML problem, and in turn this method relies on the notion of learning tasks as described above. Can you please share your thoughts?

Your proposal seems to have some overlap with the existing MLJ model composition interface. As far as I am aware, this interface is the most powerful of its kind for a general machine learning toolbox. It is certainly at least as powerful as those provided by R or scikit-learn.

To help understand your task API suggestion, and how it fits in with the existing composition interface, it might be useful to describe a couple of common "mainstream" use cases (ie outside of spatial learning problems) that you are trying to address here.

juliohm commented 4 years ago

Thank you @ablaom for reopening the issue, and for quoting my comment from the other thread.

I would like to emphasise again the importance of separating problem specification from solution strategies. I understand that MLJ learning networks and model composition features are about solution strategies, i.e. how to manipulate data, transform it, concatenate it, etc. with functions and learning models. Tasks, on the other hand, are about learning problem specification, i.e. how to fully specify a learning problem to be solved by different models that are compatible with the task. I also understand that MLJ networks and model composition are features that can be used to solve learning problems specified with composite tasks (i.e. DAGs of tasks):

Problem specification with tasks

Regress y from x, and classify c from a and b. Then classify w from y and c.

Solution strategy with MLJ networks

Create a network that consumes x to produce y with linear regression, consumes a and b to produce c with a decision tree classifier, then consumes both y and c to produce w with a nearest neighbour classifier.

Alternatively, create a network similar to the one above but doing extra intermediate steps. Or a network with different MLJ models as components.
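To make the specification half concrete, here is a purely illustrative sketch of what a data-free task DAG for the problem above could look like (none of these types exist in MLJ):

# Illustrative only: the specification above, written as a DAG of data-free
# tasks over named learning variables.
struct RegressionTask
    features::Vector{Symbol}
    target::Symbol
end

struct ClassificationTask
    features::Vector{Symbol}
    target::Symbol
end

# A composite task is a collection of tasks; an edge is implied whenever one
# task's target appears among another task's features.
struct TaskDAG
    tasks::Vector{Any}
end

spec = TaskDAG([
    RegressionTask([:x], :y),          # regress y from x
    ClassificationTask([:a, :b], :c),  # classify c from a and b
    ClassificationTask([:y, :c], :w),  # then classify w from y and c
])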

Can you see the value in the separation of these two concepts? Also, do you agree that we can fully specify the problem without talking about specific instances of data? Our problem is specified in terms of learning variables (in my case these variables are spatial, but in other cases they could be text-based, time-series, etc.), and not in terms of tables, low-level operations on these tables, or the application of models.

If we can separate these two concepts, we will be able to specify the problem once for any type of data (or a subset of kinds of data) and then use different solution strategies that are dispatched on specific kinds of data. For example we could specify a regression problem x -> y generically, and solve it with a solution strategy where x and y are spatial variables (my use case) or another solution strategy where x and y are time-series, etc.
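In Julia terms, this separation maps naturally onto multiple dispatch; again a hypothetical sketch, not an existing API:

# Hypothetical sketch: the same problem object dispatched to different
# solution strategies depending on the kind of data.
abstract type LearningData end
struct SpatialData    <: LearningData end   # variables attached to a spatial domain
struct TimeSeriesData <: LearningData end   # variables indexed by time

struct RegressionProblem
    feature::Symbol
    target::Symbol
end

problem = RegressionProblem(:x, :y)   # specify x -> y once, generically

solve(p::RegressionProblem, data::SpatialData)    = "geostatistical strategy for $(p.target)"
solve(p::RegressionProblem, data::TimeSeriesData) = "time-series strategy for $(p.target)"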

Please let me know if something is not clear and I can try to clarify it further.

fkiraly commented 4 years ago

you can't go on vacation without @ablaom starting to tear down major interface design points :-)

Regarding the discussion about tasks, I think the proposed change in #236 is a bad idea for three reasons:

(i) Statistical - tasks are not uniquely defined by input/output scitypes! An important caveat is that both inputs and outputs may be endowed with interpretation - e.g., i.i.d. assumption yes/no, on-line vs off-line, etc. Granted, this isn't too severe a problem in the "classical" tabular world, but I'd say it is already visible there and will shoot us in the foot (in my opinion) in the "interesting" realms, such as geospatial or temporal.

(ii) Organisational - I don't think we should be changing major interface points without having at least a high-level summary design document of the pre- and post-change design. This is just to avoid a domino effect and "painting ourselves into a corner", especially in the case of a departure from an existing design (here: mlr2/3). For the discussion at hand, it would mean putting down the major classes/structs, how they interact, and how the use cases look.

(iii) Interface & usage - I think you get an unnecessary proliferation of argument types in trying to match models to data. To me, it seems somewhat opaque to the user how you would use it. Similar issues appear if you try to interface from the model side - to me it seems you run into a lot of case distinctions.

fkiraly commented 4 years ago

@juliohm , I agree mostly with your points, and in particular on that we shouldn't be removing "tasks".

Regarding the kind of solution strategies you talk about: indeed, solving a complex learning problem with a solution to a simpler task is an important strategy in advanced scenarios.

Somewhat obviously, it's called "reduction", and we've recently discussed the topic here https://arxiv.org/abs/1909.07872 in the context of the time series related sktime toolbox.

I believe (and the sktime dev team agrees or disagrees to varying degrees) that explicit task objects are the right way to formalize and abstract reduction. Reduction strategies would then be higher-order compositors, e.g., "make [forecaster] by [sliding window tabulation] and then applying [favourite supervised learner]".
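As a toy illustration of the reduction idea in plain Julia (no sktime or MLJ API involved, just sliding-window tabulation feeding an ordinary least-squares learner):

# Toy sketch of reduction: turn forecasting into tabular regression via
# sliding-window tabulation, then hand the table to any supervised learner.
function sliding_window(series::Vector{Float64}, w::Int)
    n = length(series) - w
    X = [series[i + j - 1] for i in 1:n, j in 1:w]   # n × w lag matrix
    y = series[w+1:end]                              # one-step-ahead targets
    return X, y
end

# "Favourite supervised learner": here, ordinary least squares via backslash.
fit_ols(X, y) = hcat(ones(size(X, 1)), X) \ y
predict_ols(β, x) = β[1] + sum(β[2:end] .* x)

# The composed forecaster: reduce, fit, then forecast one step ahead.
function forecast_one_step(series::Vector{Float64}, w::Int)
    X, y = sliding_window(series, w)
    β = fit_ols(X, y)
    return predict_ols(β, series[end-w+1:end])
end

series = collect(1.0:20.0)
forecast_one_step(series, 3)   # ≈ 21.0 for this linear series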

@ablaom 's favourite concept of learning networks, I think, maps very nicely onto pipelines and reduction strategies put into a DAG. At each stage, for each partial DAG, there's an implied task or learner scitype (which is a concept that @ablaom doesn't like).

fkiraly commented 4 years ago

@juliohm , however, I would slightly disagree on concepts:

What you call the problem specification is already the scitype of a composite strategy.

The specific task would be "classify w given a, b, c, y". "Regress y from x, and classify c from a and b. Then classify w from y and c." would already be a strategy for the above - more precisely, the scitype of a strategy that you can fill in with particular learners.

Why: ultimately, you would evaluate just how well w has been classified. A task is semantically defined (in my opinion) by the applicable class of evaluators (here: estimating classification generalization loss on an i.i.d. sample).

ablaom commented 4 years ago

@juliohm Thanks for the use-case description, which clarifies things nicely. My apologies for the delay in replying. I have been on holiday.

In MLJ 0.5.0 I have made exported learning networks mutable again, which allows one to deal with such use-cases, although not in the way you suggest. While it is true that learning networks themselves are bound to particular component model choices, when you export the learning network (using, for example, the @from_network macro) you now obtain a completely generic object, namely an instance of a new model type, whose fields (hyperparameters) are the component models, and which can be swapped out with new values.

This notebook/script details your use-case, but see also the simpler demonstration below.

In some preceding versions of MLJ one could mutate the fields of an exported model (that is, change nested hyperparameters) but not replace the field values themselves. Complete mutability does introduce possibilities for abuse, however.

Instead of defining composites by describing a class of models from which each component model is to be selected, the existing API constructs them through a kind of prototyping: explicitly specify some models and then make these the default values in the final model type.

Let me clarify this pattern further with a simple example. The following code defines a new composite model type WrappedRegressor with one field called regressor, whose default value is DecisionTreeRegressor(), and also an instance of the new model type, called comp. The wrapping just inserts a one-hot encoding preprocessor. The instance is evaluated against some data and the evaluation is repeated with regressor replaced with a new value, KNNRegressor().

using MLJ

## DEFINING THE LEARNING NETWORK

X = source()
y = source(kind=:target)

hot = OneHotEncoder()
Xcontinuous = transform(machine(hot,  X), X)

rgs = @load DecisionTreeRegressor
ŷ = predict(machine(rgs, Xcontinuous, y), Xcontinuous)

## EXPORTING THE LEARNING NETWORK AS STAND-ALONE MODEL

comp = @from_network WrappedRegressor(regressor=rgs) <= ŷ
julia> comp
WrappedRegressor(regressor = DecisionTreeRegressor(pruning_purity_threshold = 0.0,
                                                   max_depth = -1,
                                                   min_samples_leaf = 5,
                                                   min_samples_split = 2,
                                                   min_purity_increase = 0.0,
                                                   n_subfeatures = 0,
                                                   post_prune = false,),) @ 1…29

## USING THE COMPOSITE MODEL

Xnew = (a=categorical(rand("abc", 20)), b=rand(20))
ynew = rand(20)

evaluate(comp, Xnew, ynew, measure=rms).measurement[1] # 0.318637

# change the regressor being wrapped:
comp.regressor = @load KNNRegressor

evaluate(comp, Xnew, ynew, measure=rms).measurement[1] # 0.360605

At present there is no restriction on what models can be used to replace the default values in an exported composite model, which is the abuse mentioned above. On the other hand, the present API has some advantages:

  • It avoids an extra layer of abstraction ("tasks" or whatever).

  • The syntax for building a learning network is identical to the syntax with which the user is already familiar from basic fit/predict (see, e.g., my JuliaCon2019 talk).

  • If one isn't concerned with model re-use, one can just define the learning network and never bother with defining the composite model type.

The possibility for abuse could be mitigated in various ways, and one way would be along the lines you have suggested: attach to each component model in the DAG a specification of what other models are allowed there. Rather than formalizing this specification with a new object (you call it a "task") I would prefer specifying a Bool-valued function on models, where "model" now refers to an entry in the MLJ registry (these are named tuples specifying all model traits). One can use such functions already to do model search (please see the relevant section of the manual). This would, in my opinion, be more flexible and more future-proof. There is no need for us to agree on what attributes (model traits) the new struct would have; and if more traits are added, there is no need to make a breaking change to the struct definition later.
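For concreteness, a sketch of the kind of Bool-valued filter meant here, driving MLJ's model search; the trait names follow my reading of the registry and should be treated as approximate:

using MLJ

# A Bool-valued function on registry entries (named tuples of model traits).
acceptable(m) = m.is_supervised &&
                AbstractVector{Continuous} <: m.target_scitype

# The same kind of function already drives model search:
candidates = models(acceptable)

# In the spirit of the proposal, the same predicate could also guard a slot in
# a composite, e.g. only allow comp.regressor = model when acceptable(info(model)).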

juliohm commented 4 years ago

Thank you @fkiraly for sharing your views. Below are my answers to some of your comments.

What you say is the problem specification is already a scitype of a composite strategy.

In my snippet example where I separate problem specification from solution strategy, I never mentioned the word scitype. The problem is specified in terms of variables only.

The specific task would be "classify w given a, b, c, y". The "Regress y from x, and classify c from a and b. Then classify w from y and c." would already be a strategy to the above - more precisely, the scitype of a strategy that you can fill with particular learners.

I disagree. I think we need to specify the problem in terms of inputs and outputs. For example, consider the case above where the variables y and c are created on the fly by the models. We need to capture this by the time we define the problem, because saying "classify w given a, b, c, y" is too generic or too specific depending on how you read it: it can mean either 1) an under-specified problem or 2) a single, simple classification problem with 4 features. Also, you bring up scitypes again, and I think scitypes are not directly related to the present discussion.

I maintain my view that the current design lacks essential components for describing learning problems, and that learning networks don't seem sufficient to achieve a level of generality beyond what other programming languages can offer.

juliohm commented 4 years ago

Thank you @ablaom , below you can find my answers to your comments.

In MLJ 0.5.0 I have made exported learning networks mutable again, which allows one to deal with such use-cases, although not in the way you suggest. While it is true that learning networks themselves are bound to particular component model choices, when you export the learning network (using, for example, the @from_network macro) you now obtain a completely generic object, namely an instance of a new model type, whose fields (hyperparameters) are the component models, and which can be swapped out with new values.

I am having a hard time connecting learning networks with problem specifications. I am also having a hard time understanding how the mutability of learning networks relates to those specifications.

In some preceding versions of MLJ one could mutate the fields of an exported model (that is, change nested hyperparameters) but not replace the field values themselves. Complete mutability does introduce possibilities for abuse, however.

Again, I am having a hard time connecting the original discussion with the mutability of learning networks.

Instead of defining composites by describing a class of models from which each component model is to be selected, the existing API constructs them through a kind of prototyping: explicitly specify some models and then make these the default values in the final model type.

I don't quite understand. Maybe you are saying that solution strategies are not limited to specific models, and that different models can be plugged into the learning networks? Isn't that disconnected from the problem specification?

Let me clarify this pattern further with a simple example. The following code defines a new composite model type WrappedRegressor with one field called regresssor, whose default value is DecisionTreeRegressor(), and also an instance of the new model type, called comp. The wrapping just inserts a one-hot encoding preprocessor. The instance is evaluated against some data and the evaluation is repeated with regressor replaced with a new value, KNNRegressor().

Isn't this example illustrating solution strategies with learning networks where regression models are plugged in as default values? Do you see how this example is missing a well-defined problem specification? What is this learning network trying to do? Is it trying to regress something? Assuming we have a simple regression problem, it seems to me that the code is showing how you solve the problem with one-hot encoding and KNNRegressor as the default transforms and models. It is not showing how you represent the regression problem in MLJ so that people not interested in learning networks could propose different solution strategies.

On the other hand, the present API has some advantages:

  • It avoids an extra layer of abstraction ("tasks" or whatever)

Is this an advantage?

  • The syntax for building a learning network is identical to the syntax with which the user is already familiar from basic fit/predict (see, e.g., my JuliaCon2019 talk).

If this syntax is about the fit/predict, it is not about the problem specification. It is about how you solve the problem (which was not formally defined in the examples above).

  • If one isn't concerned with model re-use, one can just define the learning network and never bother with defining the composite model type.

Is there a practical use case where users wouldn't be interested in model reuse? I didn't get this comment.

Sorry if I am missing the big picture here. Maybe we should try to set up a meeting again to brainstorm? I appreciate the reply.

juliohm commented 4 years ago

I am closing this issue for now, as I don't think it will be addressed anytime soon given the design direction of the project. Thank you for the discussion here anyway.