JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Task design discussion #166

Closed. juliohm closed this issue 4 years ago.

juliohm commented 5 years ago

Dear all,

I've started reading MLJBase in an attempt to develop spatial models using the concept of tasks. Is it correct to say that the current implementation of tasks requires the existence of data?

I would like to specify tasks in a more general context without data. This is useful for example to define problems where the data are not just "tables", but have other interesting properties.

I would appreciate it if you could comment on how to split tasks from data, and how I can help with this split.

ablaom commented 5 years ago

Thanks @juliohm . We are keen to have your feedback and involvement in MLJ from the geo stats community.

We have been discussing tasks at our Sprint this week and I would like to share your views there. To clarify, you do think it's good to have a formal notion of task, but the task should only include metadata, rather than actual data?

If so, what do you think the metadata should include, for common tasks (supervised, unsupervised, something else)?

The present design relies on this metadata stored in a supervised task:

If you think tasks are a good idea, what function, apart from model query ("find all models solving a given task"), do you see them serving?

juliohm commented 5 years ago

Hi @ablaom, thank you for the careful reply.

Yes, I think we would benefit from a concept of learning task that is disconnected from the data. We can brainstorm a hierarchy of learning task types to exploit multiple-dispatch. For example:

# abstract tasks for multiple-dispatch
abstract type LearningTask end
abstract type SupervisedLearningTask <: LearningTask end
abstract type UnsupervisedLearningTask <: LearningTask end

# concrete tasks to be solved by learning models
struct RegressionTask <: SupervisedLearningTask
  # relevant metadata
end

struct ClassificationTask <: SupervisedLearningTask
  # relevant metadata
end

struct ClusteringTask <: UnsupervisedLearningTask
  # relevant metadata
end

seems reasonable and expressive to me; what do you think? Regarding the metadata for RegressionTask, for example, I think the names of the regressed variables as a list of Symbols, and optionally the names of the regressors, are all we need. Details about the input, like the Union type, multivariate vs. univariate, etc., seem more a function of the particular dataset than of the actual regression task. We could have an additional wrapper type like MLJProblem that encapsulates the task plus the data, and then the queries about the inputs would be implemented for this type instead of at the task level:

struct MLJProblem{T<:LearningTask}
  task::T
  data # some relevant type
end

is_multivariate(p::MLJProblem) = # do some query on p.data
input_type(p::MLJProblem) = # do some query with p.data and p.task
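
For concreteness, here is a hypothetical, self-contained version of the skeleton above with the placeholders filled in, assuming the data is a column table (a NamedTuple of equal-length vectors); the field and function names are illustrative only, not a concrete proposal:

# hypothetical sketch; assumes data is a NamedTuple of columns
abstract type LearningTask end
abstract type SupervisedLearningTask <: LearningTask end

struct RegressionTask <: SupervisedLearningTask
  targets::Vector{Symbol}                   # names of the regressed variables
  features::Union{Nothing,Vector{Symbol}}   # optional names of the regressors
end

struct MLJProblem{T<:LearningTask,D}
  task::T
  data::D
end

# queries answered at the problem level, combining task metadata and data
is_multivariate(p::MLJProblem{RegressionTask}) = length(p.task.targets) > 1
input_names(p::MLJProblem{RegressionTask}) =
  something(p.task.features, setdiff(collect(keys(p.data)), p.task.targets))
input_type(p::MLJProblem{RegressionTask}) =
  Union{[eltype(p.data[name]) for name in input_names(p)]...}

data = (x1 = rand(10), x2 = rand(10), y = rand(10))
p = MLJProblem(RegressionTask([:y], nothing), data)
is_multivariate(p)   # false
input_names(p)       # [:x1, :x2]
input_type(p)        # Float64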

Why is it important to have this separation of task and data? In my case I need to solve spatial problems instead of generic MLJProblem, for example:

struct GeoStatsProblem{T<:LearningTask}
  task::T
  data::SpatialDataType # from GeoStats.jl
end

Now I can implement the Problem interface without touching the tasks from MLJBase. I can simply pre-process some data, manipulate things around and then finally create a traditional MLJProblem to solve.

Please let me know what you think of this proposal. I can submit a PR if you agree, but I need to better understand the implications of separating the data from the task in the current MLJ implementation.

fkiraly commented 5 years ago

Some relevant discussion:

#158 (https://github.com/alan-turing-institute/MLJ.jl/issues/158)

https://github.com/alan-turing-institute/sktime/issues/20 (part 4)

We had long discussions, in sktime and between teams, about what a task should be - and we've noticed that requirements differ between benchmarking and the fit/predict workflow.

Your suggestion is essentially option 4 in the sktime thread, @juliohm, right? Where I think "problem" should also specify splits, and perhaps a loss function, as @davidbp suggests in #158?

juliohm commented 5 years ago

Hi @fkiraly, thank you for the follow up.

I am not sure if the suggestions made in the linked issues are the same, but I do believe that a separation of these concepts is necessary.

Also, I should add that splitting schemes and loss functions are not part of the problem specification, but are part of the solution strategy. For example, it is only when we decide to use a learning framework such as empirical risk minimisation that we need to specify some loss or cost for each mis-predicted example. The concept of loss is a very specific concept that may not be useful in other learning frameworks (statistical learning theory). The concept of splits is also very solution-driven. If I am interested in model selection or assessment via some cross-validation or bootstrap methods, then I think of splitting mechanisms, otherwise there are methods of assessing generalisation error of hypotheses classes that do not require splitting.

ablaom commented 5 years ago

@juliohm

Mmm. I think we have disparate views on what "task" is to mean:

...and optionally the name of the regressors

For me, a list of regressors is part of the strategy for completing/solving the task. We are interested in matching "task" = "objective + metadata" to a list of "strategies" solving the task, from which the user can choose - either one at a time, or systematically. Which is also at odds with:

Details about the input, like the Union type, multivariate vs. univariate, etc., seem more a function of the particular dataset than of the actual regression task.

But this is exactly the information required to determine which strategies can solve the problem.

Perhaps it would be helpful for you to explain (i) what function the "tasks" in your terminology are meant to perform, so we can explore how this can be achieved in an MLJ integration, and (ii) what aspects of MLJ "tasks", if any, you perceive might get in your way. Regarding:

define problems where the data are not just "tables", but have other interesting properties.

You may be right, but perhaps you could give some concrete examples. One can cast a lot of things as tables or vectors (e.g., the input of an image classifier in MLJ is just a vector of Arrays, each with scitype ColorImage). By specifying new scitypes and scitype trait functions, we could presumably capture the kinds of description you're after - or you could define your own custom ones?

ablaom commented 5 years ago

@juliohm BTW, any chance you'll be at JuliaCon to meet?

juliohm commented 5 years ago

Thank you @ablaom, let me try to answer the questions you raised.

Regarding the definition of a task: when I think of a task, I think of a well-defined objective such as predicting some target variable, or clustering some data. We could be more general as well, and adopt the classical definition (Mitchell 1997, Machine Learning):

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

In this classical definition, a task is pretty much anything that an intelligent agent could do: "playing checkers", "driving on a highway using sensors". In this definition it is also clear that the experience E (i.e. training data) is something separate from the task T.

In his book, Mitchell describes well-posed learning problems as a triplet: the task T, the performance measure P, and the experience E.

My only critique of this definition of problems is that it includes the performance measure. My personal take is that, in terms of implementation with actual code in an ML framework, this performance measure shouldn't be part of the problem per se, but something that we can freely change for the same problem. For example, I may check whether or not the learner learned according to different performance measures.
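
As a trivial, purely illustrative sketch of that point (these names are not a proposal for MLJ's API), the measure can just be an argument that we swap freely while the problem and the predictions stay fixed:

# illustrative only: performance measures as swappable functions
mae(ŷ, y)  = sum(abs,  ŷ .- y) / length(y)
rmse(ŷ, y) = sqrt(sum(abs2, ŷ .- y) / length(y))

evaluate(measure, ŷ, y) = measure(ŷ, y)   # same problem and predictions, any measure

ŷ, y = [1.0, 2.0, 3.5], [1.0, 2.5, 3.0]
evaluate(mae,  ŷ, y)    # ≈ 0.33
evaluate(rmse, ŷ, y)    # ≈ 0.41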

Based on this viewpoint, I was trying to motivate the design in which we consider ML problems to be made of a task plus some data (or experience) on which to learn. I am assuming that our interest in a ML framework is not to answer the question of whether or not we learned, but to use a learned model in unseen data.

Regarding the aspects of the current implementation that may be limiting. I think the example with spatial data is a good one. The fact that data is spatial is much more profound than just flattening or doing some rearrangement of the data. Spatial autocorrelation breaks some of the assumptions in learning theory, and the models developed in spatial statistics are very different. For example, the models are usually "trained" locally within a neighbourhood, and then combined somehow to produce a global model. I am still trying to digest to what extent having the data inside the task object would complicate things, but it doesn't feel natural at least (see classical definition above).

Unfortunately I won't be able to join JuliaCon this year, but I am happy to brainstorm more on any occasion.

fkiraly commented 5 years ago

@juliohm regarding formal ontology/taxonomy, I disagree with your point on loss functions or performance measure. I think you are subject to a confusion, but subject to a confusion that is common in machine learning.

Loss functions, or more generally performance measures, have two purposes: (i) use in fitting models, say in empirical risk minimization; (ii) use in evaluating models, with respect to how well they performed at doing something (a "task", perhaps).

The key distinction is that not all models operate internally by minimizing a loss, yet all models solving a certain task can still be compared by their performance as measured by a loss function or quantitative performance measure.

Re-sampling, which may or may not include splits, is used only as part of estimates: these can be estimates of performance that are used in (i) fitting, or (ii) evaluation. There are ways to estimate performance, or fit models, that do not necessarily include re-sampling.

Common confusions in machine learning stem from multiple issues: (a) the Vapnik-style statistical learning framework focuses on empirical risk minimization; in the "purist" variant of this framework, all models are fitted and evaluated using the same loss, i.e., (i) happens whenever (ii) happens. (b) Evaluation is usually done via re-sampling, so re-sampling happens whenever (i) happens and whenever (ii) happens.

Some examples where these concepts do not coincide: (a) prototype methods, e.g., k-nearest neighbors, are not an instance of empirical risk minimization (well, not that I know of, but even if they were, you could write down random algorithms that aren't); (b) you can evaluate models without cross-validation or re-sampling: for example, the R-squared measure in least squares regression, or deviance in generalized linear models, with concomitant theorems that tell you why these are (technically/empirically robust) measures of performance - though it is true that these tend to be specific to model classes. You can also tune models without cross-validation or re-sampling: maximum likelihood for Gaussian processes is an example - though that is empirical risk minimization (log-loss).

juliohm commented 5 years ago

Hi @fkiraly, I don't see how you disagree with my observations. Your reply seems to reiterate all the points I raised in my reply? Where is the confusion?

fkiraly commented 5 years ago

@juliohm perhaps the confusion is entirely mine then? Perfectly possible...

You said "The concept of loss is a very specific concept that may not be useful in other learning frameworks" - I disagreed with that on the basis of the distinction (i) use in fitting vs (ii) use in evaluation, thinking you were rejecting general usefulness of losses/scores on the basis of (i) in isolation. Perhaps I got that wrong.

You also said "The concept of splits is also very solution-driven. If I am interested in model selection or assessment via some cross-validation or bootstrap methods, then I think of splitting mechanisms, otherwise there are methods of assessing generalisation error of hypotheses classes that do not require splitting."

I agree with that. Though if you talk about generalisation error, it's usually defined as an expected loss, and perhaps always in terms of quantitative utility/performance measures - even if a model (or "element in the hypothesis class") isn't chosen by loss minimization or utility maximization.

fkiraly commented 5 years ago

And I disagree with your argument that "I am assuming that our interest in a ML framework is not to answer the question of whether or not we learned, but to use a learned model in unseen data." necessitates inclusion of data in the task.

Generally, I don't completely agree with the premise. I think there are two main use cases: (1) making predictions, or more generally applying models (2) obtaining guarantees for how well a strategy, or fitted model, performs on future data

For (1), you need to tell the algorithms what to do - this may be encoded in a task. For (2), you need to benchmark/evaluate. Again, this can be encoded in a task.

But neither strictly necessitates inclusion of the data in the task object, so the argument is incomplete.

juliohm commented 5 years ago

I'm still a bit confused about where we disagree. All the comments seem in agreement.

Regarding the inclusion of data in the task object, what is your opinion?

fkiraly commented 5 years ago

I'm still a bit confused about where we disagree. All the comments seem in agreement.

Well, then we agree and I've been confused. Much better than the opposite, right?

Regarding the inclusion of data in the task object, what is your opinion?

Changes with the time of day...

More precisely, I see multiple decision points:

(i) whether you have one or multiple "task-like" objects, e.g., one heavy-weight one for benchmarking (openML style), and one which is low-weight, say, schema-like (mlr style);

(ii) which of the following, if at all, you wish to include: task descriptors ("arguments to fit/predict etc"), a pointer to the data, the actual data, a re-sampling scheme, loss function(s) for evaluation, control parameters for the strategies/algorithms (e.g., "use at most that much resource");

(iii) whether the things in (ii) are optional or not.

Regarding inclusion of the data specifically: I am in general wary of doing this. mlr (version 2) is attaching the original data to everything, to tasks, fit results, benchmarking results. This kills the memory with even small datasets to start with. But even if you manage the reference efficiently with pointers, you are essentially creating a "decorated dataset" that you have to manage, with possible attachment of a data accessor facade (eventually may be: database location, hard drive pointer, etc), rather than a task descriptor in isolation that's small in memory.

davidbp commented 5 years ago

My two cents,

I like the idea of involving performance metrics on a task because, at the end of the day, how do you decide whether a learned machine learning model is good? You probably use some metric (or a list of metrics) and the metrics require data.

In ML (leaving reinforcement learning aside) tasks involve data, a learner, and an evaluation metric. There are always hyperparameters to tune, and the standard approach is to use data to tune them. Therefore, it seems natural to involve data in the notion of "task". What does not seem that natural to me is the notion of "machine" involving data (but let's leave this for another issue).

Julio, about: My personal take is that in terms of implementation with actual code in a ML framework, this performance measure shouldn't be part of the problem per se, but something that we can freely change on the same problem.

I'm not sure what to take from that. As long as we can add to a task any measure to monitor the performance on the task, what is the problem? You are not required to have a performance measure (or you might put in a dummy one: whatever gets predicted is correct).

Example of a task that does not fit the current MLJ API correctly

NLP tagging task:

In this case

Julio, could you write an example so that we can understand where the API does not allow you to do something (or where you really don't like how it's done)?

juliohm commented 5 years ago

@davidbp your example agrees 100% with what I am saying. See:

You have 3 bullet points, and the task (the first bullet point) is separate from the data (the third bullet point). These two live in different objects in memory, with different types, and so on. I am saying that the current MLJ approach of putting the data inside the task object is limiting. There is no strong reason to do that. Take for example my GeoStats.jl family of packages. I was able to separate very well problem-related from solution-related concepts. This gives me a lot of flexibility in moving things around without worrying about copying data.

In any case, we all seem to agree that tasks and data are different things, and shouldn't be mixed in a single Julia struct. Correct?

fkiraly commented 5 years ago

What does not seem that natural to me is the notion of "machine" involving data

@ablaom I hope you are reading this :-)

juliohm commented 5 years ago

I saw that @davidbp was editing the comments while I was replying, so I may have missed a sentence, but in any case: are we discussing something we all agree on, namely that the data shouldn't be inside the task?

juliohm commented 5 years ago

Let's make it a binary yes/no answer:

Do you think the data should be inside the task?

My answer: no.

fkiraly commented 5 years ago

Well, not everyone agrees just because some random people happen to agree somewhere on the internet.

Anyway, in sktime we ended up not including data, but allowing construction using data, the "scitype" of which is then stored in the task. So given my current understanding I'd also lean towards "no".
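
A rough sketch of that pattern (a hypothetical type; plain element types stand in for scitypes, and none of this is sktime's or MLJ's actual API): the task is constructed from the data but retains only its schema:

# hypothetical sketch: construct from data, keep only the schema, not the data
struct SupervisedTask
  targets::Vector{Symbol}
  input_schema::Dict{Symbol,DataType}   # column name => element type (scitype stand-in)
end

function SupervisedTask(data; targets::Vector{Symbol})
  inputs = setdiff(collect(keys(data)), targets)
  SupervisedTask(targets, Dict(name => eltype(data[name]) for name in inputs))
end

X = (age = [25, 31, 47], weight = [70.0, 82.5, 65.2], label = ["a", "b", "a"])
task = SupervisedTask(X; targets = [:label])   # the task keeps the schema; X itself is not stored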

fkiraly commented 5 years ago

Though if you were to have two task types, one openML style (heavy) for benchmarking and one mlr style (light) without data, I'd say data should be part of the heavier, openML-style "task". Akin to having one "package" that represents a Kaggle-like "challenge".

juliohm commented 5 years ago

@fkiraly could you please elaborate on how the number of tasks affects this decision?

juliohm commented 5 years ago

This representation of a "Kaggle-like challenge" seems to be something orthogonal to tasks. It is, as the name says, a challenge, and a challenge != a task.

davidbp commented 5 years ago

I am not sure I would spend too much time arguing over the definition of a word that is a Julia type. Task in MLJ might not mean task in the sense Mitchell describes it in his book.

I don't know of any agreement in the ML community on what the different parts involved in tuning a machine learning pipeline should be called.

Actually, not even for neural net layers is there agreement over how things should be called. Some people call layers that apply an affine transformation linear layers (PyTorch). Others call the same type of layer Dense layers (TensorFlow), even though you can perfectly well have Wx+b with "sparse" data. And no one calls them "affine layers", even though that would probably be the most descriptive name for the type of transformation they define.

The definition of a task in MLJ is a synthesis of three elements: data, an interpretation of that data, and a learning objective. Once one has a task one is ready to choose learning models.

as stated here:

https://alan-turing-institute.github.io/MLJ.jl/dev/working_with_tasks/

I think what would be very valuable, @juliohm, is to have an example (a notebook link in your repo, for example) where people could see the barrier that you face because of this API.

I can think, for example, of a case where the API might feel restrictive: you might have an infinite amount of data. Imagine you work with images and you keep applying crops and transformations to your data. In this case, though, you could pass into the task (or the fit method) a function that does that (such as Augmentor.jl).

Do you think the data should be inside the task?

My answer: yes (or probably yes)

Why?: In order to evaluate a (task, model, data) triplet you need the three things. It's like the father-son-God triplet. You can think of (task, model, data) as a single entity of three separate things that form a triplet you can't split.

Do you think the data should be inside the machine?

My answer: No

Why?: I think of a machine like a cogwheel. With several cogwheels (machines) you can make a bigger cogwheel (machine). Once you have your machine you evaluate it with data.

Possible solution to avoid confusions

Make a list containing (task, model, data) triplets and put it into the documentation. Explain what those words mean in the context of MLJ. If necessary, rename them as mlj_task, mlj_model, mlj_data so that people do not get confused.

juliohm commented 5 years ago

I disagree, David. If the MLJ task concept includes the full definition of an ML problem ready to be solved, then it should be named a problem type, not a task type. Tasks are just objectives; they are not ready to be solved. Only in conjunction with particular experience (data) can we start learning. Hence the triplet you mentioned.

Let's make sure that we separate these things clearly, otherwise the framework won't meet the needs of many people, including me. Why not adopt the standard terminology established in Mitchell? What is the rationale for adopting something else? Is there a benefit in mixing data with tasks in the same object? Many questions for which I can't find a good answer.

fkiraly commented 5 years ago

how the number of tasks affects this decision?

I was referring to number of different types, not number of instances.

E.g., you could be using "task" for the standard fit/predict workflow, and "challenge" whenever you want to send something to a benchmarking orchestrator.

davidbp commented 5 years ago

Well, to be fair, I didn't know about the existence of the notion of a "well-posed learning problem" combining those 3 things.

Well posed Learning Problem definition: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Maybe we could simply change the name of the task struct to WellPosedLearningProblem or LearningProblem. Then, to avoid confusion, we could simply add the definition of a well-posed learning problem to the documentation, making clear that a LearningProblem is a triplet.

juliohm commented 5 years ago

In the snippet of code I drafted at the beginning of the discussion there was a struct, for example:

struct MLJProblem{T<:LearningTask,D}
  task::T
  data::D # could also be more general: experience::E
end

We could adopt a more general name without the MLJ prefix as you suggested:

struct LearningProblem{T<:LearningTask,D}
  task::T
  data::D # could also be more general: experience::E
end

davidbp commented 5 years ago

According to the definition you provided, shouldn't it be:

struct LearningProblem{T<:LearningTask,D,F<:Function}
    task::T
    data::D # could also be more general: experience::E
    performance_metric::F
end

juliohm commented 5 years ago

Although the definition of ML problems in Mitchell's book is quite general, it is based on the idea of answering whether or not the intelligent agent is learning (hence the performance measure). Answering this question, however, is a theoretical pursuit, and doesn't play a major role in practical ML applications - at least that is my initial thought. Please let me know if you disagree. In that same comment I thought that maybe the performance measure could be left out of the problem definition, but perhaps there are good reasons to keep it in the definition. It is not clear to me yet what the benefits would be.

I will think more carefully about this. Please let me know if you have ideas of how the performance measure (as defined in Mitchell) could be useful.

juliohm commented 5 years ago

Also, it would be fruitful to discuss the result of solving an ML problem:

using MLJ

# setup task, experience, and possibly performance
T, E, P = ...

problem = LearningProblem(T, E, P)

result = solve(problem, learner)

Is this result a smarter version of learner? For example, suppose that the task T is of type RegressionTask. In this case, possible learners are regression models. It seems reasonable to return a fitted regression model as the result. The regression model itself is just a struct containing the hyperparameters (or metadata) of the model. The smarter version of it would be a struct containing these same hyperparameters plus the learned parameters (e.g. weights in a neural network).

If we adopt an interface like this, we can gradually generalize the tasks and experiences to handle real-world problems without a lot of manual pre-processing, like converting a high-level task into one of the fundamental ones (i.e. regression, classification, ...).

juliohm commented 5 years ago

Dear all,

Do you agree with this reformulation of tasks in MLJBase.jl? Can I submit a PR?

Thank you,

fkiraly commented 5 years ago

@juliohm actually, I disagree - your suggestion is discrepant from the least common denominator of alternatives that I would posit most other people agree with, while "fighting" over which special case should apply. I think these are all subcases of (citing myself) an object which may or may not include: task descriptors ("arguments to fit/predict etc"), a pointer to the data, the actual data, a re-sampling scheme, loss function(s) for evaluation, and control parameters for the strategies/algorithms (e.g., "use at most that much resource").

Yours includes "experience", and it is not clear to me what that should be in the supervised learning case, and "performance", which seems like something you have to estimate ex post facto, e.g., in a benchmarking experiment, rather than something you already provide to the task.

Though your idea of solve seems very interesting - there again, one could think about what it should do, and what information it should have, in which format.

We're also planning to have some internal discussions at Turing this week, and had some interesting discussions with @berndbischl who, amongst others, said that data should be part of the task, and explained his new re-design of the experiment interface in mlr3 which reflects some of the other discussions we've had, too.

juliohm commented 5 years ago

Thank you. In that case I will work on my own interface with a light design for tasks. Please feel free to close the issue. I will close it when I have access to a computer.

fkiraly commented 5 years ago

@juliohm no reason to close this down! This was meant as constructive disagreement rather than dismissal.

In fact, we've had internal discussions today which ended, well, not in agreement.

One thing we were unsure about was what your interface would concretely look like when invoked by a user for (a) the simple fitting/prediction workflow, and (b) a benchmarking workflow. Also, which workflows it would easily support that the mlr- or openML-inspired ones don't - e.g., how you imagine solve working.

It would be really helpful if you could write a sketch of a full workflow, i.e., defining variables for data, models, etc, especially for the thing you call "experience".

fkiraly commented 5 years ago

In addition, it appears that @berndbischl 's mlr 3 is completely migrating to something called "experiment", i.e., there's no separation between the (a) fit/predict workflow and (b) benchmarking. I find that a questionable (albeit also defensible) choice.

fkiraly commented 5 years ago

Also, if you have a PR ready, we'd be interested to have a look at it as a design study. Though it wouldn't be guaranteed that we go with it - key issues are (i) how much of existing code it would break and (ii) how @ablaom (who needs to live with it) feels about it.

berndbischl commented 5 years ago

In addition, it appears that @berndbischl 's mlr 3 is completely migrating to something called "experiment", i.e., there's no separation between the (a) fit/predict workflow and (b) benchmarking. I find that a questionable (albeit also defensible) choice.

.....

After your feedback we are now very likely changing that. @mllg is still evaluating that

ablaom commented 5 years ago

@mllg

Re experiments and data vs views of data

You may want to note that in the current MLJ design I have implemented something essentially identical to your "experiments". They are instead called "machines" - an unfortunate choice of name. Perhaps better is "laboratory", where "experiments" are conducted (after specifying train and test rows). Incidentally, I also specify the splitting of data using row indices. This "experiment" design choice stemmed in part from my desire for the pipeline API to closely resemble the simpler single-model workflow, and my whole design process began by considering pipeline requirements. The nodes in my pipeline, you see, are labelled by experiments, where the explicit data (in your case, the task) is replaced by references to other nodes. (Unlike in your design, as I understand it, two nodes can be labelled by (point to) the same experiment, which allows, for example, for the accommodation of target transformation/inverse transformation.)

Perhaps a detailed comparison of the pipeline API's can be shared elsewhere.

In MLJ, the user need not explicitly specify when to "update" a model (when, e.g., increasing an iteration parameter) rather than "refit" the model from scratch; another role of "experiments" is to remember the last experiment performed, so to speak, so that the appropriate lower-level fit/update methods can be dispatched accordingly. Perhaps this is not a responsibility of experiments in the mlr3 design?
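
To make that bookkeeping concrete, here is a toy sketch of the idea (this is not MLJ's actual implementation; the types and the iterations-only update rule are purely illustrative):

# toy sketch: a "machine" remembers the model state at the last fit, so a later
# fit! can dispatch to a cheap update instead of a full refit
mutable struct DummyIterativeModel
  n_iterations::Int
end

mutable struct ToyMachine
  model::DummyIterativeModel
  last_fitted::Union{DummyIterativeModel,Nothing}   # copy of the model at the last fit!
  fitresult::Union{String,Nothing}
end
ToyMachine(model) = ToyMachine(model, nothing, nothing)

function fit!(mach::ToyMachine)
  if mach.last_fitted === nothing
    mach.fitresult = "trained from scratch ($(mach.model.n_iterations) iterations)"
  elseif mach.model.n_iterations > mach.last_fitted.n_iterations
    extra = mach.model.n_iterations - mach.last_fitted.n_iterations
    mach.fitresult = "warm restart: $extra additional iterations"   # the cheap update path
  else
    mach.fitresult = "other change detected: re-trained from scratch"
  end
  mach.last_fitted = deepcopy(mach.model)
  return mach
end

mach = ToyMachine(DummyIterativeModel(10))
fit!(mach)                       # trains from scratch
mach.model.n_iterations = 20
fit!(mach)                       # only the iteration count grew: update, not refit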

In our private discussions @fkiraly has raised some valid criticism of the machines/experiments and the row-indices specification of splits. These well-informed criticisms notwithstanding, I think it significant that the initial MLJ/mlr3 designs have converged on this experiment design choice - I think mostly independently, although possibly for different reasons. So, naturally, I am keen to hear your own reflections on this choice.

juliohm commented 5 years ago

Dear all,

Sorry for the delay, I had a busy week. I'd be happy to discuss that proposal further. I've actually started experimenting with the idea since we last discussed it here, and you can find some drafts of the code in GeoStatsBase.jl, namely in the tasks.jl and problems/learning_problem.jl files.

You can see that in my case there is a whole new set of data types and other things that make up a well-defined problem, and that I don't want to carry the data inside the task because it can inhibit, or complicate, some parallelization algorithms. My tasks are light descriptions of what to do, as we've been discussing and as commonly found in the traditional ML literature.

Regarding the fit/predict interface, please consider the following updated version of a snippet of code I pasted above:

using MLJ

# setup task, experience, and possibly performance
T, E, P = ...

problem = LearningProblem(T, E, P)

smarter = learn(problem, learner)

In this prototype interface, the learner is an ML model. For example, if the task T is a regression task, then possible learners are regression models. The smarter object, the result of the learn (or "fit") function, is a smarter version of learner (e.g. a neural net with updated weights). Someone can take this new object and then apply it to the same task (or possibly a different task) with a new, unseen data set:

eval(smarter, T, newdata)

This is still a very rough idea, just to share with you some thoughts on how it could look. I think we can exploit Julia's generics features to come up with a powerful API where learners are trained with a given set of tasks and data, and then applied to a possibly different set of tasks, with possibly different data. Think of transfer learning as the main goal here.
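
To make this a bit more tangible, here is a hypothetical, self-contained toy version of the workflow, with a single-target RegressionTask for brevity, a trivial mean predictor standing in for a real learner, and apply in place of eval (which would shadow a Base function); all names are illustrative:

# hypothetical toy sketch of the learn/apply idea; not a proposed MLJ API
struct RegressionTask
  target::Symbol
end

struct LearningProblem{T,D}
  task::T
  data::D          # the "experience" E
end

struct MeanRegressor end                 # learner: hyperparameters only (none here)
struct FittedMeanRegressor
  mean::Float64                          # the learned parameter
end

# learn: consume a problem and a learner, return the "smarter" fitted learner
function learn(p::LearningProblem{RegressionTask}, ::MeanRegressor)
  y = p.data[p.task.target]
  FittedMeanRegressor(sum(y) / length(y))
end

# apply the fitted learner, for the same task, to new unseen data
apply(m::FittedMeanRegressor, ::RegressionTask, newdata) =
  fill(m.mean, length(first(newdata)))

train = (x = [1.0, 2.0, 3.0], y = [2.0, 4.0, 6.0])
Xnew  = (x = [4.0, 5.0],)
smarter = learn(LearningProblem(RegressionTask(:y), train), MeanRegressor())
apply(smarter, RegressionTask(:y), Xnew)   # [4.0, 4.0]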

fkiraly commented 5 years ago

@juliohm , not sure whether I get this idea concretely.

It would appear to me that it heavily hinges on what the objects T, E, P are.

What would these be concretely, in terms of data, task meta-information, etc? A few more lines of "imaginary code" would help me understand, much appreciated.

juliohm commented 5 years ago

@fkiraly, the T, E, P are explained in more detail in some of the previous comments.

I am working on new research for which I will need some of these concepts. I am currently implementing some of them in GeoStats.jl, and maybe that will serve as a place for experimentation.

Please let me know if something else is not clear, the idea of splitting data from tasks is quite relevant to me, and possibly other people that wish to use MLJBase.jl as a common base.

fkiraly commented 5 years ago

Oh, so T is meant to be a task just as above? Meaning, your suggestion is not distinct from the suggested task interfaces, but would build on one of the (many) suggested task interfaces?

juliohm commented 5 years ago

The T is meant to be a task without the data, as I suggested in the beginning of this thread. Basically the data lives outside in the E object.

fkiraly commented 5 years ago

So as per your latest suggestion, i.e., removing P for the start, is the following correct: the suggestion defaults to the set of "task" options we've already discussed, but with the recommendation made explicit that the data sits outside?

juliohm commented 5 years ago

I am not sure which task options are already implemented in MLJ.jl, but my suggestion is very simple: keep task and data in separate objects. That way a single task can handle multiple datasets, and a single dataset can be used for multiple tasks. Keeping these two together is sub-optimal.

fkiraly commented 5 years ago

Ok, I see what you are getting at - data yes/no is one of the most prominent questions we discussed internally.

Regarding your arguments, please find comments below. I would be interested to hear what you think, e.g., correct/maybe/wrong&why.

"use one data set with multiple tasks" -> this is not an argument for or against, since in both scenarios you have this easily: in the "task with data" design, all the tasks can just point to the same dataset (represented just once in memory or as reference). Since you have to specify the task in both cases, and the data it points to, none of the two designs seems to have an advantage in terms of usabiliy or semantic expressivity.

"use multiple data sets with the same task" -> that would be a strong argument for the data-free design, if it were to occur frequently. Internally, we couldn't imagine a use case where it would happen naturally. For example, in usual supervised learning, the task would encode which variables to predict from and which one to predict (features & targets). But there don't tend to be two datasets with the same column names!

fkiraly commented 5 years ago

But there don't tend to be two datasets with the same column names!

Actually, there are, come to think of it.

E.g., run the same intervention study in two different hospitals using the same protocol. Set up the same data processing & analytics pipeline at multiple customers.

Is that what you were thinking of? Perhaps we were thinking too much along the line of the UCI style datasets.

fkiraly commented 5 years ago

There we go, now I'm fighting with myself.

juliohm commented 5 years ago

I'm trying to understand the rationale for keeping data inside the task. Is there any argument in favor of this design? Data is often the most complicated part of any ML pipeline, especially if it is not just a table. You seem to ignore my comments that users may have data that do not fit MLJ.jl's assumptions, and that putting this same data inside the task has the potential to limit the framework's applicability because the data is being modeled too early. Why do I need data to express a regression task? Why do I need data to express a "play checkers" task? I really don't understand why we are discussing this option. I may want to learn checkers not from data but from other intelligent agents. I may want to learn checkers from an adversarial agent. Etc. The task is completely decoupled from the data.

fkiraly commented 5 years ago

You seem to ignore my comments

Hey, please don't be angry, @juliohm. The most frequent causes of not getting a point are: (i) the listener hasn't quite got it properly; (ii) the explainer hasn't explained it properly.

Cause (iii), that the listener is intentionally malicious and has some dark motives, is rare.

Anyway, there is a strong argument in favour of putting classes together, which is "avoidance of proliferation of classes/modules". One shouldn't split up things unnecessarily, since each unnecessary class/struct/module increases architecture complexity and reduces user experience.

Regarding "why do you need data": because in which case would you define a regression task without tying it to some data eventually? (this question is not entirely rhetoric: let me know if you think it isn't)

Similarly, in "play checkers" you would tie it to some agent interface. Even in advanced reinforcement learning scenarios with policy function fitting or value function approximation you need to specify what you fit the policy function to, so it's not decoupled from data/inputs, but data/puts just takes a different form there.