Summarize scope of problem

JuliaML / LossFunctions.jl

Julia package of loss functions for machine learning.

https://juliaml.github.io/LossFunctions.jl/stable

Other

148 stars 34 forks source link

Summarize scope of problem #2

Closed tbreloff closed 8 years ago

tbreloff commented 8 years ago

I like that you've kicked this off with real code, but I hope that you're still open to re-thinking everything from the ground up. This may involve changing or replacing code that you already have, but of course there will be a good reason for any changes.

With that said, I think the first important step is to define the scope of the problem as best we can:

What problems/algorithms will we support/interface? Deep neural nets? Convolutions? Random Forests? Data preprocessing? Data cleaning? Model analysis and plotting? We want to make this list as complete as possible so that it's easier to see where these components fit into the next items.
What verbs do we need? Which of those can extend StatsBase, and which need to be new? What exactly does the verb mean? What assumptions can we make when a verb is implemented for a particular model?
What are some examples of how a user would interact with these algorithms/models? How do they relate to each other?

Like I said in the roadmap discussion, I think defining a type hierarchy at the beginning is just asking for failure. Some things will require types, sure, but I do feel like premature-typing tends to ruin some of Julia's strengths. This is not an object oriented language, and building an object oriented framework will limit the power of LearnBase.

Lets compile a full scope of the problem first, and then we can start to discuss design specifics after.

Evizero commented 8 years ago

Before addressing the bullet points in a separate post let me give a general response

The code that I have so far is motivated by what I am working on currently. So it deals with things that I need to have addressed such as target encoding. That being said, I am absolutely open to refactor this or even throw it out completely.

I really want it to be useful for as much of machine learning as possible (this includes reinforcement learning and evolutionary computation). I don't think we should provide data cleaning in this package itself, but at least provide some guidelines for it. So to be explicit I don't think this should provide a convenient front end for all kind of machine learning. I think the goal should be to provide abstractions and design principles and naming conventions for functions (currently you can find everthing fromfit vs train vs optimize vs solve in use)

Concerning StatsBase, I think we should reuse as much as we can and generally follow its principles. That would however most likely result in StatsBase having to provide a higher abstraction than StatisticalModel.

Concerning the type hierarchy. I think you are right to an extend. I also think that it should be simple and intuitive for subpackages to provide a predict function etc for their models without having to deal with data encoding themselves. Also, not every learning model need numeric features, so some can work with dataframes, so the StatsBase/DataFrame ModelMatrix approach is not generic enough.

I like the idea of this cooperation because we both are in the middle of real life projects that would depend on this. That should keep us grounded and focused on real issues.

Evizero commented 8 years ago

To me sandbox-like research is a big goal. So I want to use the fact that Julia has no two language problem and really be able to dig into lowlevel stuff. So for example fitting an SVM can be done using various solvers, and it should be easy for a Researcher to come along and create a new solver without modifying the original package. This is pretty easy with multiple dispatch if the design is adequate.

Concerning the scope of the problem. I think we should at least be able to provide guidelines for common supervised and unsupervised machine learning approaches. Also I think it would make sense to treat non-trivial preprocessing as unsupervised learning and not something super special case that needs special treatment. One could always provide convenience functions additionally. Here is some low level code that shows what I mean:

# Example for unsupervised preprocessing
csfit = fit!(CenterScale(), Xtrain)
predict!(csfit, Xtrain) # "!" makes it inplace 

myfit = fit(MyModel(param = .9), Xtrain, ytrain, maxiter=10)
yhat = predict(myfit, predict(csfit, Xtest))

fit!(myfit, MyModel(param = .9), Xtrain, ytrain, maxiter=10) # train 10 more iterations on same data

I would also like to settle on some generic issue. This is one that I have been thinking of for a while. It is the question of what an AbstractLearner should conceptionally represent.

One approach is that the MyModel itself is the specification of the model's hyper parameters, whereas fit returns a MyModelFit that just contains the solution. I personally think this is the cleanest approach (that I can come up with). It allows for all kinds of things
Another approach is to make the hyper parameters as parameters of the fit function and define the model being fit using ::Type{MyModel} as first parameter. This approach sucks for online learning I think
Yet again another approach (which for example scikit-learn uses) is to make the model mutable and be both, container for the hyper parameters as well as solution. This is the one i currently use for SupervisedLearning.jl because it allows for front end conveniences. But I don't think that it makes too much sense for back-end stuff, where scaling up and out should be simple using only standard language constructs

tbreloff commented 8 years ago

There are some good thoughts here, particularly in thinking about mundane transforms (centering the input data, for example) as a form of unsupervised learning. The thing I like is the concept of chaining one transform into the next in a pipeline fashion generically. What I don't like is the labeling. We should try hard to avoid picking one abstraction and giving it a label. I prefer to have models/transformed labeled implicitly only by which verbs they implement. (In summary, it's not an UnsupervisedLearner, its just a "Learner that is able to transform data"). The difference is subtle but important.

Christof... Please look through the code of OnlineStats and OnlineAI. There are some organizational concepts that I think could be generalized and cleaned up, and which could be a model for how to approach design. Specifically, concepts like loss functions, penalties, link functions, and solvers can be built up in a modular fashion and then chained together in a pipeline (see the stream macro). In many ways a deep neural net is no different than a logistic regression if you break the problem apart cleanly.

On Oct 11, 2015, at 1:12 PM, Christof Stocker notifications@github.com wrote:

To me sandbox-like research is a big goal. So I want to use the fact that Julia has no two language problem and really be able to dig into lowlevel stuff. So for example fitting and SVM can be done using various solvers, and it should be easy for a Researcher to come along and create a new solver without modifying the original package. This is pretty easy with multiple dispatch if the design is adequate.

Concerning the scope of the problem. I think we should at least be able to provide guidelines for common supervised and unsupervised machine learning approaches. Also I think it would make sense to treat non-trivial preprocessing as unsupervised learning and not something super special case that needs special treatment. One could always provide convenience functions additionally. Here is some low level code that shows what I mean:

Example for unsupervised preprocessing

csfit = fit!(CenterScale(), Xtrain) predict!(csfit, Xtrain) # "!" makes it inplace

myfit = fit(MyModel(param = .9), Xtrain, ytrain, maxiter=10) yhat = predict(myfit, predict(csfit, Xtest))

fit!(myfit, MyModel(param = .9), Xtrain, ytrain, maxiter=10) # train 10 more iterations on same data I would also like to settle on some generic issue. This is one that I have been thinking of a while, which is the question of what an AbstractLearner should conceptionally represent.

One approach is that the MyModel itself is the specification of the models hyper parameters, whereas solve or fit returns a MyModelFit that just contains the solution. I personally think this is the cleanest approach (that I can come up with). It allows for all kinds of things

Another approach is to make the hyper parameters as parameters of the fit function and define the model being fit using ::Type{MyModel} as first parameter. This approach sucks for online learning I think

Yet again another approach (which for example scikit-learn uses) is to make the model mutable and be both, container for the hyper parameters as well as solution. This is the one i currently use for SupervisedLearning.jl because it allows for front end conveniences. But I don't think that it makes too much sense for back-end stuff, where scaling up and out should be simple using only standard language constructs

— Reply to this email directly or view it on GitHub.

Evizero commented 8 years ago

What I don't like is the labeling. We should try hard to avoid picking one abstraction and giving it a label. I prefer to have models/transformed labeled implicitly only by which verbs they implement.

I can see that. This does make sense to me. My primary concern are the function names and common function signatures anyway; I can let go of the abstract types idea. I've done something like that in SupervisedLearning.jl

Ok, I'll take a closer look at OnlineStats and OnlineAI. I am sure that I can learn something from it.

Concerning Loss functions etc: sure, if it fits into the empirical risk minimization framework I am absolutely in favour of using that formulation (which I also do in KSVM.jl for SVMs). But then again not everything fits into that framework either

I like the idea of pipelines and we should absolutely consider them, but I think the more basic issues should be addressed first

tbreloff commented 8 years ago

So to be explicit I don't think this should provide a convenient front end for all kind of machine learning.

Have to disagree. If this isn't the goal, then I don't think we're on the same page. I want to have a better and more complete scikit-learn which expands into deep learning and other complex techniques. If the only goal is to define some method names and maybe some abstract types and very basic processing, then I think it's not worth the effort. That approach requires the community to commit to abstractions purely on faith. What we need is the complete front end and link code. Build the framework that everyone wants to use, let it become the new standard naturally, and then work to merge disparate codebases into cohesive modules. (If it does not naturally become the standard, then there's probably a good reason for that, and the community may be better off to continue as-is)

I don't think we should provide data cleaning in this package itself, but at least provide some guidelines for it.

Same as above. There might be some advanced data cleaning techniques that we can forego, but we should provide the basics, even if that means re-exporting and/or wrapping another package (MLBase, etc)

I like the idea of pipelines and we should absolutely consider them, but I think the more basic issues should be addressed first

Again, we don't need a working implementation of anything on day one, but we have to know how this sort of thing fits into a design... otherwise it's too risky to ignore.

My perspective boils down to the ideas:

Give the user easy access to algorithms with smart abstractions (API and link code)
Provide access to core tools (input data preprocessing, model evaluation, etc)
Devise a framework to allow external extensions of missing/advanced functionality (similar to https://github.com/tbreloff/Plots.jl/issues/42)

If you're missing any of those pieces, then you might as well have none. Everyone agrees that this is a hard problem to get right, but as my dad used to say... "no guts, no glory"

Evizero commented 8 years ago

I do think we have the same goal and I am sure we'll understand each other better if we just continue to work and brainstorm on this. So bottom line, I agree with your overall vision (that is, if I understand you correctly). Concerning this package: I think laying out the full scope first is a great idea and we should do that. Concerning implementation: I do think though that we should start simple and iterate on it. i.e.: Once we settle on the verbs I would like my next step to be to try and make KSVM.jl follow that specification and see if any problems or ideas come up. In my experience this kind of practical approach raises questions that no one thought about asking in the first place

To address your post with a little more detail:

A well designed Julia framework in my eyes is spread out into seemingly independent but interoperable packages. What I don't think to be a good idea is to put all kinds of learning algorithms into one Blob-package.

You keep saying scikit learn, and I guess that confused me. I am now assuming you are just loosely referring to it to indicate the scope of functionality that you are after, and not the design / interface. Because I don't think any existing ML framework design from another language is a good choice for Julia. Are we on the same page on that? Let me justify that a little:

Julia in my eyes does not suffer from the two language problem. This opens up new ways to think about visible low-level functionality. As a small example: for SVMs this makes it very feasible to provide a callback function for debugging/didactic purposes or even to use for early stopping.
I don't think it is convincing enough to "just" provide a scikit learn clone in Julia. It would not make use of Julia's power and it would not provide sufficiently different things from python to make anyone come over. What is convincing to me (and some colleagues that I talked about this) is if the Julia ML stuff is very usable for teaching/learning about ML as well as Research in ML as well as application of ML in data science. I think an abstract simple enduser-interface that mimics scikit-learn or caret should be build on top of that. Example: I think on a low level a logistic regression should not really exist as an entity but simply be an empirical risk model with a linear or affine prediction function with some regularzer and a logistic loss. This makes the whole thing much more powerful and extensible. While in an enduser interface (or let's call it data-scientist interface) ala SupervisedLearning.jl you can just say Classifier.LogisticRegression() for convenience.

datnamer commented 8 years ago

Looks like this is shaping up to be a great package, I'm excited to see where it goes ( and help). I also have a couple of thoughts:

Expressing models: R's success is in its ease of use and expressiveness as a DSL front end for expressing data and statistical transforms. Part of this is Dplyr's idea of a "grammar" of verbs at the right level of abstraction, that can be chained together in a pipeline. The other is the formula syntax for expressing statistical models. Is there scope or maintenance capability to create a model expression DSL that evolves on the y~x type of syntax? Something like a probabilistic programming language wherein models can be swapped in and out. No less than Hadley Wickham himself has mentioned the need for such a "grammar of modeling".

I guess part of this would be separating model from the rest of the machinery.
Flexibility vs speed vs inference tradeoff: Traditional ML is scalable but MCMC bayesian inference gives us more insight into the model. TMLE aims for a sweet spot. Would be great to have this in the package: MIT licensed code here: https://lendle.github.io/TargetedLearning.jl/

Blog post here: http://blog.revolutionanalytics.com/2015/03/
Linq like DSL for data transforms: This is probably REALLY out of scope, but my dream would be a data manipulation grammar that is backend agnostic, expressive and can be used in the pipeline. Then models can be defined separate from backend and machinery can be changed out depending on whether we want to fit on a dataframe, stream from a database or something distributed.

https://github.com/JuliaStats/DataFramesMeta.jl https://github.com/blaze/blaze http://multithreaded.stitchfix.com/blog/2015/03/17/grammar-of-data-science/ https://github.com/cloudera/ibis

I know this is alot of stuff...but like @tbreloff said, lets "go big or go home" ;) And apologies if this is rehashing previous discussion.

Edit:

Also Traits might be coming to Julia soon. This will probably have a big impact on the design of packages like LearnBase...might be good to take that into account.

tbreloff commented 8 years ago

"grammar" of verbs at the right level of abstraction, that can be chained together in a pipeline

This is very important to me.

Is there scope or maintenance capability to create a model expression DSL that evolves on the y~x type of syntax?

Check out the @stream macro in OnlineStats. It's my first attempt at this sort of "model pipeline language". I'd like to see something better and more feature-full eventually.

This is probably REALLY out of scope, but my dream would be a data manipulation grammar that is backend agnostic, expressive and can be used in the pipeline.

Not out of scope. I can't imagine not having some version of this available...

Also Traits might be coming to Julia soon. This will probably have a big impact on the design of packages like LearnBase

Yes you are absolutely right, and it's something I'm thinking about.

Glad to have you in the discussion @datnamer and thanks for the links.

datnamer commented 8 years ago

Sounds like we are on the same page!

Evizero commented 8 years ago

I am all for ML pipelines, but we don't want to breach the realm of something like dplyr (dataframe manipulation), do we? This seems like something that should be separate of an ML framework.

But if I understand you correctly you just meant dplyr as a metaphor for something like that in the ML context, right? If so, sure!

We should sketch out some pseudo code for it, though. I looked at the stream macro of OnlineStats and I think it looks pretty cool. To me a ML pipeline would deal with things like

encoding some data,
applying PCA,
training a couple different models in parallel,
plotting their ROC curves.

I would already be happy if this would just be a few lines of very readable code and the parallelism would only require standard language constructs. Julia is already a beautiful and concise language, so we shouldn't use a DLS by default for the low-level interface.

But as @tbreloff said before, if we settle on the verbs/design, it should be easy do provide convenience macros to chain operations ala OnlineStats stream macro

Evizero commented 8 years ago

I cleaned up the code a bit. I removed the noise for now.

Essentially the package now contains utilities to deal with encoding target vectors. I also included a decorator (similar to what DataFrames do) that can be used for classifier to automatically deal with class encodings.

For example: if you write a RegressionModel that only works for y in {-1,1} then all you need to do is derive from EncodedRegressionModel{SignedClassEncoding} instead. This will box the model into a decorator that overloads fit and predict accordingly and allows the model to accept target vectors of other types such as strings while the model itself only deals with -1 and 1.

thoughts?

datnamer commented 8 years ago

@tbreloff Do you plan to work on the data manipulation infrastructure as well, separately or part of this effort? I mean things like abstract nullable tables, nullable sqlite backed dataframe etc

Many little pieces of next gen libs have been popping up (like faster streaming csv parser etc), built still think the data manipulation infrastructure needs a big push and unification. Dataframes is still on the old data array for example.

I think that needs the most work and might take the most work.

Evizero commented 8 years ago

outsourced