JuliaML / META

Discussions related to the future of Machine Learning in Julia

Best of breed #4

Open tbreloff opened 8 years ago

tbreloff commented 8 years ago

I'd like to maintain a "best of breed" list of the most mature/maintained packages to accomplish various tasks (decision trees, data handling, post-fit analysis, or whatever else). Ideally we could also produce a "workbench" type of package which re-exports lots of good packages, pylab-style. Someone could do:

Pkg.add("MLWorkbench")
using MLWorkbench

and have all the best vetted packages ready to go.
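A rough sketch of what such a workbench module might look like, using Reexport.jl to pass the re-exported names through; the packages listed below are placeholders for illustration, not the vetted list itself:

module MLWorkbench

using Reexport  # Reexport.jl's @reexport makes the used package's exports our own

@reexport using DecisionTree       # placeholder: trees / random forests
@reexport using Clustering         # placeholder: k-means and friends
@reexport using MultivariateStats  # placeholder: PCA, ridge, etc.

end # module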

ViralBShah commented 8 years ago

Great idea.

tbreloff commented 8 years ago

For anyone following this repo... what are the must-have machine learning packages that you use? What are you working on?

Going back to this "MLWorkbench" idea... it would be cool if the design were somewhat similar to Plots: a clean and easy-to-use interface with lots of functionality, which then calls down to machine learning "backends" of every type. Just as with Gadfly, PyPlot, etc., all of the major players (MXNet, TensorFlow, ScikitLearn, etc.) have overlapping functionality but different strengths and weaknesses.

In (my) ideal world, LearnBase would drive the core abstractions, and we'd generate "link code" to connect those abstractions to the backends. I'd like to see major packages wrapped in such a way that we can easily create a "transformation pipeline" (as in https://github.com/Evizero/LearnBase.jl/issues/12#issuecomment-191255676) using components from disparate packages. How cool would it be to easily create ensembles and pipelines using random forests from ScikitLearn, ANNs from MXNet, etc., all without needing to learn each individual package?
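As a toy, self-contained sketch of that pipeline idea (every name below is invented for illustration and is not LearnBase or Learn.jl API), a pipeline could be little more than an ordered list of transformations that the data is threaded through:

using Statistics

struct ToyPipeline
    stages::Vector{Function}
end

# Calling the pipeline threads the data through each stage in order.
(p::ToyPipeline)(X) = foldl((data, stage) -> stage(data), p.stages; init = X)

standardize(x) = (x .- mean(x)) ./ std(x)   # stand-in preprocessing step
toy_model(x)   = sum(abs2, x)               # stand-in "learner" producing a score

mypipe = ToyPipeline([standardize, toy_model])
mypipe(rand(100))

In the real thing each stage would come from a different backend package, wrapped behind the same LearnBase abstractions.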

Comments please!

ChrisRackauckas commented 7 years ago

I assume that this MLWorkbench idea has now changed names to Learn.jl.

I think what's key for it is just to have lots of functionality behind an easy interface for switching algorithms. I don't tend to have a favorite method/package or tweak defaults (at least at first). My machine learning problems in the end boil down to optimizing some function which usually has a high cost (for example, the error in an average output from a numerical SPDE solver, where the SPDE is defined by some parameters and those parameters are what I am learning. For reference, these SPDE solvers could take on the order of minutes to hours to evaluate, making each iteration take a while). Since machine learning "isn't what I do" and is just one black box in much larger projects, I want to get as good a result as possible with the least amount of effort (low effort is really the key here). The way I usually go about it is brute force: I try one algorithm for a little bit, look at the result, then another, look at the result, etc., and then stick with the one that's looking good.

So the most crucial feature to me is being able to "switch backends". It's like how I work with Plots.jl: for me plotting/graphics isn't "what I do", it's just a means to an end. So if PyPlot can't plot BigFloats that are smaller than 10^-40, I just switch to Plotly, see that it works, and go "that'll do".

I'd like to be able to do the same thing with machine learning: just write a script that will, on different nodes of a cluster, call almost exactly the same code but "switch backends" or flip a keyword to try out 20 different algorithms, and use whichever gave the best result (or ensemble the results). I hope something like that is possible with Learn.jl/MLWorkbench, where it gives an easy way to call things from ScikitLearn, TensorFlow, etc. Of course I won't know the details of any of these packages, nor do I want to take the time to. I just want to do "backend = scikitlearn(), learn()... that was bad. backend = tensorflow(), learn(), ... cool, that looks good" and move on (of course, that's not real syntax, but you get the point).
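A minimal, self-contained sketch of what that kind of backend switching could look like via multiple dispatch; every type and function below is invented for illustration and none of it is actual Learn.jl syntax:

# Invented backend marker types; real ones would wrap ScikitLearn.jl, TensorFlow, etc.
abstract type AbstractBackend end
struct ScikitLearnBackend <: AbstractBackend end
struct TensorFlowBackend  <: AbstractBackend end

# One generic front-end call; each backend supplies its own method behind it.
learn(::ScikitLearnBackend, X, y) = "would hand the problem to ScikitLearn here"
learn(::TensorFlowBackend,  X, y) = "would hand the problem to TensorFlow here"

X, y = rand(100, 4), rand(100)
for backend in (ScikitLearnBackend(), TensorFlowBackend())
    @show learn(backend, X, y)   # try each backend, keep whichever looks best
end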

Currently, the probability that I try an ML/optimization package is proportional to the number of possibly good algorithms it has, divided by the amount of time it would take to try it out.

tbreloff commented 7 years ago

@ChrisRackauckas this is very much my long-term goal. well said.

Evizero commented 7 years ago

I doubt we will be anywhere near this high level in the near future. Lots of basics to get right first.

I think there is a common fallacy here: a lot of people assume there are reasonable constant defaults for the hyperparameters of even simple algorithms, which I claim is untrue. Even something simple like lambda for a ridge regression depends heavily on what your data looks like and on whether you did some normalization. On the other hand, switching to a different implementation of the same algorithm is pretty much a no-op as far as the output is concerned, assuming that the defaults are the same, which I am sure they aren't; hence the different results. All in all, this sounds like something that could promote ill practice if done in a bad way.
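A small self-contained illustration of that point, assuming nothing beyond the closed-form ridge solution: the same lambda means something very different once the features are rescaled, so no constant default can be reasonable for all data.

using LinearAlgebra

ridge(X, y, lambda) = (X'X + lambda*I) \ (X'y)

X = randn(200, 3)
y = X * [1.0, -2.0, 0.5] + 0.1 * randn(200)

ridge(X, y, 1.0)           # lambda = 1 shrinks noticeably on unit-scale features
ridge(1000 .* X, y, 1.0)   # the same lambda is effectively no regularization here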

Evizero commented 7 years ago

I think we should rather focus on decent heuristics to derive default values from the data, instead of on switching backends. A metapackage like Plots.jl would make more sense to me as a separate high-level package.
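To sketch what "derive defaults from the data" could mean in the simplest case (the formula below is only an illustrative heuristic, not anything JuliaML has settled on):

using Statistics

# Scale the ridge penalty with the average feature variance, so the same
# default behaves comparably whether or not the user normalized the data.
default_lambda(X) = mean(var(X; dims = 1))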

Evizero commented 7 years ago

Re-reading the issue, maybe I am missing the point of what Tom intends with the workbench. My recent impression of the goal of Learn.jl was that it should re-export the JuliaML packages as a metapackage. This issue paints a different picture, which seems independent of other JuliaML efforts.

To be clear: to me, the recent activity in JuliaML clearly focuses on getting some low-level functionality right, which is my main interest at the moment. I can see the use of a Plots.jl-like package for ML in Julia, but that seems a bit unrelated to what we are currently doing.

tbreloff commented 7 years ago

I doubt we will be anywhere near this high level in the near future. Lots of basics to get right first.

agreed

assuming that the defaults are the same, which I am sure they aren't

This is an important point. The Plots-model is consistency in spec/defaults, so that you define your problem from a high level, default hyperparameter values are filled in consistently, and then this problem formulation is translated to a backend calculator.

My recent impression of the goal of Learn.jl was that it should re-export the JuliaML packages as a metapackage. ... I can see the use of a Plots.jl-like package for ML in Julia, but that seems a bit unrelated to what we are currently doing.

Again... you nailed it. My short-term focus is designing and building both a high-level interface and a low-level implementation that will support what I/we care about. So in the short term this means that Learn just imports our own implementations of the LearnBase abstractions. However, in the long term, when the high level is mature, I think we can add alternative "backends" which take a high-level problem formulation and pass it off to another existing framework (like TensorFlow). This is not a short-term goal, and is not my immediate focus, but it's in the back of my mind as a longer-term extension.

ChrisRackauckas commented 7 years ago

That's the thing: I don't really care about "ill practice" as much as results. I spent some time learning machine learning a while ago, and back then learning the different software and setting up my problem to work with them was interesting. Now I just want good parameters for an SPDE, or the rates that fit best in a drug interaction model, etc. Whatever gives the best results for the least me-time put in is what I prefer (note: this is very different from compute time. If I can throw the same code on a cluster and just switch one line to change the algorithm I'm using, that's perfect). I think a lot of people who use machine learning but aren't actively researching the topic are the same way.

That means Learn.jl should have the right tools so I can "be stupid" correctly: good heuristics for default parameters, an easy way to "jiggle" those defaults (i.e. run the same algorithm with the defaults randomly varied within some reasonable range and see what happens), and a good cross-validation readout to tell me how well a run did.
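A rough, self-contained sketch of that "jiggle the defaults" idea; cv_score below is a toy stand-in for whatever cross-validation readout Learn.jl would actually provide:

# Toy stand-in: pretend the cross-validation score peaks near lambda = 0.03.
cv_score(lambda) = -abs(log(lambda) - log(0.03)) + 0.1 * randn()

default_lambda = 0.1
candidates = [default_lambda * exp(0.5 * randn()) for _ in 1:20]   # jiggle on a log scale
scores     = cv_score.(candidates)
best       = candidates[argmax(scores)]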

The long-term goal should be to have as many backends as possible: TensorFlow and all of that jazz that I hear about. But in the short term, being able to quickly access what JuliaML/JuliaOpt has (/will have) would be a worthy goal.

ahwillia commented 7 years ago

Didn't have time to carefully read everyone's comments so hopefully I'm not reiterating.

I just want to do "backend = scikitlearn(), learn()... that was bad. backend = tensorflow(), learn(), ... cool, that looks good" and move on.

I like this idea, but I think in practice it will be tricky. Unlike in plotting, where different backends have more or less the same goals, different machine learning backends (e.g. scikitlearn and tensorflow) are specialized for very different problems -- you can't fit a recurrent neural net with scikitlearn.

That being said, I could see the backend idea working well for MCMC backends, e.g. Klara.jl (formerly Lora) vs Stan.jl. You could also switch between TensorFlow, Theano, and other deep learning frameworks -- see https://keras.io/ for a Python library that does this.

It would be cool to unify backends for statistical modeling (scikitlearn), probabilistic programming (Stan, Klara), and deep learning all in the same ecosystem. But it will take a lot of thought and work.

But in the short term, being able to quickly access what JuliaML/JuliaOpt has (/will have) would be a worthy goal.

Yes, let's focus on this for now.