Machine Learning Roadmap

lindahua commented 10 years ago

Currently, the development of machine learning tools are in several different packages without little coordination. Consequently, some efforts are repetitive, while some important aspects remain lacking.

Hopefully, we may coordinate our efforts through this issue. Below, I try to outline a tentative roadmap:

Generalized Linear Models
- Linear Regression
- Logistic Regression
- Lasso, Elastic Net, and its variants
- Stochastic Gradient Descent
Current efforts: GLMNet, GLM, Regression
Support Vector Machines

Current efforts: SVM, LIBSVM
DimensionalityReduction
- PCA
- ICA
- CCA
- Linear Discriminant Analysis
- Kernel-based methods
Current efforts: DimensionalityReduction
Non-negative Matrix Factorization

This may be categorized into dimensionality reduction. However, NNMF in itself has a plethora of methodologies, and thus deserves a separate package.
Classification

There are many techniques for classification. It may be useful to have multiple packages respective techniques (e.g. GLM, SVM, kNN), and have a meta-package Classification.jl to incorporate them all.
Clustering

Current efforts: Clustering.jl
Many machine learning applications also require some supporting functionalities, such as performance evaluation, data preprocessing, etc. These can all go into MLBase
Probabilistic Modeling (e.g. Bayesian Network, Markov Random Field, etc)

This is a huge field in itself, and may be discussed separately.

cc: @johnmyleswhite @dmbates @simonster @ViralBShah

Edit

I created an NMF.jl package, which is dedicated to non-negative matrix factorization.

Also, a detailed plan for DimensionalityReduction is outlined here.

lindahua commented 9 years ago

@Rory-Finnegan Classical machine learning techniques are still very important and widely used in many situations. What I was saying is that the focus of machine learning application has shifted gradually towards systems that integrate multiple components, and deep models are a case in point.

Hence, a sustainable strategy going forward should be to develop components such that they can nicely interact with each other through carefully crafted interfaces. As far as I can see, graphical models and neural networks are two most popular frameworks that allow one to put together a number of different components in a way that is mathematically valid.

amueller commented 9 years ago

@Rory-Finnegan The combined models in neural nets are usually trained in their entirety using backpropagation. The scikit-learn pipeline does not make any assumptions about differentiability, and is not able to backpropagate gradients. So I would see that as the major difference. I agree about your point about machine learning outside of deep learning. I am pretty sure it will not and should not should go away anytime soon.

SimonAB commented 9 years ago

@amueller as someone starting to apply ML in the biomedical field I must point out the growing importance of inference in supervised learning (say, ensemble methods with decision trees for deep sequencing)... so I would disagree with the statement that ML outside of deep learning will or should go away, unless deep learning preserves full visibility and interpretability of feature importances. My anecdotal evidence suggests there is much excitement and democratisation of ML approaches beyond the core data-science crowd. I see this as an important strategic opportunity for Julia.

And with that in mind, I fully agree with @lindahua that it is extremely desirable to build in the ability to integrate multiple ML components from the outset, if nothing else to 'future proof' Julia's ML ecosystem.

amueller commented 9 years ago

My statement was missing a "not" whoops. I totally agree with you.

ViralBShah commented 9 years ago

It does make sense to think through the APIs and composability. There are plenty of implementations in R and python that we can wrap to start with, eventually replacing them with Julia implementations. Has anyone used @svs14 's Orchestra, which seems to have wrappers for scikits-learn and caret?

https://github.com/svs14/Orchestra.jl

svs14 commented 9 years ago

I have used Orchestra ;)

I have not touched this for 3 months along with everything open source though due to some legal ambiguities in my previous internship, will work on it again from next week - so it's not dead.

For my purposes, I'm not concerned whether a machine learner is developed in a specific language, as long as I can compare + compose it together in a larger system I'm happy. If I truly need performance then I can gracefully degrade (in terms of effort of creating something from scratch compared to using a library) to write the respective learner in Julia.

Speaking along these lines, if it's not already considered, I would argue that pre-processing transformers such as near zero variance filtering found in caret can be pretty handy if not equally important to the learner in use in terms of a composable API. It's also easier this way if you enjoy applying grid-search to the pipeline itself, including omission of pre-processors.

Hope this adds to the discussion.

ViralBShah commented 9 years ago

Would it make sense to separate out Orchestra's wrappers around Python and R libraries into separate packages? I suspect they may receive more attention and can become common components in other projects.

rofinn commented 9 years ago

+1 I completely missed Orchestra when I was scanning through existing packages :( On the same line of thinking would it also make sense to see if existing julia packages (like DecisionTrees.jl, SVM.jl, etc) could be updated to support Orchestra's API? @svs14 I'd be happy to help you work on this if you're interested.

ViralBShah commented 9 years ago

In fact it was @aviks who showed me this package a few days back.

svs14 commented 9 years ago

As a heads up, extracting Orchestra's scikit-learn + caret wrappers to packages that cover the full spectrum of each may require a design overhaul - right now the wrappers only target classifiers. Also the solutions I used were very specific to constraints I had at the time, I suspect there are better ways to wrap both libraries now (especially caret, I went through PyCall.jl to rpy2 as there was no direct functioning route back then). Orchestra's wrappers may be beneficial as some inspiration, to a wrapper library built from the ground up.

Orchestra's API is very unstable and immediately suited for only a limited sub-domain of machine learning. For instance I'm currently investigating/developing on handling the spectrum of learning settings including semi-supervised, transfer and multi-task learning which will not be backwards-compatible. As such, it'd probably be best not to have other packages depend on it.

I think there are a number of well thought out machine learning APIs targeting different priorities as evidenced in the excellent discussions within the Julia community, and through Python's scikit-learn, R's caret, Java's weka, and Go's golearn (I'd love to have wrapped all of these as it gives me their learners for free lol!). As long as the API is unstable, it'd probably be best for the API designer to consider inversion of control and build wrappers for each package, instead of having each package developer responsible for adhering to it. IMO, this makes it a lot easier to change the API at will without buy-in from dependents, along with not placing any unstable standards' effort on package developers.

@Rory-Finnegan , would be great to work with you! I really like your ideas in Learn.jl and this discussion, except I don't have your email/contact-details, you can ping me at svs14.41svs@gmail.com if you want.

IainNZ commented 9 years ago

Looking at MathProgBase.jl's design might be of interest too - its essentially a general interface for constrained optimization problems, with 11 registered packages using it and other experimental packages also using it as a way to plug in to the infrastructure. Its composable too: i.e. I can make a pseudo-solver that takes inputs via the interface, and then solves a series of subproblems through the interface.

datnamer commented 9 years ago

+1 on consideration for inference aside from deep learning- Important for growing computational political/social science field.

wildart commented 9 years ago

I do not think that there is a big problem in designing a comprehensive ML framework. It is mainly a lack of commitment from the Julia ML community, including myself, in pushing forward with an initial design. I believe ML framework would evolve along with the language, and whether it will look as Learn.jl or MathProgBase.jl or scikit-learn, it is really circumstantial. If we are going to wait for interfaces or traits implementation in Julia which would certainly enforce particular standard on a ML packages, it will only postpone development of a general ML framework. I believe that providing common thin type hierarchy, in a manner of Learn.jl or StatsBase.jl, is enough to start development of various libraries for particular implementation of ML algorithms (even multiple). After all, a correct implementation of an ML algorithm is a pure scientific endeavor. And there should be some packages with engineering thought behind, like MathProgBase or Orchestra, which would have wrapping and data pipelining (including utilities and supporting functionality) for implemented ML algorithms without any particular preference. I value such packages more that any state-of-the-art learning algorithm, because they provide more benefits for a larger community. Let's push for some initial draft of a common ML interface that everybody will start to adopt. I like Learn.jl interface as an umbrella interface. It is based on a well known separation of ML algorithms, which could be gradually extended in particular implementations of ML algorithms, and in their turn integrated into the umbrella interface if necessary.

ViralBShah commented 9 years ago

I don't know that this will necessarily help, but there is the possibility of a position at MIT to push forward on this, if someone were interested to take it up full time. There are also some funds at NumFocus for Julia development, and this would qualify - but that would be for a much shorter duration. Perhaps someone who is focussing on this exclusively can be an anchor around which everyone can contribute.

ViralBShah commented 9 years ago

I feel like JuliaOpt has taken this approach of nailing down the APIs and building a flexible composable infrastructure. Of course, we had @mlubin and @IainNZ who anchored that work and many others joined. We need the same here.

SimonAB commented 9 years ago

@ViralBShah I agree. A full time position on this at MIT seems ideal given the importance of the field... among many pluses, this should also ensure face-to-face interactions with Julia core developers when low level changes would benefit other computationally demanding fields (e.g. BioJulia @dcjones)

Sisyphuss commented 9 years ago

As a PhD student on statistics and machine learning, I'll keep a close eye on this issue, and am willing to contribute to it.

By the way, I think what we need is a Grammar of Machine Learning.

rofinn commented 9 years ago

I look forward to someone being able to work full time on this, since I can only spare a few hours a week. In the mean time, I'm working with @svs14 on Orchestra.jl and maybe merging the common structure into Learn.jl. After we have the ensemble stuff refactored and working, I'll talk to the Mocha.jl folks about how deep learning should work as they seem to have a pretty popular approach.

datnamer commented 9 years ago

Would this project include frequentist and bayesian inferential models as part of the hierarchy? Perhaps @dmbates , @scidom and maybe @Fonnesbeck can chime in.

rofinn commented 9 years ago

@datnamer I'm inclined to suggest that frequentist and bayesian inferential models might make more sense as part of the StatsBase.jl package. I may include some wrapping functionality so that arbitrary models could be used as well, so long as they support the approriate methods.

papamarkou commented 9 years ago

Hi @datname, sorry for the slow reply, have been abroad over the last two weeks. Not sure where we should get with this in the long run - there is already a placeholder in PGM.jl. I think for now it is better to let the inferential modelling frameworks mature at their own independent pace given that they are at an infancy level. We can discuss merging efforts in the future, on the basis of a broader and richer codebase.

datnamer commented 9 years ago

@Rory-Finnegan and @scidom - Makes sense

amueller commented 9 years ago

If you want to design a common interface, I think it is important to define the scope. I probably makes sense to leave graphical models, structured prediction and probabilistic programming out of scope. But there are many other cases apart from classification and regression. I'm not sure it makes sense to try to define a very strict interface, and starting with the actual algorithms as @wildart proposed might be more fruitful.

Just some API cases to consider:

dataframes vs matrices as inputs
multi-label, multi-output and multi-task prediction
algorithm evaluation in grid-search, interfaces for metrics (allow out-of bag scoring, allow ranking metrics for both classification and regression, ....)
semisupervised learning (how do you identify unlabeled points, now does cross-validation handle these)
missing value handling
categorical variable handling (hard with matrices, easier with dataframes)
recommendation system interfaces (the interface is quite different from classification / regression)
online and active learning interfaces
reinforcement learning (in scope?)
building pipelines (what are the interfaces here? Can we subsample data?)
Regularization path algorithms and their interaction with cross-validation and grid-searches
Last but not least: pipelines for online learning

This is part of the laundry list of API choices as well as unsolved / punted issues in scikit-learn ;)

lindahua commented 9 years ago

I think @ViralBShah's idea is important. Technically, there can be many approaches to make this successful. The real problem is that we lack a person that can dedicate to this and drive the progress for long enough.

svaksha commented 9 years ago

I would like to contribute too but the sheer number of packages scares me. It would be really nice if there was an experienced mentor willing to lead and guide the effort.

rofinn commented 9 years ago

@amueller I agree that it would help to define the scope of the API, before starting and that it would help to work with particular implementations in mind. However, I don't think all of those points need to be addressed immediately.

rofinn commented 9 years ago

Also, this thread is getting a little long now so I've opened a chat on Learn.jl for folks who are interested.

amueller commented 9 years ago

Yeah I agree that not all points need to be addressed immediately and maybe for the moment it is more important to actually get something going.

RaviMohan commented 9 years ago

Shouldn't Reinforcement Learning be part of the ML roadmap? (sorry if I'm missing something obvious Julia newbie here). Do game playing algorithms (like those based on MCTS) fit into "machine learning"?

I know a research team applying RL to gameplaying who'd appreciate a solid Julia RL library that can deal with large datasets (with SMDPs ,MAXQ etc and some other bits and pieces like a distributed MCTS implementation) ,and am thinking of building something for them (and learn Julia into the bargain), but if some one's already working on it, perhaps as part of this roadmap, then I can probably fork/contribute to that rather than start from scratch.

rinuboney commented 9 years ago

Hi all,

I'm interested in contributing to achieving this roadmap as part of JSoC 2015. I was wondering if anybody is willing to mentor me on this project. The deadline is June 1st. That's too soon and a quick response would be great. If anybody has the time and are willing to do the same then please contact me(rinuboney@gmail.com) asap. I know Julia and machine learning. I can do this.

ViralBShah commented 9 years ago

@rinuboney It would help to put together a concrete proposal. There is a lot of discussion here, and it would be great to take a chunk of this as a JSOC project. If you can put something together based on the discussion here - what you will work on in the next 3 months, it will be easier to find a mentor.

rinuboney commented 9 years ago

I'm working on the proposal. It is accessible here: https://docs.google.com/document/d/1UBhEOqU1MMsjxfItDrg_XRViFgkZVSSvYACis6hT37o/pub

ViralBShah commented 9 years ago

Ok. Please mail it to juliasoc@googlegroups.com when it is ready.

jiahao commented 9 years ago

@rinuboney Thanks for your interest.

Your proposal is currently quite vague. It would be better if you could identify specific examples of ML techniques that you would want to run and show that the code is duplicated or redundant or has a complicated dependency stack. Let's say I'm interested in random forests - are there multiple current implementations? are the implementations too hard to use? or maybe not general purpose?

rinuboney commented 9 years ago

@jiahao Thanks for the feedback.

There are multiple implementations for various ML models. Eg : RandomForests - https://github.com/bicycle1885/RandomForests.jl, https://github.com/bensadeghi/DecisionTree.jl, GLM - GLMNet, GLM, Regression etc. I believe that it's a good thing to have multiple implementations but the problem arises when one wants to try out different models. Machine learning is about experimentation. When faced with a classification or regression or clustering problem, the user should try out different models and use the model that gives the best performance. When the models are implemented in separate packages, they operate on different data types and have different API. So the code has to be rewritten to try out each model. This is where a base library can help by providing interoperability between different implementations through an API. With an API, the user can switch between models and algorithms instantaneously. The same problems are present when a user wishes to stack models from different packages.

My proposal is more of coordination between the different machine learning packages scattered around so that they are can be used easily in a scikit-learn fashion. I thought that this is what the whole discussion was about.

I understand I should improve my proposal and I'm on it :+1:

IainNZ commented 9 years ago

@rinuboney make sure you look at https://github.com/Rory-Finnegan/Learn.jl

rinuboney commented 9 years ago

@IainNZ yeah my proposal is directly based on Learn.jl and this discussion.

RaviMohan commented 9 years ago

My (somewhat delayed) 2 cents. Rather than "unifying interfaces" etc, I'd rather have implementation of non existing functionality, which (imo) adds more value. A comprehensive test suite for package X would provide much better value imo.

The ML ecosystem is quite young in Julia, and providing an API to rapidly evolving libraries is a bit premature .In any case, the above list is hardly a real 'roadmap' ,it is just a (very comprehensive) list of ML topics - A 'roadmap' needs to have a sequencing of tasks and at least rough dates of completion to be meaningful.

What we need at this stage (again just my 2 cents feel free to ignore) are solid, tested, scalable libraries with compelling usecases that will get more people actively using Julia on real world ML projects. Interfaces can be extracted when mature libraries are aplenty and are really not very valuable till then The arguments for interfaces/coordination etc aren't very convincing (to me, YMMV)

All that said, if one really really really wants to work on interfaces, the first thing to do would be to build a compelling concrete case . Write an interface to a specific (and limited) set of libraries X,Y,Z... so we can do specific tasks A, B, C with 2 lines of code vs 20 (or whatever).

rinuboney commented 9 years ago

@RaviMohan Thanks for your feedback. I'm thinking more along this line. Consider an API is designed for the Julia ML packages. Then the various existing implementations in different packages could be unified. Existing libraries like scikit-learn, weka etc could also be wrapped in the API. Then the whole set of packages supporting the API can conceptually be used similar to a machine library like scikit-learn. I'll try to list some advantages to this approach:

The packages supporting the API can be maintained separately.
It takes less time than developing a single sold, tested and scalable ML library from scratch.
It provides a higher level of flexibility. Somebody can independently implement a new algorithm and comply with the API. Automatically, the new implementation is part of the Julia ML ecosystem which means it can be tested easily and added to existing projects without any hassle.
mix and match implementations in different packages.

Then once the API is designed, I think the community should focus on solid, tested and scalable implementations in Julia. I believe this part can be done faster in a decentralized manner unified by the API as opposed to a centralized single library approach. If the community takes the API road, then in the end, the community will have a plethora of packages accessible through a unified API. If it's the solid, tested and scalable library road then community will have a good ML library just like all the other languages. I don't believe in a single perfect library for any purpose. trade-offs have to be made in all cases and a unified API and separate packages allows you to change the trade-offs immediately.

RaviMohan commented 9 years ago

This whole idea of designing APIs first and then the implementations/wrappers rarely works in practice in ML (I'm cynical from decades of experience in navigating APIs designed before practical experience from actual library implementation)

(imo) people who don't do the implementation don't (generally) design good APIs and the above argument is too theoretical and lacking a real world perspective. Nobody, in any ecosystem has ever come up with one API that could wrap totally different libraries like scikit and weka and have it be useful in the real world. If you pull it off you'll be the pioneer.

That said, don't let me discourage you. If you think you can design an API for wildly different packages and/or packages not yet made, go for it. I'm skeptical about success,but don't let that affect your enthusiasm. Follow your vision.

rinuboney commented 9 years ago

Although the implementations are wildly different, they have the same functions and can be used with the same API. Not exactly the same but, In Clojure(programming language) there is a library called core.matrix. It is an API for working with matrices. Different implementations in native code,Java and Clojure support the API. Switching the implementation is trivial. I think it's possible to do something similar for ML packages. I know it's not possible to wrap up completely different libraries but a good number of them can be. Eg: scikit-learn and Go Learn have a similar API.

RaviMohan commented 9 years ago

"Not exactly the same"? core.matrix has nothing to do with a generic ML API. Manylanguages have interfaces of some kind or the other.

as you well know core.matrix is a completely different beast from an ML API. You can design wrappers for datastructures, with different level of abstraction and tradeoffs. We know this from the 70s!! By this logic every Java interface in existence is evidence that a generic ML API can be designed?

If you have a real world example of common APIs for massively different ML libraries (scikit and weka as per the OP) , I'm all ears. Else this is a case of "should work in theory but never has in practice" (imo).

I had a very talented friend try very hard for years to write a generic wrapper just for Reinforcement Learning libraries and in the end it was an excessively generic mess no one wanted to use. However that doesn't mean someone else might not succeed tomorrow. As I said above, if you think you can do it, go for it, and more power to you for trying.

To repeat, if someone thinks wrappers can be built that unify real world ML packages and "switch implementations trivially" ,that's great.

I am very skeptical about this actually working, having worked on real world ML projects for years, but that shouldn't affect anyone's enthusiasm for the idea. I just don't think it is a workable idea is all. I'll be glad to be proved wrong.

I hope you succeed. Cheers.

rinuboney commented 9 years ago

Well I'm a student and I'm still learning what's possible and not possible. I just happen to like the idea and I'm willing to work on it.

RaviMohan commented 9 years ago

Good for you.

We do need people to attempt the "impossible". That is how the world moves forward. You don't need anyone's approval to do what you want to do.

Go for it. Good Luck.

datnamer commented 9 years ago

Check out caret and mlr, "two attempts to create a unified framework across all types of algorithms for the various steps of machine learning in R (pre-processing data, training, testing, hyper-parameter optimization, etc.)." Sounds similar to @rinuboney 's stated goals.

https://github.com/topepo/caret https://github.com/berndbischl/mlr

amueller commented 9 years ago

I don't believe in a single perfect library for any purpose. trade-offs have to be made in all cases and a unified API and separate packages allows you to change the trade-offs immediately.

For scikit-learn, many of the trade-offs are in terms of API.

On the other hand, a unified API is what brings people to scikit-learn, even before we had as many [and as fast] algorithms as we have now. One of the reasons people convert from R to Python is that scikit-learn provides a unified and simple API.

rinuboney commented 9 years ago

They are similar. Haven't noticed it. I'll look into them in detail. @datnamer Thank you for pointing them out.

rinuboney commented 9 years ago

@amueller I'm a scikit-learn user and I really like the API. It makes ML really simple for beginners. I hope to make Julia ML packages accessible through a similar API.

IainNZ commented 9 years ago

I'd be pretty damn happy with a Julia caret

rinuboney commented 9 years ago

It would be awesome if I get a chance to work on it as part of JSoC. Please do have a look at my proposal. Any feedback would help me refine the ideas.

JuliaStats / Roadmap.jl

Machine Learning Roadmap #11