Machine Learning Roadmap

lindahua commented 10 years ago

Currently, the development of machine learning tools are in several different packages without little coordination. Consequently, some efforts are repetitive, while some important aspects remain lacking.

Hopefully, we may coordinate our efforts through this issue. Below, I try to outline a tentative roadmap:

Generalized Linear Models
- Linear Regression
- Logistic Regression
- Lasso, Elastic Net, and its variants
- Stochastic Gradient Descent
Current efforts: GLMNet, GLM, Regression
Support Vector Machines

Current efforts: SVM, LIBSVM
DimensionalityReduction
- PCA
- ICA
- CCA
- Linear Discriminant Analysis
- Kernel-based methods
Current efforts: DimensionalityReduction
Non-negative Matrix Factorization

This may be categorized into dimensionality reduction. However, NNMF in itself has a plethora of methodologies, and thus deserves a separate package.
Classification

There are many techniques for classification. It may be useful to have multiple packages respective techniques (e.g. GLM, SVM, kNN), and have a meta-package Classification.jl to incorporate them all.
Clustering

Current efforts: Clustering.jl
Many machine learning applications also require some supporting functionalities, such as performance evaluation, data preprocessing, etc. These can all go into MLBase
Probabilistic Modeling (e.g. Bayesian Network, Markov Random Field, etc)

This is a huge field in itself, and may be discussed separately.

cc: @johnmyleswhite @dmbates @simonster @ViralBShah

Edit

I created an NMF.jl package, which is dedicated to non-negative matrix factorization.

Also, a detailed plan for DimensionalityReduction is outlined here.

johnmyleswhite commented 10 years ago

I agree with all of this. I've got a lot of prototype SGD code already.

I like the idea of meta-packages. If we're going to have Classification.jl, maybe Regression.jl should be a similar meta-package?

jiahao commented 10 years ago

I'm not an expert in this area, but I've been interested for awhile and am willing to help.

lindahua commented 10 years ago

@johnmyleswhite: Will you please move Clustering, SVM, and DimensionalityReduction over to JuliaStats? These are very basic for machine learning. I recently get some time to work on those.

For regression, when there are several quite different techniques implemented, it will make sense to make a meta package.

johnmyleswhite commented 10 years ago

I transferred Clustering and SVM over. I'm going to announce that I'm moving DimensionalityReduction over, then we can go ahead and make the move tomorrow.

lindahua commented 10 years ago

Also, I think it is important to separate packages that provide core algorithms and those integrated with DataFrames.

We may consider to provide tools such that they can be worked nicely with machine learning algorithms. However, I think core machine learning packages should not depend on DataFrames -- which are not used as frequently in machine learning.

johnmyleswhite commented 10 years ago

I agree completely. I would very strongly prefer that we implement integration with DataFrames in the following way throughout all packages:

Packages should always define algorithms that operate on Vector{Float64} and Matrix{Float64}.
DataFrames.jl exposes a set of tools via formulas that translate between DataFrame and Matrix{Float64}.

This makes it easy to work with pure numerical data without any dependencies on DataFrames, while making it easy for people working with DataFrames to take advantage of the core ML algorithms by efficiently translating DataFrames into matrices.

johnmyleswhite commented 10 years ago

The only hiccup with what I just described is deciding whether the interfaces that mix DataFrames + ML should live. Arguably there should be one big package that does all of this by wrapping the other ML packages with a DataFrames interface.

lindahua commented 10 years ago

@johnmyleswhite are there issues of providing these in DataFrames.jl ?

johnmyleswhite commented 10 years ago

Providing what?

lindahua commented 10 years ago

Sorry, I seemed to misread part of your comments. I agree with your suggestions.

lindahua commented 10 years ago

Just that I am not sure whether we really another meta-package to couple DataFrames and ML, if the tools provided in DataFrames are convenient enough.

johnmyleswhite commented 10 years ago

You're right: we could encourage users to explicitly call the DataFrame -> Matrix conversion routines. That would simplify things considerably.

johnmyleswhite commented 10 years ago

The two main difficulties with this approach:

Getting the community to adopt this kind of strategy consistently.
Dealing with packages that legitimately need additional information to do their work. In GLM, for example, the entire model estimation steps need nothing more than access to the design matrix. But presenting the results in a convenient way requires access to the information about the original coefficient labels.

lindahua commented 10 years ago

For GLM, my consideration is to have two packages:

A package that provides the core algorithms that only work with numerical arrays.
A higher-level package that builds on top of the core package that provides more friendly interface. (This package may depend on DataFrames)

lindahua commented 10 years ago

So this is basically your idea of having a higher-level package that relies on core ML packages + DataFrames to provide useful tools for analyzing data frames.

IainNZ commented 10 years ago

On phone right now, but weren't there some CART/Random Forest packages if not in METADATA then just mentioned in mailing list? One thing about those is that they can use factors quite well, so I imagine would be directly dependent on DataFrames as that is the package-of-choice for representing that kind of data. So when talking about best practices etc. it might be worth keeping in mind that some packages might really be most efficiently made on top of DatFrames instead of the Matrix{Float64} abstraction

lindahua commented 10 years ago

Decision trees, by their nature, can work on heterogeneous data (each observation may be composed of variables of different kinds). For such methods, implementation based on DataFrames makes sense. I don't mind a decision tree package depending on DataFrames.jl

There do exist a large number of machine learning methods (e.g. PCA, SVM, LASSO, K-means, etc) that are designed to work with real vectors/matrices. Heterogeneous data need to be converted to numerical arrays before such methods can apply. Packages that provide such methodologies are encouraged to be independent of DataFrames.

johnmyleswhite commented 10 years ago

You're right: there's a DecisionTree package.

To me, working with factors is actually a really strong argument for pushing a representation of categorical data into an earlier layer of our infrastructure like StatsBase. But we're actively debating ways to do this in JuliaStats/DataArrays.jl/issues/73.

If we could avoid some of the issues @simonster raised in his issue, I think it would be a big help to move the representation of categorical data closer to Julia's Base.

Also worth keeping in mind that nominal data is often worked with using dummy variables, which do fit in the Matrix{Float64} abstraction. That's actually how GLM handles those kinds of variables.

If DecisionTree.jl needs DataFrames.jl, I fully agree with Dahua: that's not a problem. But if it only needs a simpler abstraction, pushing things towards that simpler abstraction seems desirable.

simonster commented 10 years ago

There are some cases where Matrix{Float64} is too specific an abstraction. I have experimented with fitting point process GLMs in Julia, where the design matrix is theoretically expressible as a Matrix{Float64}, but it would require a huge amount of memory (for my models, probably >100 GB). On the other hand, it is easy to express the design matrix as an AbstractMatrix{Float64} that efficiently implements A_mul_B! and At_mul_B!. I wrote code that does this and directly minimizes the negative log likelihood via L-BFGS using NLopt, which fits my model in a reasonable amount of time with reasonable memory requirements, but I'm not sure what to do with this code, since the GLM package is still about 3x faster with a Matrix{Float64} (for the benchmark included with the GLM package with the same convergence criterion, excluding the non-negligible time to construct the ModelFrame).

As far as the model fitting interface for DataFrames, it would be cool if we could get this to work on top of StatisticalModel. Packages could implement:

fit(::Type{MyModelType}, X::AbstractMatrix, y::AbstractVector, args...)

and DataFrames could implement:

function fit{T<:StatisticalModel}(::Type{T}, f::Formula, df::DataFrame, args...)
   mf = ModelFrame(f, df)
   DFStatisticalModel(mf, fit(T, ModelMatrix(mf).m, model_response(mf), args...)
end

or similar. DFStatisticalModel could provide a wrapper that maps between coefficients and their labels when calling coef, predict, etc. Of course, doing this right requires that we have a reasonable StatisticalModel interface (#4) so that we can make the relevant functionality accessible for DataFrames.

jiahao commented 10 years ago

There are some cases where Matrix{Float64} is too specific an abstraction.

This sounds a lot like the discussion we had in JuliaLang/IterativeSolvers.jl#2 a little while ago.

andreasnoack commented 10 years ago

@simonster GLM can use a sparse model model matrix, but I think you'll have to define your own subtype of LinPred.

ViralBShah commented 10 years ago

It would be great if as part of the roadmap, we can also plan to put together some large datasets in place, so that the community can work on optimizing performance and designing APIs accordingly. Having RDatasets is so useful, and something that makes large public datasets easily available for people to work with will greatly help this effort.

lindahua commented 10 years ago

@ViralBShah Good point. Datasets are important. I think we already have a MNIST package, we can definitely have more.

Just that we need to be cautious about the licenses that come with the datasets.

johnmyleswhite commented 10 years ago

There are surprisingly few large data sets that are publicly available. I'd guess that the easiest way to generate "large" data is to do n-grams on something like the 20 Newsgroup data set. Classifying one of the newsgroup against all the others is a simple enough binary classification problem that we can scale out to arbitrarily high size (in terms of features) by working with 2-grams, 3-grams, etc. Other useful examples might be processing the old Audioscrobbler data (http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html) or something similar.

ViralBShah commented 10 years ago

We also have CommonCrawl.jl. The point about the datasets is not as much to distribute them as julia packages, but to have easy APIs to access them, load them, and work with them. Often, I find that the pain of figuring out all the plumbing is enough to discourage people, and making the plumbing easy could get a lot more people to contribute.

ViralBShah commented 10 years ago

Perhaps not too big, but there's also the Netflix and MovieLens datasets - which could be made easier to access.

johnmyleswhite commented 10 years ago

The Netflix data set is illegal to distribute.

gibiansky commented 10 years ago

Question from an outsider - is there anything along the lines of Theano (from Python) in the works for Julia? Development of many deep learning models (RNNs) is sped up dramatically by AD-style software like Theano, and would allow the integration of deep learning into Julia much more easily...

johnmyleswhite commented 10 years ago

There are several AD tools in the works. Check METADATA for a few. There are also some GPU code-gen tools as well, including OpenCL.jl. Eventually it should be possible to combine those two into something like Theano.

ccsv commented 10 years ago

I would like to see something for association rule learning and neural networks

Other things needed should are gridsearch for finding hyperparameters (if not already implemented), Naive Bayes (with the +1 smoothing cases), and Restricted Boltzmann machine

IainNZ commented 10 years ago

Just pinged @benhamner, saw he listed another machine learning package and I'm hopeful for collaboration on shared interfaces

BigCrunsh commented 9 years ago

My feeling is that most of the packages are better suited in a separated JuliaML group; that would support the consistency. What is the disadvantage of having an own group?

IainNZ commented 9 years ago

No disadvantage, although I think they will be tightly linked if common functionality is re-used as much as possible. But the number of incompatible ML packages is starting to worry me...

BigCrunsh commented 9 years ago

Agree, me too... would be a chance to unify things.

lindahua commented 9 years ago

Lately, dimensionality reduction (MultivariateStats.jl), clustering (Clustering.jl), and nonnegative matrix factorization (NMF.jl) have been in a usable shape.

However, there is one big area which is still in a messy status, that is, Generalized Linear Models / Regression. There have been several packages alone this line, implementing more or less similar functionalities, but they do not work with each other.

I will open a new thread to discuss how we may proceed to unify the efforts in this domain.

rofinn commented 9 years ago

@lindahua I'm not sure what the status is on this topic anymore, but I noticed this thread while comparing MLBase.jl and MachineLearning.jl as utility libraries for experimenting with Boltzmann.jl. I see @benhamner hasn't responded to the issue from 6 months ago, but would it make sense to integrate the two packages together? I'm particularly interested in some kind of pipeline api like this in one of the base libraries.

johnmyleswhite commented 9 years ago

@Rory-Finnegan: Right now I feel like there's no one with the time to take control of this project and give it the direction it needs. If you're feeling like demoing something that out, that seems like a good idea to me.

benhamner commented 9 years ago

@Rory-Finnegan just saw this - had missed the original ping (my Github notifications are spammed by internal Kaggle repos).

At this point, MachineLearning.jl has been a small playground I've touched here and there on the side as time permitted.

As @johnmyleswhite said, "right now I feel like there's no one with the time to take control of this project and give it the direction it needs." Definitely applies to me as well (at least for the visibility I have over the next 3 weeks, and likely longer). I've not looked closely at MLBase.jl yet, but need to. If you want to step up, go for it!

amueller commented 9 years ago

I'd love to have a look, but my julia skills are a bit below par ;) I'll see if I can find the time.

ViralBShah commented 9 years ago

What can we do as part of a GSoC project this summer to make progress here. Can we pick a small set of things to target to go further from where we are. I suspect there are lots of potential contributors, but we need someone to build a bit more of a framework before others can jump in. Someone here mentoring a GSoC student could get quite a bit of work done.

Should we also think of wrapping existing R and python libraries, and get the APIs right to start with, and then piecemeal, replace the underlying implementations? This python ML document was trending on HN today:

https://docs.google.com/a/fourthlion.in/document/d/1YN6BVdReNAYc8B0fjQ84yzDflqmeEPj7S0Xc-9_26R0/preview?sle=true

rofinn commented 9 years ago

So to keep the ball rolling on this topic I've created another repo with a README summarizing what I'm thinking I'd like in this base library. https://github.com/Rory-Finnegan/Learn.jl If you have time please take a look and post feedback. I'll admit that I'm coming from a sklearn background.

swgregg commented 9 years ago

I'm new to Julia and starting to use some VLM and NN in my graduate research in mechanical engineering. I saw the reference to GSoC and am eager to potentially contribute to an ML or statistical learning project this summer. Anyone know if there will be a Julia project along these lines for GSoC?

johnmyleswhite commented 9 years ago

I suspect there's not going to be an ML GSoC project since there's no one who's got time to mentor a student.

johnmyleswhite commented 9 years ago

@Rory-Finnegan: I like your proposal a lot. I think the best thing you bring up is that we can use "interfaces" to solve the biggest blocker we came across earlier: the lack of a coherent hierarchy that we could place most models into.

lindahua commented 9 years ago

Sorry for being inactive for months.

Since I assumed the job of being an assistant professor last September, I feel that my life completely changes. I am now leading a group of PhD students and I find it is difficult to spare time to write codes myself. When I talked to a faculty member at Univ. of Toronto, I was told that "once become a faculty, fun life of coding is over" -- this is true, but a bit sad.

Another problem is that with the thriving of the entire deep learning business, many classical machine learning methods, like many we are discussing here, are quickly becoming irrelevant for the ML community. We should seriously reconsider the way going forward. I believe that logistic regression, linear regression, SVM classifiers, etc should no longer be considered as standalone procedures, instead they should be treated as building blocks to construct more sophisticated systems. Therefore, the interactions between them should be taken seriously.

swgregg commented 9 years ago

I see that Julia is no longer listed in the accepted organizations on GSoC (or maybe it was never listed this year and I was looking at last year previously). I may still try to get involved with GSoC, however, I'm interested in contributing to ML in Julia either way. My background is not CS and I have primarily just coded in Matlab for engineering projects. That said, I am working on school projects right now for an algorithms class and may try to implement some of the project work in Julia to get a feel for it.

In my masters research I plan to use ML techniques to help perform condition monitoring on either wind turbines or hydroelectric turbines (depends on funding). I'm working with Matlab now as I learn the statistics behind SVM and NN, however, I would like to use open source software for my research. I will be spending at least part of the summer developing my coding skills and Julia seems like a good fit with opportunities for contributing at a fundamental level while deepening my understanding of ML.

I realize I am an outsider and haven't had enough time to wet my feet in this community. That said, I am on spring break now and plan to start using Julia for a dynamic programming project due in a few weeks. If anyone out there has input on how I may be able to help this summer (preferably in ML/statistics and taking into account my skill set) please let me know. I was a senior controls engineer with 12 years experience before I decided to come back to school... so I don't necessarily need mentoring, just a point (or kick or shove) in the right direction and honest feedback.

ViralBShah commented 9 years ago

We will most likely have other ways to have the equivalent of GSoC funding this summer. We are working on this, and while nothing is firm, we will announce once ready.

Cc: @alanedelman

ViralBShah commented 9 years ago

@swgregg IMO, a practical way to start off with a github repo for the project you are working on, and file issues against relevant packages as you run into missing functionality, roadblocks, or design issues.

swgregg commented 9 years ago

Thank you for the advise. I will do just that.

Date: Sun, 8 Mar 2015 05:35:32 -0700 From: notifications@github.com To: Roadmap.jl@noreply.github.com CC: engineer_gregg@hotmail.com Subject: Re: [Roadmap.jl] Machine Learning Roadmap (#11)

@swgregg IMO, a practical way to start off with a github repo for the project you are working on, and file issues against relevant packages as you run into missing functionality, roadblocks, or design issues.

— Reply to this email directly or view it on GitHub.

rofinn commented 9 years ago

@lindahua So I definitely agree that part of the goal should be to standardize the interactions between models, since more and more people are stacking/combining different techniques (like using a dimensionality reduction alg in front of a classifier). The simplest way of dealing with this seems to be a model container type like Pipelines, in which the container is responsible for validating the interactions between models. If you have a better idea please let me know.

I'm not sure I agree that classic machine learning techniques are becoming irrelevant or that they shouldn't be used as standalone procedures. Especially since not everyone who uses these ML techniques are necessarily part of the ML community. Sometimes folks might just want to use a simple regression or SVM for their problems. Either way, I don't see any reason we can't support both approaches. :)

I'm still pretty new to deep learning so I could be missing an important step, but for deep belief networks, deep neural networks, etc couldn't we also just express them as a Pipeline? Specifically, that you'd build a deep learning architecture by subtyping Pipeline adding in whatever model units you want and then applying a fine tuning algorithm like backprop. We could probably also organize it so that Pipeline is a LearningModel, allowing for recursive nesting of Pipelines (once again, this would be made easier and cleaner with some kind of traits or interfaces system). Unfortunately, I'm not sure how this approach would relate to existing deep learning packages like Mocha.jl. Thoughts? FYI, I'm currently looking at deep learning architectures for my thesis, so I'll probably want to integrate my code with this package anyways.

If you're okay with that approach, I could probably find some time this week to flesh out the Pipeline stuff (or w/e we want to name the container type).

JuliaStats / Roadmap.jl

Machine Learning Roadmap #11