Overall roadmap - Githubissues

johnmyleswhite commented 10 years ago

This outlines a roadmap for basic statistical functionality that Julia needs to offer. It is heavily drawn from the table of contents for MASS.

[ ] Data processing DataFrames.jl
- reshape
- cast
- melt
- plyr
- ddply
[ ] Probability distributions Distributions.jl
- Univariate distributions
- Multivariate distributions
- Matrix distributions
[ ] Statistical graphics Gadfly.jl
- Render to native windows by default in REPL
- Add missing functionality found in ggplot2
[ ] Resampling Methods Resampling.jl
- Bootstrap
- Nonparametric
- Parametric
- Cross-validation
- Leave-one out
- k-Fold
- Random resampling
[ ] Linear regression GLM.jl
- OLS
- Robust regression QuantileRegression.jl
- ANOVA's
[ ] Generalized linear models GLM.jl
[ ] Nonlinear least squares LsqFit.jl
[ ] Tree-based methods
- Decision trees DecisionTree.jl
- Random forests
- Hierarchical clustering
[ ] Neural networks Mocha.jl
- Perceptrons
- 1-hidden layer
- Deep neural networks
- Convolutional neural networks
[ ] Generalized additive models
- gam
[ ] Random and mixed effect models MixedModels.jl
- lmer
[ ] Clustering Clustering.jl
- k-Means
- dp-Means
- DBSCAN
- Affinity propagation
[ ] Dimensionality reduction MultivariateStats.jl
- PCA
- MDS
- ICA
- NMF
- tSNE tSNE.jl
[ ] Factor analysis
[ ] Support vector machines
- LibSVM wrapper LIBSVM.jl
- Pure Julia SVM's SVM.jl, RegERMs.jl
[ ] Survival analysis
- Cox model
[ ] Time series analysis TimeSeries.jl
- Datetimes Datetime.jl
- AR(p)
- ARIMA
[ ] Spatial statistics
- Kriging
[ ] Optimization JuliaOpt
- Non-linear optimization Optim.jl, NLopt.jl
- Convex optimization Convex.jl
- Linear programming JuMP.jl
- Quadratic programming JuMP.jl
[ ] Database access
- ODBC ODBC.jl
- MySQL
- Postgres
- Sqlite SQLite.jl
[ ] Text mining TextAnalysis.jl
[ ] MCMC Lora.jl, Mamba.jl, Jags.jl, Stan.jl
- Gibbs sampling Lora.jl, Jags.jl
- Slice sampling Lora.jl, Mamba.jl
- HMC Lora.jl, Stan.jl
- NUTS Lora.jl, Mamba.jl, Stan.jl
[ ] Regularized regression GLMNet.jl, Lasso.jl, LARS.jl
- Grouped LASSO
- Streaming methods
- Parallel methods
[ ] Bayesian non-parametrics
- DPMM
- Pitman-Yor process
- HDP
- IBP
- HBP
[ ] Gaussian processes
[ ] Structured learning and prediction
[ ] Social network analysis Graphs.jl
[ ] R interface Rif.jl, RCall.jl
[ ] Generalized non-linear models

lindahua commented 10 years ago

This is a great list. Thanks for doing this, John.

From the perspective of my machine learning research, there are several major classes of things that remain needed:

Regularized regression: this include classical quadratic/L2-regularized linear/logistic/probit regression, LASSO, grouped LASSO, etc.

Despite being around for decades, the research in this area is still very active. New optimization methods to solve such problems are developed each year. Recent focus gradually switches to large-scale problem and the use of randomized methods, such as SGD and parallel methods.
Bayesian nonparametric models: nonparametric mixtures (e.g. DPMM, Pitman-Yor process, HDP, etc), nonparametric feature extraction (IBP, HBP, etc), and Gaussian processes.
Deep learning: needless to say, this is one of the most heated area in ML these days.
Structured Learning & Prediction: Structured SVM, Conditional Random field, etc.

johnmyleswhite commented 10 years ago

I wholeheartedly agree with all of those. What do you think of GLMNet for batch processing L1/L2 regularized regression? I'm inclined to suggest people use that as their canonical implementation unless they need to stream data (SGD) or work across machines.

lindahua commented 10 years ago

I agree with using GLMNet as default implementation for ordinary use.

iamed2 commented 10 years ago

Regarding Postgres: I had planned a libpq wrapper for Julia but I cannot get Julia+Clang+Clang.jl working in any of my Julia environments (multiple compounding issues with segfaults, Pkg fetch issues, compilation failures, linkage failures, etc.). Once those issues clear up I can start on libpq.jl.

After writing a wrapper for MATLAB through MEX, I'm excited to do it in a language with real capabilities.

diegozea commented 10 years ago

Maybe the list needs to include a safe way to use R packages inside Julia, like PyCall for Python. We have actually Rif.jl, which needs some job.

IainNZ commented 10 years ago

+1 to glmnet

For "convex optimization" we have Ipopt.jl, but we don't have a convex optimization modelling language ala CVX yet. That should follow nicely on from the automatic diff stuff, which is the main blocker right now.

nalimilan commented 10 years ago

I think you should add Text Mining, with a link to you TextAnalysis.jl package. :-)

Maybe Generalized Nonlinear Models too, though that's probably only used in a small community. Cf. e.g. http://cran.r-project.org/web/packages/gnm/index.html

johnmyleswhite commented 10 years ago

Thanks for all the good suggestions everyone. I'm going to revise the list now. One thing it's missing is MCMC support.

mschauer commented 10 years ago

A while ago I was wondering if it was worthwhile to have abstraction(s) for Stochastic processes similar to Distributions.jl which is an abstraction of a Distribution with means to sample from it. I think it is likely not a good idea to try to cover discrete time processes, continuous time processes and random fields with a single datatype, but for each of those this would be reasonable. I had a very preliminary look in https://github.com/mschauer/StoPro.jl/blob/master/src/StoPro.jl how adapting the Distribution.jl approach feels for continuous time.

lindahua commented 10 years ago

Unlike real-valued or vector-valued distributions, the samples of stochastic processes can exhibit a wide variety of forms. For example, a sample of a time series process is a sequence of values (whose length may or may not be fixed), a sample of a point process (e.g. Poisson process over arbitrary space) is a set of points, while a sample of a Gaussian process is a function. It is extremely difficult to subsume all these under the same umbrella.

I think it might be a better strategy to have multiple packages, each for a specific family of stochastic processes.

johnmyleswhite commented 10 years ago

While I agree with Dahua, I do believe that we can hope that the stochastic process types support many of the same operations: rand, pdf, etc.

mschauer commented 10 years ago

I was unsure about the right abstraction, exactly like you Dahua, John. I tend to group them according to the "time"/indexing variable. that might be too rough, but to get an idea, in the example I took two fairly different processes in continuous time (Poisson, Brownian motion) and looked, if there is any synergy effect: augment(P::VecProc, W :: GenVecProcPath, s) in the example takes either of those and augments retrospectively samples both processes on a finer grid. I agree that the functional view which is eminent for Gaussian processes is not easily incorporated.

lindahua commented 10 years ago

@johnmyleswhite When the covariates are finite and discrete (e.g. time series), it is feasible to do rand or pdf. However, there do exist a lot of stochastic processes, where the covariates are continuous (or very large). In such cases, people never do rand or pdf -- it is simply impossible to represent even a single sample (e.g. Gaussian processes and Dirichlet process).

As you know, these stochastic processes are widely used in practice, but in a way that is very different from how people typically work with ordinary distributions.

That being said, I think it is possible to group many stochastic processes for time-series in the same module (as they share similar operations). I would imagine that we will eventually have a system of packages for different kinds of stochastic processes, e.g.

TimeSeriesProcesses
PointProcesses
GaussianProcesses
DirichletProcesses (actually there is a larger family of processes that are similar to DP, e.g. Pitman-Yor or normalized random measure, etc)

mschauer commented 10 years ago

Aren't you too pessimistic? Especially for processes with Markov property and tractable transition distribution one has a meaningful concept of sampling the process and a need to describe families of those transition distributions in a consistent way.

lindahua commented 10 years ago

@mschauer For processes with a time covariate, Markov property and transition distribution definitely make sense.

But there are still a lot of processes (in a general sense) that do not even have such notions. For example, a Gaussian processes over an N-D space, I don't think notions like Markov property & transition distribution would even apply.

For such processes, people typically do not want to draw an entire sample from them. Instead, they want to draw a finite part of a sample (the entire sample itself is infinite, and therefore it cannot be represented by a computer).

johnmyleswhite commented 10 years ago

FWIW, the Gaussian process case is, in some respects, like my interest in a discriminative model type: you want a rand function, but the function needs to predicated on information: for something like regression, you need covariates, for a GP you need the grid of points over which you want a sample returned.

lindahua commented 10 years ago

John, you are right that we still want to do sampling & likelihood evaluation in some way. Therefore, sampling and likelihood evaluation functions are still needed for most stochastic processes. However, the interface of such functions for different kinds of stochastic processes is probably very different.

What I suggest is that the development can be done in several packages (instead of one), e.g. one for Markov processes, one for GPs, and another for DPs. etc. Since the way we work with the stochastic processes that fall into these different categories are vastly different, I don't think it is necessary to set up a uniform interface on the outset.

If some common things that come up during the evolution of these packages that we think are useful across facilities, we can always refactor things at that point.

johnmyleswhite commented 10 years ago

I think we're in complete agreement. Best to find patterns after code gets written than speculate a priori.

mschauer commented 10 years ago

Oh, there is no fault in coordinating these kind of things. So I would like to keep in mind, that there are plenty of connections, so for example a Data type for discrete observations of continuous time processes will be of interest and it is good to have a place for coordination of those of us who interested, so thank you John.

jiahao commented 10 years ago

I'd like to throw in RandomMatrices as well. It's self-contained, but better interop with statistics functionality would be useful and cut down on redundancy.

carljv commented 10 years ago

This is a great list. I think maintaining a well thought-out roadmap is really helpful for both current and potential contributors. Would it make sense to break at least some of these off into separate roadmaps, though? As is, each package seems to thin-featured---but this list is going to get unwieldy fast if you start adding features and details. It might also help congeal some more specialized working groups around package roadmaps. (E.g., I don't know what half the things under Bayesian Nonparametrics mean, but your listings under Time Series and Survival just make me sad :)).

johnmyleswhite commented 10 years ago

I definitely think we should have separate issues that expand each subtree of this roadmap. I have a few other issues as well that I'll be posting, including a request for standardization of keyword arguments across the whole ecosystem (e.g. every function that has a max number of iterations should call that keyword maxiter).

diegozea commented 10 years ago

http://sumsar.net/blog/2014/01/bayesian-first-aid/

johnmyleswhite commented 10 years ago

That's cool work. I'm not sure I think we should prioritize it just yet, since it's not a conventional approach yet.

jdtuck commented 9 years ago

One thing to add to the list would be functional data analysis similar to the fda and fdasrvf packages in R. I am the author of the R package and would love to help port it over.

johnmyleswhite commented 9 years ago

That would be great. I might wait until 0.4 is released so that you don't have to redo things to cope with the NA -> Nullable transition.

pluskid commented 9 years ago

For the "Neural networks" section: I'm recently writing a neural network package for julia: Mocha (this name because it is deeply inspired by the very popular C++ deep learning framework caffe). Currently I already have a working CUDA backend with an example of a deep convolution network with the LeNet architecture on MNIST. I will try to submit to the package index once I finish the CPU backend and add proper documents.

lindahua commented 9 years ago

@pluskid That's pretty cool.

I am actually considering just porting Caffe to Julia. Now my entire research group relies on Caffe to do things, and it is tempting to port it.

Is Mocha something new inspired by Caffe, or a port of Caffe itself. Does it support the Caffe model file format?

pluskid commented 9 years ago

@lindahua Thanks! Mocha looks like Caffe (general architecture, layers, blobs, solvers, etc.) but they are not completely the same. For example, Mocha uses HDF5/JLD to store snapshot of model, while Caffe uses Google Protocol Buffer (I believe). So it is not directly compatible, but in principle one could write a tool to import caffe models as I just find out that we already have ProtoBuf.jl.

I'm not sure I understanding what you mean by "porting". Re-writing caffe in julia in a compatible way might be quite costly to keep up with the changes as caffe itself is being actively developed. Creating a julia binding for caffe might be more do-able. The benefits is that existing work on caffe could be easily adapted, but the interface might be less flexible and less julia.

lindahua commented 9 years ago

I mean just a Julia binding to Caffe.

Vgrunert commented 8 years ago

Are there any new developments concerning the topics of the roadmap? I would like to make a case for developing non - and semi-parametric regression models.

johnmyleswhite commented 8 years ago

I don't think there's anyone right now who has time to manage this roadmap. I'm just working on nullable stuff for now since that still needs a lot of work.

datnamer commented 8 years ago

Hi @johnmyleswhite. Do we have an approx time line on the new nullable datatable / df replacement?

No rush, just trying to figure out when/what logistics for a project. Thanks

papamarkou commented 8 years ago

@johnmyleswhite, this issue just came to my attention. Great high-level organization, thanks for doing this.

I put all my Julia time, effort and focus on Lora, because my own research is related to MCMC methodology to a great extent, plus because I am very excited about this topic (plus because I will be writing a book on Monte Carlo methods with Julia, including Lora as a sub-part of it).

This Lora effort has been going on for about 6 months in one of the Lora dev branches and is neatly organized via issues and milestones. My timeline goes quite well with regards to having the first draft of a major upgrade by December, even earlier perhaps.

As for the rest of great topics/packages you outlined, I don't have the bandwidth to get involved (Lora is usually taking up part of my sleep to find enough time for its development).

JuliaStats / Roadmap.jl

Overall roadmap #1