JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Related efforts in the Julia ecosystem. PP, autoML, formulae, visualization, and others. #47

Open datnamer opened 5 years ago

datnamer commented 5 years ago

Very exciting to learn about this effort! A Julia-native ML package improving on scikit-learn is one of the key missing pieces of the ecosystem.

Here's a list of ideas I'd like to bring to your attention, if you haven't considered them already. Some would be very long-term projects that I hope to help with, if they are even within scope. I can open issues for any that deserve their own.

  1. There's already been a relatively developed (yet stalled) effort for something along these lines in the JuliaML ecosystem. Might want to consider integration or lifting of ideas: https://github.com/JuliaML

  2. Integration with a probabilistic programming framework like Turing.jl would be really cool. They have non-gradient samplers that can work with arbitrary Julia code, along with HMC. It would be cool if point parameters and priors could be mixed in a model (or model search as probabilistic program induction) with different sampling/optimization strategies. https://github.com/TuringLang/Turing.jl cc @yebai

  3. Regarding architecture search and "AutoML", here are some Python exemplars: https://github.com/automl/auto-sklearn, https://github.com/jhfjhfj1/autokeras, https://github.com/EpistasisLab/tpot. TPOT can optimize over non-differentiable pipelines using genetic programming.

  4. Tables.jl, by @quinnj, is an alternative table interface which much of the stats ecosystem is coordinating around, and which can also hook into iterable tables, though IIRC there are some issues with missing-data interop.

  5. One package with such integration is StatsModels.jl, which has a very powerful formula interface that works with abstract tables. It would be cool (and an improvement over sklearn) to integrate with this.

  6. I'm looking to work on some graph NN stuff, so it would be great to have support for non-Euclidean input data à la https://github.com/rusty1s/pytorch_geometric. This is both bringing NNs to graphs and bringing graphs to NNs as useful "inductive biases": https://arxiv.org/abs/1806.01261

  7. Yellowbrick-style Plots.jl or Makie recipes

tlienart commented 5 years ago

Anthony or Franz will probably give a more detailed and better answer but here's my take on your points from what I understand of the MLJ plan:

  1. unless I misunderstand it, the JuliaML ecosystem does not provide something that offers composable models, which is the core of the MLJ idea (another way to frame it is: sklearn pipelines but much better; see the sketch after this list). However, JuliaML provides tools that could be used in MLJ, such as metrics, which may replace some of the things that are in MLJ.
  2. I don't think there's a plan to integrate probabilistic programming in MLJ, and I also don't think it would be a good idea, given that both the philosophy and the use case are very different. AFAIK PPLs are more in the realm of research, whereas MLJ hopes to be a practical ML library that can be used for real applications and ideally work on heterogeneous/distributed architectures. But it could be good to interface with Turing.
  3. Automatic HP tuning is definitely something that will be looked into afaik.
  4. I think that's already the plan
  5. An interface to statsmodel is probably a good idea
  6. cool!
  7. also a good idea though I think it's probably too early at this point
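
To make the composability point in item 1 concrete, here is a minimal sketch of what composition could look like, written against present-day MLJ syntax; the `|>` pipeline operator, `Standardizer`, `@load_iris`, and the `@load` macro are assumptions about the eventual API rather than something fixed at the time of this discussion:

```julia
using MLJ

# Assumed API: load a toy dataset and a registered model type.
X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree

# A pipeline is itself a model: its components' hyperparameters stay exposed,
# so the composite can be wrapped, tuned, or nested in a larger network.
pipe = Standardizer() |> Tree(max_depth=3)

mach = machine(pipe, X, y)
fit!(mach)
predict(mach, X)   # predictions from the composite model, treated like any other model
```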

Ps: you may want to join the slack channel if you haven't already 😄

datnamer commented 5 years ago

Thanks for your reply. That all makes sense.

Re PPL, it would be nice to be able to wrap a Turing.jl model for use in a pipeline without having to expose learnable parameters. PyMC3 has a scikit-learn wrapper to do that.
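
For what it's worth, a hand-rolled version of that idea might look like the sketch below; `bayes_linreg` and `fit_predictor` are hypothetical names, and the point is only that fitting samples the posterior while prediction exposes nothing but a callable, so a pipeline never sees the learnable parameters:

```julia
using Turing, Statistics

# Hypothetical Bayesian linear regression (all names are illustrative).
@model function bayes_linreg(x, y)
    α ~ Normal(0, 10)
    β ~ Normal(0, 10)
    σ ~ truncated(Normal(0, 5), 0, Inf)
    for i in eachindex(y)
        y[i] ~ Normal(α + β * x[i], σ)
    end
end

# "Fit" runs MCMC; "predict" is a closure over posterior means, so downstream
# pipeline code never touches α, β, or σ directly.
function fit_predictor(x, y; nsamples=1_000)
    chain = sample(bayes_linreg(x, y), NUTS(), nsamples)
    α_mean, β_mean = mean(chain[:α]), mean(chain[:β])
    return xnew -> α_mean .+ β_mean .* xnew
end

x = randn(100); y = 2 .* x .+ 0.1 .* randn(100)
predictor = fit_predictor(x, y)
predictor([0.0, 1.0])   # point predictions only
```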

Also, regarding JuliaML, it was initially meant as a framework for complex ML pipelines with rich abstractions and nested models as transformations. Here is some info to that effect: https://github.com/JuliaML/LearnBase.jl/blob/master/src/LearnBase.jl

And @evizero can say more

datnamer commented 5 years ago

Here are the right links: https://github.com/JuliaML/Transformations.jl and http://www.breloff.com/transformations/

fkiraly commented 5 years ago

I think I more or less agree with @tlienart; some other comments:

  1. probabilistic programming is essentially the modern, object-oriented formulation of Bayesian MCMC (sometimes variational too). As such, it falls into the category of a method class within the Bayesian framework. Bayesians are, in contemporary practice, not very clear on the separation of fitting versus model application (e.g., prediction) versus evaluation that is reflected in sklearn's fit/predict design, while the PP world is extremely advanced in modular/abstract model building.

Where the task is prediction, the two can be made to fit together; see, e.g., the discussion in section 1.3.4 of https://arxiv.org/abs/1801.00753 for more details on my opinion on the issue, though it is not immediate, as the sklearn and common Bayesian modelling interface designs are differently focused (on model-class-specific model building). I'd see https://github.com/alan-turing-institute/skpro as a design study in what such an interface could look like; it is conditional on a probabilistic learning interface, which sklearn does not have but MLJ now does :-) (a small sketch of MLJ's probabilistic predictions follows after this list).

I agree with @tlienart that this is an exciting research area (with direct real-world applications, e.g., in finance or medicine), though I disagree with the conclusion: I think that makes it especially worthwhile to work on! Though maybe not as one of the current MLJ priorities, for pragmatic reasons.

  2. We've discussed abstract model specification and model description in #22 - I think the high-tech solution would synergize really well with a formula specification interface such as in StatsModels or R. However, again, it might not be a priority given limited resources...
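
To illustrate the probabilistic learning interface mentioned in point 1, here is a hedged sketch using current MLJ syntax (the model, `@load_iris`, and the class label are assumptions for illustration): `predict` returns a vector of distribution objects, and point predictions are derived from them explicitly.

```julia
using MLJ

X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree

mach = machine(Tree(), X, y)
fit!(mach)

yhat = predict(mach, X)      # a vector of UnivariateFinite distributions
pdf.(yhat, "virginica")      # per-observation class probabilities
predict_mode(mach, X)        # point predictions, derived from the distributions
```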

Generally, we're always happy for suggestions/designs/contributions!

The Turing Institute is also always looking for competent people; please consider formally applying here: https://www.turing.ac.uk/work-turing/research-associates or https://www.turing.ac.uk/work-turing/research-assistants (part-time arrangements are possible).

fkiraly commented 5 years ago

Renamed the topic to be more descriptive of its content.

ablaom commented 5 years ago

Many thanks @datnamer for your comments and enthusiasm! Just a few things to add to the other comprehensive responses:

  1. MLJ already has a flexible API for building learning networks beyond simple linear pipelines, for exporting them as stand-alone models, and for tuning their nested parameters (a short sketch of nested tuning follows this list). While "architecture" search should be possible, the immediate priority would be to improve usability of the existing interface, for example by providing standard architectures (linear pipelines, stacks, etc.) out of the box, and to add to the existing tuning strategies (to include AD gradient descent for pure Julia models).

  2. MLJ is indeed attempting to be "data agnostic", and there are two generic tabular data interfaces we have looked at: the Tables.jl interface you refer to, and the Query.jl iterable tables interface (defined in TableTraits.jl). @ayush1999 and I have played around with these, but it is still not absolutely clear to me which is best. At the moment we are using iterable tables, although we currently have a small intermediate interface in MLJBase that could allow us to change our minds later. An important requirement is integration with CategoricalArrays.jl; some other requirements have been discussed here, another here. What we have now works but could be improved.

  3. Will have another look at StatsModels.jl.
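
As a concrete illustration of the "tuning nested parameters" point in item 1, here is a hedged sketch in present-day MLJ syntax; the nested field name `decision_tree_classifier.max_depth` is an assumption about how the pipeline names its components, so inspect the composite to confirm:

```julia
using MLJ

X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree

# A composite model whose nested hyperparameters remain visible for tuning.
pipe = Standardizer() |> Tree()

# Assumed nested field name; check `pipe` for the actual component name.
r = range(pipe, :(decision_tree_classifier.max_depth), lower=1, upper=6)

tuned = TunedModel(model=pipe, ranges=r, tuning=Grid(resolution=6),
                   resampling=CV(nfolds=5), measure=log_loss)

mach = machine(tuned, X, y)
fit!(mach)
report(mach).best_model
```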

There are only a few models for which the MLJ interface has been implemented, and a priority is to implement the MLJ interface for more existing models. Any help in this area is particularly appreciated.

quinnj commented 5 years ago

On point 4, note that the Tables.jl interface is a superset of the iterable tables set of sources/sinks, i.e. any iterable table is also a Tables.jl-compatible source. This change was made to help simplify the ecosystem and allow package developers (and use cases exactly like this one) to rely only on Tables.jl and get everything else for free. Happy to help answer any other questions regarding using Tables.jl.

ablaom commented 5 years ago

Thanks. To clarify: every object X for which TableTraits.isiterabletable(X) is true also implements the Tables.jl interface? But surely not all of it, as column access is not universally adopted by iterable tables, as I understand it. Do you have a link to a relevant discussion?

datnamer commented 5 years ago

Thanks for the feedback everyone, the project seems quite well thought out. Perhaps a blogpost will help garner cooperation from the broader Julia community.

Glad to know many of these things are or have been considered.

@fkiraly , regarding PPL integration, I did mean primarily for prediction and in that vein the reference you linked seems quite intriguing.

@ablaom When you reconsider StatsModels.jl, please do note that a monster PR by @kleinschmidt is pending which will provide far more expressiveness and composability to the modeling interface.
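
For context on what the formula interface offers (independently of the pending PR), here is a hedged sketch of current StatsModels.jl usage; the data frame and variable names are made up for illustration:

```julia
using DataFrames, StatsModels

df = DataFrame(y = randn(6), x = randn(6), g = repeat(["a", "b", "c"], 2))

# A formula is a symbolic, model-agnostic description of the design matrix,
# including transformations and interactions.
f = @formula(y ~ 1 + x + g + x & g)

mf = ModelFrame(f, df)
modelmatrix(mf)    # numeric design matrix, with dummy coding for the categorical g
coefnames(mf)      # names of the generated columns
```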

quinnj commented 5 years ago

@ablaom, let me clarify. Every object X that implements/satisfies TableTraits.isiterabletable(X) also automatically implements/satisfies the Tables.jl interface. This is because Tables.jl first checks if an object implements Tables.jl itself; if not, it checks if the object satisfies TableTraits.isiterabletable, and if so, it knows how to provide the Tables.jl interface for that object. Tables.jl contains fallbacks that ensure that any object that is a "table" can be accessed by both Tables.rows and Tables.columns, allowing users/package developers using Tables.jl to use Tables.rows or Tables.columns as is most convenient for their package, without needing to worry about whether the input table supports one or the other. Hopefully that helps? Let me know if you have any other questions.
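
To make the rows/columns duality concrete, here is a small sketch of how a consuming package might use the interface; the `column_means` helper is hypothetical:

```julia
using Tables, Statistics

# Hypothetical consumer: works for any Tables.jl source, row- or column-oriented,
# because Tables.columns falls back to materializing columns from the rows.
function column_means(table)
    cols = Tables.columns(table)
    return Dict(name => mean(Tables.getcolumn(cols, name))
                for name in Tables.columnnames(cols))
end

# A named tuple of vectors already satisfies the interface:
column_means((a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0]))
```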

ablaom commented 5 years ago

@quinnj Thanks indeed for those details. After taking another look, I have switched MLJ from Query to Tables, which is serving our needs well for now.

DrChainsaw commented 4 years ago

A little late to the party, but here is a related effort w.r.t point 3: https://github.com/DrChainsaw/NaiveGAflux.jl

I'm currently just using it to play around with very loosely restricted search spaces, so there is perhaps not much to offer yet in terms of simplifying tuning, as you'll most likely end up tuning the tuning parameters :)

If someone has a favourite search space and thinks it would be useful to add an MLJTuning plugin for it, I could probably make an effort to make a package, as long as it doesn't require me to sift through Python code of the same type as what is used in the examples in the OP.

ablaom commented 4 years ago

@DrChainsaw Looks like you're doing pretty cool things here, and it would be great to get some integration at some point. We have a basic implementation of MLJ's model interface for Flux, which we are polishing at the moment. Consider it just a POC for now (https://github.com/alan-turing-institute/MLJFlux.jl).

> If someone has a favourite search space

Well, I guess this is the million-dollar question. I'm no expert and would be happy to hear your own thoughts. Do you think, if you know, that what auto-keras and TPOT do is worth emulating?

DrChainsaw commented 4 years ago

Thanks @ablaom

When I started, I was perhaps naively thinking that the search space should not be the million-dollar question if you just had a framework which allowed for arbitrary mutation of the network architecture. Turns out things might be just a little bit more complicated than that, but I still haven't satisfied my curiosity on the subject. Once I do (or maybe before) I will certainly look into integration. From what I have seen, it looks like MLJ has answers for a lot of the API questions I have not yet wanted to address.

I understand the neural network stuff in MLJ is in quite an early phase, but do you foresee that packages of Flux-based tuning methods should depend on MLJFlux, or would they depend on the more basic packages if they want to integrate with the MLJ APIs?

I'm certainly no expert either, and I have not used TPOT or auto-keras myself. I think I have seen people claim some success with them. From just browsing the code on GitHub, it was not super easy to find out what the default search space looks like.

Reimplementing an existing and well-tested method for NAS is probably a safe choice, though. I'm not looking into doing this at the moment, but if someone who reads this would like to tackle it, I'm happy to help make use of the NaiveXX packages. If the method relies on modification of existing network architectures, one should be able to save a lot of effort by using them, and I dare say that they are quite well tested.

ablaom commented 4 years ago

> I understand the neural network stuff in MLJ is in quite an early phase, but do you foresee that packages of Flux-based tuning methods should depend on MLJFlux, or would they depend on the more basic packages if they want to integrate with the MLJ APIs?

You would depend on MLJFlux. The idea of this package is simply to encapsulate any given set of instructions for building a supervised Flux neural network into an MLJ model (which is just a struct storing hyperparameters). In that way, all the MLJ meta-algorithms (evaluation and hyperparameter optimisation) can be applied. However, competing with the flexibility implied by this remit is a desire to make deep learning models look and feel more like traditional models, to make them accessible to users outside of NLP and images. There is some friction here because the different communities value different things, but we are going to have a stab at something and see what people think.
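
As a rough sketch of that encapsulation idea (illustrative only, not the actual MLJFlux API): the "model" is just a mutable struct of hyperparameters, and a separate builder function turns it into a Flux chain, so generic meta-algorithms can mutate the struct and rebuild/retrain without knowing anything about Flux. All names below are hypothetical.

```julia
using Flux

# Hypothetical hyperparameter container; field names are illustrative.
Base.@kwdef mutable struct NetworkSpec
    n_hidden::Int    = 32
    dropout::Float64 = 0.1
    epochs::Int      = 10
end

# The "instructions for building" a supervised Flux network, consuming only the struct.
build(spec::NetworkSpec, n_in::Int, n_out::Int) = Chain(
    Dense(n_in => spec.n_hidden, relu),
    Dropout(spec.dropout),
    Dense(spec.n_hidden => n_out),
)

# A tuning strategy only ever touches the hyperparameter struct:
spec = NetworkSpec(n_hidden = 64)
net = build(spec, 4, 3)
net(rand(Float32, 4))   # forward pass on a single observation
```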