Verbs, revisited... and much more

tbreloff commented 8 years ago

This has been discussed repeatedly, but it's important to get right if we want widespread adoption. Some references:

https://github.com/Evizero/MLModels.jl/issues/12 https://github.com/Evizero/MLModels.jl/issues/3 https://github.com/JuliaOpt/Optim.jl/pull/87 https://github.com/JuliaStats/Roadmap.jl/issues/15 https://github.com/JuliaStats/Roadmap.jl/issues/4 https://github.com/JuliaStats/Roadmap.jl/issues/20

(there are more linked in those issues, and I'm sure I missed a bunch of good conversations)

I recommend a quick skim over those discussions before commenting, if you can find the time.

What are we supporting?

It's important to remember all the various things we'd like to support with the core abstractions, so we can evaluate when a concept applies and when it doesn't:

Static transformations: log, exp, logit, ...
Aggregations: mean, variance, extrema...
Learnable transformations: regressions, neural nets, decision trees, ...
Compression and dimensionality reduction: PCA, ...
Generative models: distributions, stochastic variables, ...

And there are some opposing perspectives within these classes:

Bayesian vs Frequentist
Batch vs Online
Models producing distributions vs point estimates or classifications

All verbs need not be implemented by all transformations, but when there's potential for overlap, we should do our best to generalize.

Take in inputs, produce outputs

The generalization here is that the object knows how to produce y in y = f(x). This could be the logit function, or a previously fitted linear regression, or a decision tree. Options:

transform
~~predict~~ (taken by StatsBase)
~~map~~ (taken by Base)
apply (deprecated in Base... similar to call)
evaluate
classify (too specific)

I continue to be a fan of transform, with the caveat that we may wish to have the shorthand such that anything that can transform can be called as a functor.

Generate/draw from a generative model

rand
sample
simulate
draw
generate

I think using Base.rand here is generally going to be fine, so I don't think we need this as one of our core verbs.

Use data to change the parameters of a model

learn
~~fit~~ taken by StatsBase
train
update
solve
optimize

I've started leaning towards learn, partially for the symmetry with LearnBase, but also because it is not so actively used in either stats (fit) or ML (train), and so could be argued it's more general.

I think solve/optimize should be reserved for higher-level optimization algorithms, and update could be reserved for lower-level model updating.

Types

I personally feel everything should be a Transformation, though I can see the argument that aggregations, distributions and others don't belong. A mean is a function, but really it's a CenterTransformation that uses a "mean function" to transform data.

Can a transformation take zero inputs? If that's the case, then I could argue a generative model might take zero inputs and generate an output, transforming nothing into something.

If we think of "directed graphs of transformations", then I want to be able to connect a Normal distribution into that graph... we just have the flexibility that the Normal distribution can be a "source" in the same way the input data is a "source".

With this analysis, AbstractTransformation is the core type, and we should make every attempt to avoid new types until we require them to solve a conflict.

Introspection/Traits

There are many things that we could query regarding attributes of our transformations:

does it take input data, or is it a source (i.e. a generative process)?
is it invertible?
can we take a derivative/gradient?
is there a proximal operation? (this is not my strong suit!)
can it be learned?

I would like to see these things eventually implemented as traits, but in the meantime we'll need methods to ask these questions.

Package Layout

I think we agree that LearnBase will contain the core abstractions... enough that someone can create new models/transformations/solvers without importing lots of concrete implementations of things they don't need.

We need homes for concrete implementations of:

ModelLoss (MLModels.jl)
ParameterLoss (MLModels.jl)
StaticTransformation (MLModels.jl and others)
LearnableTransformation (MLModels.jl and others)
Solvers/updaters (StochasticOptimization and DeterministicOptimization?)
StatsBase and existing abstractions

StatsBase contains a ton of assorted methods, types, and algorithms. StatsBase is too big for it to be a dependency of LearnBase (IMO), and LearnBase is too new to expect that StatsBase would depend on it. So I think we should have a package which depends on both LearnBase and StatsBase, and "links" the abstractions together when it's possible/feasible. In some cases this might be as easy as defining things like:

StatsBase.fit!(t::AbstractTransformation, args...; kw...) = LearnBase.learn!(t, args...; kw...)

What are the other packages that we should consider linking with?

cc: @Evizero @ahwillia @joshday @cstjean @andreasnoack @cmcbride @StefanKarpinski @ninjin @simonbyrne @pluskid

(If I forgot to cc someone that you think should be involved, please cc them yourself)

cmcbride commented 8 years ago

Learn.jl sounds good. So does JuliaML.jl as it maps to this group and the tools in the associated ecosystem. I also like that 'ML' was dropped from the other package names.

tbreloff commented 8 years ago

Guys... I have a preliminary version of the JuliaML website up. The deployment is only slightly convoluted. I changed the base branch of JuliaML/JuliaML.github.io to be dev, and then I do a "git subtree push" to make the master branch look like it's a static website. The deploy instructions can be found here, though I can do the building as it's needed for the time being.

tbreloff commented 8 years ago

Learn.jl, ObjectiveFunctions.jl, and Transformations.jl are created.

rofinn commented 8 years ago

I'm not sure what would be most useful for me to contribute to, but I guess a few comments after reading through the discussion are.

I'd like the API defined in LearnBase.jl to remain as minimal as possible since we'll probably have an annoying transition period anyways ie: AbstractTypes -> Protocols
I like the idea of using terms like fit! and transform (and I guess also generate) rather than learn! and predict as they better align with the existing wording in StatsBase and I see them as more general terms. Multiple dispatch should be our friend here.
I'm not sure how this fits, but I've grown to really like how the recent refactoring of Boltzmann.jl allows you to build composable learning algorithms by just using a dict (or context) for passing hyperparameters around. The context can include callback functions for calculating penaties, gradients, etc.

ahwillia commented 8 years ago

I like the idea of using terms like fit! and transform (and I guess also generate) rather than learn! and predict

+1 to this. I think we will continually revisit this topic -- we have a ways to go before this ecosystem becomes mature and we are locked in to the verbs and types.

As a guiding philosophy I would really like to (a) sync up with JuliaStats as much as possible and (b) try to make each package as "standalone" as possible.

The caveat to part (b) is that all packages should depend on LearnBase and play nice with each other. But I think we should try to minimize the number of packages we have and (even more importantly) minimize the depth of the package dependency tree (edit: directed acyclic graph :).

ahwillia commented 8 years ago

P.S. Boltzmann.jl looks very good. We should include @dfdx in these discussions if he is interested!

datnamer commented 8 years ago

@Rory-Finnegan are protocols a Julia 2.0, as with traits or a 1.0 feature?

tbreloff commented 8 years ago

Obviously we've been through this, but it's worth reminding. We can't use 'fit' and 'predict' without forcing a dependency between LearnBase and StatsBase, which we don't want to do. I think 'train' is "too ML", and 'learn' hits a sweet spot between disciplines, plus has nice symmetry with LearnBase.jl/Learn.jl.

I want to have nice parallels with JuliaStats and play nice, and part of playing nice is not using their names.

On Thursday, June 30, 2016, Alex Williams notifications@github.com wrote:

P.S. Boltzmann.jl https://github.com/dfdx/Boltzmann.jl looks very good. We should include @dfdx https://github.com/dfdx in these discussions if he is interested!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaML/Roadmap.jl/issues/8#issuecomment-229782444, or mute the thread https://github.com/notifications/unsubscribe/AA492thfpFw8IELesXXJdTNaMNq2kVDhks5qRCnugaJpZM4I_eeP .

rofinn commented 8 years ago

@datnamer my understanding is that interfaces/protocols/traits might make it into 1.0, but that isn't confirmed yet.

@tbreloff I don't think depending on StatsBase would be that bad given that most ML work is built off of stats anyways. However, this might be an argument to work with the JuliaStats community to refactor StatsBase into separate smaller packages.

ahwillia commented 8 years ago

without forcing a dependency between LearnBase and StatsBase

I have come to believe that this should exist. Again, I think something to revisit.

tbreloff commented 8 years ago

Alex: which way would the dependency go and why?

Until there's a real, compelling, functional reason for the dependency, we shouldn't do it. StatsBase is huge, and more specialized. If you think StatsBase should be reorganized, the way to do is it to build a lightweight LearnBase and show how and why StatsBase should be refactored to use LearnBase's abstractions, or some merged version of LearnBase with a slimmed down StatsBase. This won't happen through discussion. We have to build code that is independent, smart, and powerful, and later merge it in just the right way. If StatsBase was sufficient as is, we wouldn't be having this conversation.

On Thursday, June 30, 2016, Alex Williams <notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

without forcing a dependency between LearnBase and StatsBase

I have come to believe that this should exist. Again, I think something to revisit.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaML/Roadmap.jl/issues/8#issuecomment-229794055, or mute the thread https://github.com/notifications/unsubscribe/AA492oMduipuGI5ioCi_TxFRQVoOI1Pkks5qRDTAgaJpZM4I_eeP .

ahwillia commented 8 years ago

When I say "revisit" -- I mean revisit down the line. For now, lets see how far we can get by ourselves.

But I'm already finding myself wanting to import a lot of bits and pieces from JuliaStats for the Transformations package.

import StatsBase: logistic
immutable LogisticTransformation <: Transformation end
transform(::LogisticTransformation, x) = logistic(x)

import Distributions: Poisson

"""
Transform a non-negative variable, x, into poisson random variable with mean x.

x -> ξ,  ξ ∼ Poisson(x)
"""
immutable PoissonTransformation <: Transformation end
Base.rand{T<:NonNeg}(::PoissonTransformation, x::T) = rand(Poisson(x))
transform(::PoissonTransformation, x::T) = Poisson(x)

How does this sit with you? This doesn't mean LearnBase needs to depend on StatsBase, but I think it makes sense for a lot of other packages to import functionality like this.

tbreloff commented 8 years ago

Ok this is a good point. I'm very confident that I don't want LearnBase to depend on StatsBase, but that doesn't preclude us from adding dependencies in other packages. If Transformations is better with a StatsBase dependency then that's great. But we shouldn't add dependencies to anything lightly. It might be that the better solution is a link package that converts Distributions into Transformations automatically.

On Thursday, June 30, 2016, Alex Williams notifications@github.com wrote:

When I say "revisit" -- I mean revisit down the line. For now, lets see how far we can get by ourselves.

But I'm already finding myself wanting to import a lot of bits and pieces from JuliaStats for the Transformations package.

import StatsBase: logisticimmutable LogisticTransformation <: Transformation endtransform(::LogisticTransformation, x) = logistic(x)

import Distributions: Poisson """Transform a non-negative variable, x, into poisson random variable with mean x.x -> ξ, ξ ∼ Poisson(x)"""immutable PoissonTransformation <: Transformation endBase.rand{T<:NonNeg}(::PoissonTransformation, x::T) = rand(Poisson(x))transform(::PoissonTransformation, x::T) = Poisson(x)

How does this sit with you? This doesn't mean LearnBase needs to depend on StatsBase, but I think it makes sense for a lot of other packages to import functionality like this.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaML/Roadmap.jl/issues/8#issuecomment-229806902, or mute the thread https://github.com/notifications/unsubscribe/AA492hkrhV0nAr0ZrweCvA_O4wE70EbZks5qREMrgaJpZM4I_eeP .

dfdx commented 8 years ago

which way would the dependency go and why?

As far as I understand, it's all about who owns names. Let's take fit as an example. If it were defined in Julia's Base, we would probably not have this discussion, we would simply extend it - JuliaStats and JuliaML in there own way, meaning essentially the same thing - fitting a model to data. Right?

But since Julia is a general purpose language, fit doesn't sound to be a good function name to go to the Base. Yet, it might be a good idea to extract names into a 3rd package (say, StatsNames.jl or StatsModelFunctions.jl or whatever) and make both - StatsBase and LearnBase to depend on it. This would make a dependency extremely lightweight, at the same time allowing JuliaML and JuliaStats to develop in their own way as long as needed.

Regarding type hierarchies and related things, my personal opinion is that we should:

make interface as simple and as high-level as possible to make it easy to implement for new package authors;
concentrate on functions, not types.

The first point is quite self-descriptive. I should only note that after reading this discussion I find how much I would need to patch my existing packages to conform to interfaces described. For example, for restricted Boltzmann machines I need quite a lot of refactoring to implement AbstractLoss, and for naive Bayes (which is not a gradient-based model) it doesn't sound reasonable at all. Although I will try my best to follow common approaches, I believe it's unfair to ask other people to adapt to spend their time when their package is already designed and works well for their own needs.

Second point may sound irrational given excellent type-based method dispatching system provided by the Julia language. But in my experiments I found that 99% of time is spend on array operations, while static vs. runtime method dispatching gives improvement on the level of statistical error. At the same time, functions (i.e. the "verbs") show exactly what we want to achieve: we want to learn/fit a model to data, not to get a LearnableTransformation, to predict new data and not just get a LinearPredictor, transform data and not get some AbstractTransformation, to sample / get random sample from a generative model, not really carrying about RandomVariables. So if the only reason to introduce a new type (e.g. AbstractTransformation) is to improve performance by at most 1%, I would try to avoid this new type as much as possible.

tbreloff commented 8 years ago

As far as I understand, it's all about who owns names.

I agree... it's a very annoying problem that we can't just magically add to a global fit method, and that we have to both agree to import the same method.

it might be a good idea to extract names into a 3rd package

I think there would be a lot of push back on this idea, at least until JuliaML is more mature and JuliaStats people have reason to be motivated to do a refactor. Also, I actually prefer transform and learn to predict and fit... I think they're more general and expressive.

concentrate on functions, not types.

Yes we've been preaching this continually. Whatever the solutions, I want to make sure that we can implement a few "query functions" (like is_invertible, etc) for an external library and it will allow that library to "hook into" the LearnBase abstractions. I feel very strongly that we don't want to make existing packages re-design their codebase to fit our ideas... if that's required then we've failed.

This is the approach I've taken with Plots, and it's worked out very well. Build something that is my ideal vision of what people should use, and provide "link code" to "connect" existing solutions to my vision. It seems like extra work, but in fact it makes life very easy, because you don't have to fight people to switch to your way of thinking (but they will eventually ;)

for restricted Boltzmann machines I need quite a lot of refactoring to implement AbstractLoss

Really? I'd love to talk this through on another thread... it might be easier than expected, and if not you can help guide the abstractions by showing where they are lacking.

So if the only reason to introduce a new type (e.g. AbstractTransformation) is to improve performance by at most 1%, I would try to avoid this new type as much as possible.

Actually this stuff has absolutely nothing to do with performance... it has to do with being able to connect very different approaches with the same language, allowing me to, for example, add a RBM node as part of a larger graph of ensemble-like meta-models. I could care less about saving a few CPU cycles... I care about combining disparate model types in a super easy, modular way, and solving complex problems with ease.

ahwillia commented 8 years ago

Yet, it might be a good idea to extract names into a 3rd package (say, StatsNames.jl or StatsModelFunctions.jl or whatever)

+1 to this. Though I also understand why there would be pushback.

This would make a dependency extremely lightweight.

How heavy is the StatsBase dependency really? We seem to be operating under the assumption that it is very bad. I defer to others judgement on this -- but I would like to understand better. In the REPL, using StatsBase loads nearly instantaneously for me. Is the problem that it pollutes the namespace?

Presumably, JuliaStats wants StatsBase to be lightweight. If it is not lightweight and we can convince them of this, then it is possible we could make a lightweight package that lives above everything -- StatsNames.jl or something similar.

cstjean commented 8 years ago

Re. fit! coming from StatsBase or LearnBase... ScikitLearnBase uses fit! and predict, but it doesn't import StatsBase, so they are separate functions. It makes sense conceptually - for example, GaussianMixtures.jl implements both ScikitLearnBase.fit! and StatsBase.fit!, and they have different calling conventions. It's a little annoying because there are some ambiguity errors when using both, but it's not too bad overall. Either learn! or train! is a good choice for LearnBase IMO.

Evizero commented 8 years ago

It is not just about access to the name fit though. it is also an implicit contract to use it in the same way. Otherwise our package could not be considered high quality content. In StatsBase the idea of using it to fit a model is with fit{T}(modeltype::Type{T}, ...) which I think is not a good choice for us

tbreloff commented 8 years ago

it is also an implicit contract to use it in the same way.

This is a really good point. We are building a different verb, and it's appropriate to have a different name to avoid confusion.

It's a little annoying because there are some ambiguity errors when using both

To me, ambiguity warnings are more than a little annoying... they completely break the package. Here's an example of what I consider breaking. I'll call using on two modules exporting the same method. In the first, I've called (and thus compiled) A.f, and so after using B it calls A.f. In the second example I get an error. This is broken IMO.

julia> module A
       f() = "A"
       export f
       end
WARNING: replacing module A
A

julia> module B
       f() = "B"
       export f
       end
B

julia> using A

julia> f()
"A"

julia> using B
WARNING: using B.f in module Main conflicts with an existing identifier.

julia> f()
"A"

julia> 
tom@tom-office-ubuntu:~/.julia/v0.4/Plots$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.4 (2016-03-15 15:43 UTC)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |  x86_64-linux-gnu

julia> module A
       f() = "A"
       export f
       end
A

julia> module B
       f() = "B"
       export f
       end
B

julia> using A

julia> using B

julia> f()
WARNING: both B and A export "f"; uses of it in module Main must be qualified
ERROR: UndefVarError: f not defined

ahwillia commented 8 years ago

@tbreloff - I agree we shouldn't use fit, etc. unless it is imported from StatsBase

it is also an implicit contract to use it in the same way.

This is a really good point. We are building a different verb, and it's appropriate to have a different name to avoid confusion.

These differences still aren't 100% clear to me. Is it differences in semantic meaning or in the parameters of the function call?

dfdx commented 8 years ago

Also, I actually prefer transform and learn to predict and fit... I think they're more general and expressive.

Actually, I think we need both - predict and transform. For example, linear discriminant analysis may use tranform to map data to a new space and predict to classify a new point to one of classes. The same thing applies to conditional RBMs and most likely to a number of other models.

Also, transform is already defined in MultivariateStats, and I won't be surprised to see learn and train in some other popular packages. So I don't really think we will be able to get around name conflicts without synchronizing with JuliaStats (either by depending on them or moving the names somewhere else).

Really? I'd love to talk this through on another thread... it might be easier than expected, and if not you can help guide the abstractions by showing where they are lacking.

It's mostly about changes in architecture, e.g. in many parts of gradient and loss calculation I heavily rely on a context holding previous results, cached arrays and simply parameters values. It's still trivial to implement high-level methods (like fit or transform), but the more abstractions are involved, the harder it becomes to follow a guideline.

I care about combining disparate model types in a super easy, modular way, and solving complex problems with ease.

Then it's not strictly necessary to inherit from the same types. E.g. if I want to make a pipeline that accepts N distinct models, trains them, transforms data through the first (N-1) and predicts using the last model, then all I need are methods train, transform and predict defined on corresponding models. The models themselves may or may not inherit from LearningAlgorithm, StaticTransformation, AbstractOptimizer, etc. - with or without these types the code will work as long as appropriate methods are defined.

In StatsBase the idea of using it to fit a model is with fit{T}(modeltype::Type{T}, ...) which I think is not a good choice for us

Note that they also have:

fit(obj::StatsBase.StatisticalModel, data...)

while for fit! it's the only available method.

Since I haven't been closely following JuliaML discussions recently, I need to ask possibly stupid question: where's the boundary between JuliaML and JuliaStats. I clearly see that they have different "centroids", but where's decision boundary that tells where I should put a new package? If there's no clear boundary, maybe it makes sense to actually reuse StatsBase names to provide consistent (or at least similar) interface?

tbreloff commented 8 years ago

@dfdx you make some great points. I want to quickly respond to this:

where's the boundary between JuliaML and JuliaStats

Eventually the boundary will disappear (I hope!). The difference is that JuliaStats is the established organization with a big existing userbase. It's not practical to "tinker" with the abstractions and implementations of something that many would expect to be fixed and robust. Everyone in this discussion agrees that we want to improve/change the tools for "learning from data", and this organization is our place to experiment and tinker, and hopefully come up with a unified framework that encompasses all that JuliaStats does well plus much, much more.

With that said, a fundamental reason that I don't want to import abstractions and methods from StatsBase is because I want a clean slate. I don't want to be tricked into using sub-par abstractions, just because "there's already an implementation of that, lets just use it". If that ends up being true, then great, but I want total flexibility in design at the beginning.

tbreloff commented 8 years ago

All: I created a chat room for JuliaML here: https://gitter.im/JuliaML/chat Sometimes gitter is nicer for "conversations" than issues.

Evizero commented 8 years ago

A little update concerning recent developments in JuliaML.

LearnBase.jl is now registered in METADATA. We decided to extend StatsBase for now.
Losses.jl will soon be ready. @ahwillia put quite a lot of effort into it recently. We just have to iron out some test issues and then make a PR to METADATA.
Penalties.jl is making fast progress thanks to @joshday
MLMetrics.jl I took over maintenance of our metrics package and started a big refactor to improve performance. At least the classification metrics should soon be in a very good and performant state; that is binary and multiclass (not multilabel yet). To that end I also implemented averaging modes similar to scitkit-learn's metrics.
MLDataUtils.jl already has some tools for lazy data splitting, but it will further improve with the new array views in 0.5

All in all the above packages will provide us with a lot of generic utilities to base more complex and focused functionality on. I am very happy with the progress we are making. Thanks to everyone who is participating

JuliaML / META