Verbs, revisited... and much more

tbreloff commented 8 years ago

This has been discussed repeatedly, but it's important to get right if we want widespread adoption. Some references:

https://github.com/Evizero/MLModels.jl/issues/12
https://github.com/Evizero/MLModels.jl/issues/3
https://github.com/JuliaOpt/Optim.jl/pull/87
https://github.com/JuliaStats/Roadmap.jl/issues/15
https://github.com/JuliaStats/Roadmap.jl/issues/4
https://github.com/JuliaStats/Roadmap.jl/issues/20

(there are more linked in those issues, and I'm sure I missed a bunch of good conversations)

I recommend a quick skim over those discussions before commenting, if you can find the time.

What are we supporting?

It's important to remember all the various things we'd like to support with the core abstractions, so we can evaluate when a concept applies and when it doesn't:

Static transformations: log, exp, logit, ...
Aggregations: mean, variance, extrema...
Learnable transformations: regressions, neural nets, decision trees, ...
Compression and dimensionality reduction: PCA, ...
Generative models: distributions, stochastic variables, ...

And there are some opposing perspectives within these classes:

Bayesian vs Frequentist
Batch vs Online
Models producing distributions vs point estimates or classifications

All verbs need not be implemented by all transformations, but when there's potential for overlap, we should do our best to generalize.

Take in inputs, produce outputs

The generalization here is that the object knows how to produce y in y = f(x). This could be the logit function, or a previously fitted linear regression, or a decision tree. Options:

transform
~~predict~~ (taken by StatsBase)
~~map~~ (taken by Base)
apply (deprecated in Base... similar to call)
evaluate
classify (too specific)

I continue to be a fan of transform, with the caveat that we may wish to have the shorthand such that anything that can transform can be called as a functor.
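
A rough sketch of what `transform` plus the functor shorthand could look like, written in current Julia syntax. The type names (`Transformation`, `Logit`, `Affine`) are placeholders for illustration, not an agreed LearnBase API:

```julia
using LinearAlgebra  # for dot

# Hypothetical root type; the thread calls it Transformation / AbstractTransformation.
abstract type Transformation end

# A static transformation: no parameters.
struct Logit <: Transformation end
transform(::Logit, x::Real) = log(x / (1 - x))

# A previously fitted linear model: the parameters live in the object.
struct Affine <: Transformation
    w::Vector{Float64}
    b::Float64
end
transform(t::Affine, x::AbstractVector) = dot(t.w, x) + t.b

# Functor shorthand: calling any Transformation forwards to transform.
# (Adding call methods to an abstract type needs Julia 1.3 or later.)
(t::Transformation)(x) = transform(t, x)

Logit()(0.5)                         # same as transform(Logit(), 0.5)
Affine([1.0, 2.0], 0.5)([1.0, 1.0])  # same as transform(Affine(...), [1.0, 1.0])
```

With this layout a new transformation only has to implement `transform`; the call syntax comes for free.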

Generate/draw from a generative model

rand
sample
simulate
draw
generate

I think using Base.rand here is generally going to be fine, so I don't think we need this as one of our core verbs.
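
For illustration, a generative "source" can hook into `rand` by adding one method that takes an explicit RNG. `NormalSource` is a made-up type for this sketch; a real package would more likely reuse Distributions.jl:

```julia
using Random

# Hypothetical generative source: takes no input, produces a draw.
struct NormalSource
    μ::Float64
    σ::Float64
end

# One method with an explicit RNG is enough; rand(s) then falls back to
# the default RNG through the generic rand(X) method in Random.
Base.rand(rng::AbstractRNG, s::NormalSource) = s.μ + s.σ * randn(rng)

src = NormalSource(0.0, 1.0)
rand(src)                       # draw using the default RNG
rand(MersenneTwister(1), src)   # reproducible draw from a seeded RNG
```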

Use data to change the parameters of a model

learn
~~fit~~ taken by StatsBase
train
update
solve
optimize

I've started leaning towards learn, partially for the symmetry with LearnBase, but also because it is not so actively used in either stats (fit) or ML (train), and so could be argued it's more general.

I think solve/optimize should be reserved for higher-level optimization algorithms, and update could be reserved for lower-level model updating.

Types

I personally feel everything should be a Transformation, though I can see the argument that aggregations, distributions and others don't belong. A mean is a function, but really it's a CenterTransformation that uses a "mean function" to transform data.

Can a transformation take zero inputs? If that's the case, then I could argue a generative model might take zero inputs and generate an output, transforming nothing into something.

If we think of "directed graphs of transformations", then I want to be able to connect a Normal distribution into that graph... we just have the flexibility that the Normal distribution can be a "source" in the same way the input data is a "source".

With this analysis, AbstractTransformation is the core type, and we should make every attempt to avoid new types until we require them to solve a conflict.

Introspection/Traits

There are many things that we could query regarding attributes of our transformations:

does it take input data, or is it a source (i.e. a generative process)?
is it invertible?
can we take a derivative/gradient?
is there a proximal operation? (this is not my strong suit!)
can it be learned?

I would like to see these things eventually implemented as traits, but in the meantime we'll need methods to ask these questions.
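
Until real traits exist, the query methods could be as simple as the following sketch. The function and type names are assumptions made up here; the idea is only that every query has a conservative default and concrete transformations opt in:

```julia
abstract type Transformation end

# Conservative defaults: a transformation opts in to each property.
is_source(::Transformation)     = false
is_invertible(::Transformation) = false
is_learnable(::Transformation)  = false

struct LogTransform <: Transformation end
is_invertible(::LogTransform) = true

struct NormalSource <: Transformation
    μ::Float64
    σ::Float64
end
is_source(::NormalSource) = true

struct LinearModel <: Transformation
    w::Vector{Float64}
end
is_learnable(::LinearModel) = true

# Generic code can ask the questions without knowing the concrete types.
learnable_parts(ts) = filter(is_learnable, ts)
```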

Package Layout

I think we agree that LearnBase will contain the core abstractions... enough that someone can create new models/transformations/solvers without importing lots of concrete implementations of things they don't need.

We need homes for concrete implementations of:

ModelLoss (MLModels.jl)
ParameterLoss (MLModels.jl)
StaticTransformation (MLModels.jl and others)
LearnableTransformation (MLModels.jl and others)
Solvers/updaters (StochasticOptimization and DeterministicOptimization?)

StatsBase and existing abstractions

StatsBase contains a ton of assorted methods, types, and algorithms. StatsBase is too big for it to be a dependency of LearnBase (IMO), and LearnBase is too new to expect that StatsBase would depend on it. So I think we should have a package which depends on both LearnBase and StatsBase, and "links" the abstractions together when it's possible/feasible. In some cases this might be as easy as defining things like:

StatsBase.fit!(t::AbstractTransformation, args...; kw...) = LearnBase.learn!(t, args...; kw...)

What are the other packages that we should consider linking with?

cc: @Evizero @ahwillia @joshday @cstjean @andreasnoack @cmcbride @StefanKarpinski @ninjin @simonbyrne @pluskid

(If I forgot to cc someone that you think should be involved, please cc them yourself)

ahwillia commented 8 years ago

Thanks for consolidating the discussion here. Brief thoughts:

+1 for rand for sampling from a generative model.
I am fully supportive of using learn at this stage. As I've said elsewhere, I think we should push forward and hopefully converge and coordinate better with StatsBase at some stage.
Do we need StaticTransformation and LearnableTransformation to be types? Why not query it as is_static(::AbstractTransformation) and is_learnable(::AbstractTransformation)?
As I've mentioned elsewhere, I think it would be nice to implement is_invertible(...) and get_inverse(...) for Transformations.
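
A minimal sketch of the `is_invertible` / `get_inverse` pair, assuming a `transform` verb and a few toy transformations (all names here are hypothetical):

```julia
abstract type Transformation end

struct LogTransform <: Transformation end
struct ExpTransform <: Transformation end
struct SquareTransform <: Transformation end   # not invertible on the reals

transform(::LogTransform, x)    = log(x)
transform(::ExpTransform, x)    = exp(x)
transform(::SquareTransform, x) = x^2

# Default: not invertible, and asking for the inverse is an error.
is_invertible(::Transformation) = false
get_inverse(t::Transformation) =
    throw(ArgumentError("$(typeof(t)) is not invertible"))

# Invertible transformations opt in and name their inverse.
is_invertible(::LogTransform) = true
is_invertible(::ExpTransform) = true
get_inverse(::LogTransform) = ExpTransform()
get_inverse(::ExpTransform) = LogTransform()

# Round trip through a transformation and its inverse.
roundtrip(t, x) = transform(get_inverse(t), transform(t, x))
```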

datnamer commented 8 years ago

cc: @jmxpearson and @dmbates

Also for the dag and parallelism: https://github.com/MikeInnes/Flow.jl, https://github.com/JuliaParallel/Dagger.jl

joshday commented 8 years ago

+1 to just about everything. Thanks for the nice summary. My minor comments:

I would vote against using solve. Mostly because it's for solving a linear system in R, but also because it sounds more general than optimize to me.
Some transformations can be done in place and I think we should add transform!. Maybe this would only be a method for StaticTransformations? Speaking of which, I think I'd rather add the abstract types than adding is_static / is_learnable methods.
The one naming convention I'm not sold on is ParameterLoss. I can't find the discussion, but what was the reasoning against Penalty? I don't think of them as loss functions, since if they are used by themselves the argmin is 0.
For OnlineStats and SparseRegression, I'd like a LearningAlgorithm abstract type so I can dispatch on different fitting algorithms for the same model. This doesn't need to live in JuliaML. Would anyone else find this useful?
I'm super excited about all of this.
I don't want to step on toes, but I'm also eager to help. Feel free to assign issues to me.

tbreloff commented 8 years ago

I was looking for the reasoning for ParameterLoss, and came across this: https://github.com/Evizero/MLModels.jl/issues/12. I'm adding it to the first comment, as the concept of "Transformation Pipelines" is integral to why I care about this stuff. I really want to have a uniform interface to every learning module so that I can build an algorithm to optimize evolution of a directed graph of static and learnable transformations.

tbreloff commented 8 years ago

I'd like a LearningAlgorithm abstract type so I can dispatch on different fitting algorithms for the same model

Can you describe exactly what you mean by learning algorithm? Examples? How does this overlap with what we're doing in StochasticOptimization.jl?

I'm super excited about all of this.

:+1: Me too

I don't want to step on toes, but I'm also eager to help. Feel free to assign issues to me.

No toe stepping possible... we're gonna need all the help we can get. Fork the repos, and submit PRs for what you're working on. Let's all be ok with PRs getting changed or scrapped if the group doesn't agree with the direction.

ahwillia commented 8 years ago

Some transformations can be done in place and I think we should add transform!. Maybe this would only be a method for StaticTransformations? Speaking of which, I think I'd rather add the abstract types than adding is_static / is_learnable methods.

I don't have a particularly strong opinion. But would we also want InvertibleTransformation then? When is a feature implemented by a trait/query vs. something that merits a new type? My gut says to keep the number of types as low as possible until we develop further. But I trust @tbreloff and his pipeline vision on this :)

The one naming convention I'm not sold on is ParameterLoss. I can't find the discussion, but what was the reasoning against Penalty? I don't think of them as loss functions, since if they are used by themselves the argmin is 0.

I agree, but I could live with ParameterLoss for now at least.

For OnlineStats and SparseRegression, I'd like a LearningAlgorithm abstract type so I can dispatch on different fitting algorithms for the same model. This doesn't need to live in JuliaML. Would anyone else find this useful?

This is definitely necessary. We already have AbstractOptimizer, though LearningAlgorithm seems better if we are going to use learn as the verb to fit/train a model.

joshday commented 8 years ago

Is Penalty prohibitive to the idea of pipelines? @tbreloff How are you using ParameterLoss?
There are multiple algorithms to do, say, linear regression with LASSO. It would be nice to have learn methods that allow using either a proximal gradient method or coordinate descent. Whatever the syntax ends up being, I'd like to do:

learn(LinearRegression(), L1ParameterLoss(), x, y, CoordinateDescent())
learn(LinearRegression(), L1ParameterLoss(), x, y, ProximalGradient())
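
To make the dispatch idea concrete, here is a small sketch in current Julia syntax. The types mirror the names in the calls above but are defined here purely for illustration, and both solvers are textbook versions (ISTA and cyclic coordinate descent), not tuned implementations:

```julia
using LinearAlgebra

# Hypothetical names, following the call shape proposed above.
struct LinearRegression end
struct L1ParameterLoss
    λ::Float64
end

abstract type LearningAlgorithm end
struct ProximalGradient <: LearningAlgorithm
    iters::Int
end
struct CoordinateDescent <: LearningAlgorithm
    iters::Int
end

# Soft-thresholding: the proximal operator of t * |z|.
soft(z, t) = sign(z) * max(abs(z) - t, zero(z))

# Both methods minimize 0.5 * ||y - X*w||^2 + λ * ||w||_1.
function learn(::LinearRegression, p::L1ParameterLoss, X, y, alg::ProximalGradient)
    w = zeros(size(X, 2))
    step = 1 / opnorm(X)^2              # 1 / Lipschitz constant of the gradient
    for _ in 1:alg.iters
        grad = X' * (X * w - y)
        w = soft.(w .- step .* grad, step * p.λ)
    end
    return w
end

# Assumes no column of X is all zeros.
function learn(::LinearRegression, p::L1ParameterLoss, X, y, alg::CoordinateDescent)
    w = zeros(size(X, 2))
    r = y - X * w                       # current residual
    for _ in 1:alg.iters, j in axes(X, 2)
        xj = view(X, :, j)
        r .+= xj .* w[j]                # residual with coordinate j removed
        w[j] = soft(dot(xj, r), p.λ) / dot(xj, xj)
        r .-= xj .* w[j]                # put coordinate j back
    end
    return w
end
```

The model and penalty stay the same; only the last argument selects the fitting algorithm.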

When is something a trait vs. something that merits a new type? My gut says to keep the number of types as low as possible until we develop further.

Good question. Good comment. Maybe let's see where we get stuck first before adding abstract types.

tbreloff commented 8 years ago

I like LearningAlgorithm as well, assuming we're talking about the "outer loop" of a solver. Though I don't totally love ParameterUpdater, I feel like it describes its purpose really well. A LearningAlgorithm knows how to iterate through data and, at each iteration, ask a ParameterUpdater to, well, update its parameters.

I think a LearningAlgorithm knows how to optimize a connected graph of transformations with one or more Loss functions at the output(s).

During each iteration, each ParameterUpdater uses the current graph state and an aggregated history of ParameterUpdaterState in order to update its parameters.

I think LearningAlgorithms could be stacked, for example if one algo specializes in a single epoch, and another manages many epochs.
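
A toy sketch of that layering, assuming a least-squares linear model and plain SGD. `SinglePass`, `ManyEpochs`, `SGDUpdater`, and `update!` are hypothetical names chosen for this example, and the updater here keeps no history:

```julia
using LinearAlgebra

abstract type ParameterUpdater end
struct SGDUpdater <: ParameterUpdater
    η::Float64                       # learning rate
end
# Low level: apply one gradient step to the parameters in place.
update!(w, grad, u::SGDUpdater) = (w .-= u.η .* grad; w)

abstract type LearningAlgorithm end
struct SinglePass{U<:ParameterUpdater} <: LearningAlgorithm
    updater::U
end
struct ManyEpochs{A<:LearningAlgorithm} <: LearningAlgorithm
    inner::A
    epochs::Int
end

# One epoch: visit each observation and ask the updater to move w.
function learn!(w, X, y, alg::SinglePass)
    for i in axes(X, 1)
        xi = view(X, i, :)
        grad = (dot(xi, w) - y[i]) .* xi   # gradient of 0.5 * (xi'w - y[i])^2
        update!(w, grad, alg.updater)
    end
    return w
end

# Stacked algorithm: repeat the inner algorithm for several epochs.
function learn!(w, X, y, alg::ManyEpochs)
    for _ in 1:alg.epochs
        learn!(w, X, y, alg.inner)
    end
    return w
end
```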

jmxpearson commented 8 years ago

I'll reiterate something I said over lunch on Friday: I think what you ultimately want are traits, not types. It's simply too hard to subsume all statistical and ML models in a coherent type system. Such a taxonomy doesn't exist now, and you'd be peering into a crystal ball for the future. It's inherently fragile.

On the other hand, it's not too hard to ask whether a model can be sampled from, fit or optimized, return its parameters in some standard structure, make predictions, etc. This pattern has been used very successfully (IMO) in R for linear models, GLMs, and much more. Eventually, full trait support in Base may give us the ability to force this by contract, but in the meantime, if the interface is small enough, duck typing is fine.

The outcome I'd like to avoid is requiring high buy-in to a particular set of types. If JuliaML becomes a going concern, people may look to LearnBase for guidance in design, but if I'm a package author and the type system makes no sense in my problem domain, I am going to choose the best model for me and my users. The LearnBase interface will work best, I think, if it is reasonably easy to opt in to post hoc.

tbreloff commented 8 years ago

Is Penalty prohibitive to the idea of pipelines?

Penalty and ParameterLoss refer to the same thing. I like the symmetry that there are "loss components" and "loss functions". To be honest, I think Cost makes a little more sense, but I remember @Evizero giving some reason which convinced me to go with Loss (don't remember what). My goal is simply to keep simple themes and symmetry, so that someone in another field will still "get it".

The LearnBase interface will work best, I think, if it is reasonably easy to opt in to post hoc.

This. I want absolutely minimal typing. I really want traits, and I think the next best thing is dispatching on "queries": is_invertible, is_learnable, etc.

joshday commented 8 years ago

My goal is simply to keep simple themes and symmetry, so that someone in another field will still "get it".

I think that's my issue with ParameterLoss. Many problems are constructed as f(param) + g(param) where f is "loss" and g is "penalty". Both are functions of the parameter and it may be unclear which is called ParameterLoss.

ahwillia commented 8 years ago

To further @joshday's point -- ModelLoss could also be misinterpreted as a loss or penalty on the model complexity, whereas what we mean is something like PredictiveLoss (i.e. a loss on model performance).

In general, I think Loss is understood to apply to performance while Penalty (or maybe Regularizer) is a term that penalizes model complexity. I think we should also consider adding Constraint. For example, non-negative matrix factorization might look like:

learn(QuadraticLoss(), NonNegativeConstraint(), data, (W, H), ALS())
learn(QuadraticLoss(), NonNegativeConstraint(), data, (W, H), SPA())

In the spirit of having as few types as possible, maybe we could start with just a single type -- AbstractLoss.

tbreloff commented 8 years ago

Great. I'm happy to attempt to subtype directly from AbstractLoss and see how far we can get. I also wonder if we can accept splatted losses in the 'learn!' method:

learn!(t::Transformation, input_data, losses::AbstractLoss...)

Is there a more general term than loss that would include predictive loss/cost, penalties/regularizers, and constraints?

Evizero commented 8 years ago

Concerning the naming of the losses: I started out with Loss and Penalty, but we went in this direction because we then decided to go with descriptive names. I am open to changing it back.

We absolutely need a base class for algorithms. Right now I call it Optimizer in the placeholder code. I do not like the idea of making a ParameterUpdater the most low-level thing, because not every algorithm lends itself to that dogma. Especially with many SVM algorithms I do not want to split the algorithm up into several types and pieces for no good reason. I get that it might be convenient in cases such as NeuralNetworks, and I am not against it, but on a basic level I think an Optimizer that more or less is representative of the whole learning algorithm would be the most general approach.

There are (I think) important unsolved design decisions I would like to bring into the discussion concerning LearnableTransforms (or whatever they will be called) and mutating operations: what do we want them to look like?

Let us start with something simple, a simple linear predictor (let's call it LinearTransform) as it is used in ridge regression or lasso etc. Is that a LearnableTransform? Intuition would say yes, but that would imply it needs the coefficients/parameters that are to be learned in its structure (for it to make sense in transform(linear_transform, data)).

# just some pseudo definition to show what i mean
type LinearTransform{T} <: LearnableTransform
   parameters::T
end

So far so good, but would we want to return this to a user who wants to fit a linear regression? I say no, because that user may expect more information, such as the data used to fit the model, maybe even p-values. So should there be something like a Model for stuff like LinearRegressor or SVM? Probably not, because in a way those would also just be LearnableTransforms: LearnableTransforms containing more basic LearnableTransforms (i.e. somewhat reminiscent of the Decorator pattern).

type LinearRegressor <: LearnableTransform
    linear_transform::LinearTransform
    X_train::Matrix
    p_values::Vector
    # ...
end

I think this makes sense and is pretty extensible, but there are more questions still. How does one begin to learn such a thing? Does a LearnableTransform have to be preallocated before train?

lt = MyLearnableTransform()
train!(lt, ...)

This may work for a lot of cases, but it would be very dirty. Why dirty? Well, if we consider LinearRegressor from above, we see that it contains the training data it was fit on. The only way this would work here is if this member variable is abstract (aka the value is boxed), because we don't know the type of the training data when allocating the transform. By definition this would make our whole design slower than it needs to be and very vulnerable to type instability.

Furthermore, it would not be a good choice for SVMs at all. For SVMs it is much nicer if the way a model is specified is separate from where a model is instantiated. I explored the idea of having "Specification" structures that just describe what one wants to learn, without containing any of the output parameters:

train(CSVM(HingeLoss(), L2Penalty(), C = .1), X, y)

The problem here is that now we would have two type trees: one for the specifying structure and one for the trained model (aka LearnableTransform).
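
A minimal sketch of the specification idea, using ridge regression instead of an SVM to keep it short. `RidgeSpec` and `FittedLinear` are invented for this example. Because the fitted model is only constructed inside `train`, its fields can be concretely typed from the data, which sidesteps the boxed-field problem described above:

```julia
using LinearAlgebra

# Specification: describes what to learn, holds no learned parameters.
struct RidgeSpec
    λ::Float64
end

# Trained model: field types are fixed by the data seen in train.
struct FittedLinear{T<:AbstractFloat, M<:AbstractMatrix{T}}
    coef::Vector{T}
    X_train::M
end

function train(spec::RidgeSpec, X::AbstractMatrix{T}, y::AbstractVector{T}) where {T<:AbstractFloat}
    # Closed-form ridge solution: (X'X + λI) \ X'y
    coef = (X' * X + T(spec.λ) * I) \ (X' * y)
    return FittedLinear(coef, X)
end

transform(m::FittedLinear, X) = X * m.coef
```

The cost is exactly the one noted above: there are now two families of types, one for specifications and one for trained models.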

Another thing one could do is specify which type one expects to be returned and use this as dispatch:

# SVM is a type
train(SVM, HingeLoss(), L2Penalty(), ...)

However, this will surely result in complicated function signatures for train, as now train has to take care of default values, checking whether the parameters make sense, etc.

Evizero commented 8 years ago

In the spirit of having as few types as possible, maybe we could start with just a single type AbstractLoss

How do you mean? If one wants to be able to dispatch on regression vs classification (which I feel very strongly about) there absolutely must be a MarginLoss and DistanceLoss distinction.

joshday commented 8 years ago

MarginLoss and DistanceLoss could live in MLModels. I want to dispatch on these as well, but not everyone will.

tbreloff commented 8 years ago

This is why I want traits. Margin vs distance should be a trait of the loss. Likewise I could see Penalty as a trait. The difference between these different loss components is more trait-like than type-like (IMO)

Evizero commented 8 years ago

We can put MarginLoss and DistanceLoss into MLModels, but then we should revisit this issue: https://github.com/JuliaML/LearnBase.jl/issues/1

Evizero commented 8 years ago

Margin vs distance should be a trait of the loss.

I disagree there. This to me is one of the few situations where a type tree makes perfect sense. For example, while a general supervised loss can be described as L(y, yhat), a Distance loss can be defined solely as a function of L(residual) where L(y, yhat) = L(|y - yhat|)
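
As a sketch of the type-tree argument (names are illustrative, and the distance loss here is written as a function of the signed residual `yhat - y`, which is an assumption of this example): each branch reduces `L(y, yhat)` to a one-argument form, and code written against the common supertype is shared by both.

```julia
abstract type SupervisedLoss end
abstract type DistanceLoss <: SupervisedLoss end   # depends on the residual
abstract type MarginLoss   <: SupervisedLoss end   # depends on the agreement y * yhat

# Each branch reduces L(y, yhat) to its simpler one-argument form.
value(l::DistanceLoss, y, yhat) = value(l, yhat - y)
value(l::MarginLoss,   y, yhat) = value(l, y * yhat)

struct L2DistLoss <: DistanceLoss end
value(::L2DistLoss, r) = abs2(r)

struct HingeLoss <: MarginLoss end
value(::HingeLoss, a) = max(zero(a), one(a) - a)

# Shared code lives on the common supertype.
meanvalue(l::SupervisedLoss, ys, yhats) =
    sum(value(l, y, yhat) for (y, yhat) in zip(ys, yhats)) / length(ys)
```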

tbreloff commented 8 years ago

a Distance loss can be defined solely as a function of L(residual) where L(y, yhat) = L(|y - yhat|)

I have a feeling this is where we're going to have to agree to disagree. This example seems like the perfect example of why "distance vs margin" should be a trait... they're basically the same, except one can be reduced to a simpler form.

Let's forget about losses for a second and think about a hypothetical. I have shapes that I wish to classify, and each shape has a color. In my work, I only ever deal with circles, and they are only ever red or blue, so I come up with the nice type tree:

abstract Circle
    abstract RedCircle
    abstract BlueCircle

... and this is super useful, because I can dispatch many functions on the abstract Circle, and more specific ones on red/blue.

However, after I've built out all this functionality, someone strolls up and says "Can I use your library? I need to work with both circles and rectangles, but they're always blue." So you rebuild your type tree and refactor your method definitions:

abstract Shape
    abstract Circle
        abstract RedCircle
        abstract BlueCircle
    abstract BlueRectangle

... and now everyone's happy, until someone posts an issue "I would really like a type that represents all green shapes." Crap.

This is a stupid example, but I hope it's revealing that even if you personally think a type tree makes sense, there's a good chance it's not going to make sense for everyone. It would be so much cleaner if there was a single abstract type with traits:

abstract Shape
trait ShapeColor
trait ShapeShape

Then you would write your methods to dispatch only on those traits which change the underlying functionality.
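
Julia has no `trait` keyword, but the shape example can be approximated with the "Holy traits" pattern: a function maps each shape to a trait value, and methods dispatch on that value. A sketch with made-up names (`shapecolor`, `paint_cost`):

```julia
abstract type Shape end

# Trait values for color; new colors never touch the Shape hierarchy.
abstract type ShapeColor end
struct Red   <: ShapeColor end
struct Blue  <: ShapeColor end
struct Green <: ShapeColor end

# Every concrete shape subtypes Shape directly.
struct RedCircle     <: Shape end
struct BlueCircle    <: Shape end
struct BlueRectangle <: Shape end
struct GreenTriangle <: Shape end

# The trait function declares each shape's color.
shapecolor(::RedCircle)     = Red()
shapecolor(::BlueCircle)    = Blue()
shapecolor(::BlueRectangle) = Blue()
shapecolor(::GreenTriangle) = Green()

# Entry point re-dispatches on the trait value instead of a type-tree branch.
paint_cost(s::Shape) = paint_cost(shapecolor(s), s)
paint_cost(::Green, s::Shape)      = 3.0   # "all green shapes" in one method
paint_cost(::ShapeColor, s::Shape) = 1.0   # every other color
```

Since the trait function returns a value determined by the argument type, the compiler can usually resolve the second dispatch statically.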

@timholy... please help us realize this model!! :)

ahwillia commented 8 years ago

Re: AbstractLoss

How do you mean? If one wants to be able to dispatch on regression vs classification (which I feel very strongly about) there absolutely must be a MarginLoss and DistanceLoss distinction.

I was thinking Loss would only be needed by the optimizer. To combine multiple losses:

losses = (QuadraticLoss, L1Penalty)
optimizer = ADMM()
learn(LinearPredictor(),x,y,losses,optimizer)

I actually have code that roughly works like this already: https://github.com/ahwillia/ProxAlgs.jl/blob/master/examples/lasso.jl

Can you give a more explicit example where dispatch on the Loss is necessary?

Evizero commented 8 years ago

@tbreloff I don't think the color example fits, because while red and blue circles are conceptually different, they differ by a property value and not by structure. The rectangle vs circle case is more fitting since they are structurally different, but it is also a case where I would probably not use a common base class but a common interface instead. Why? Because they probably don't share actual functionality, only signatures.

Margin- vs distance-based losses have a different structure, and they do not overlap other than in the common properties of a supervised loss. In other words, they have a clear "is-a" relationship with "supervised loss". A margin-based loss has all the properties of a supervised loss plus more. A distance-based loss has all the properties of a supervised loss plus more. The "more" of both do not overlap. And most importantly, quite a lot of code can be covered by the common base class, which means that in contrast to the shape example they do share a lot of common code.

That said, if we find a good way to realize the same functionality and performance that the losses have now in a different way that is not more cumbersome to apply then I would be ok with that, but let me ask you this: if this is not a scenario for inheritance, what would be an example of one?

Evizero commented 8 years ago

@ahwillia I am not sure I understand. Please correct me if I misinterpret

What are you gaining by introducing a tuple?
If there is no distinction between ModelLoss and ParameterLoss, what happens if I switch the losses around in your example (i.e. in the tuple)? It would still be legal code that compiles.
- ... if the optimizer uses something like ispenalty(loss) to decide what the penalty is, then it probably wouldn't make good use of dispatch.
Concerning Margin vs Distance: following my example from above with LinearRegressor, how would you decide whether your code snippet should return a LinearRegressor or a LinearClassifier, without poisoning the type inference?

Evizero commented 8 years ago

Can you give a more explicit example where dispatch on the Loss is necessary?

@ahwillia Your code snippet is actually a decent example to me. I would want that code to return a structure that understands what it does, i.e. depending on the loss I give it (with everything else the same) that structure is either a classifier or a regressor.

Also at some UI level I would like the interface to make reasonable default domain decisions. Let's say I want a nice high-level UI for SVM. I would like something along the lines of the following code to work somehow without poisoning the type inference.

svm(X, y, LinearKernel(), EpsilonInsLoss(.2), L2Penalty(.1))

Now depending on what I chose as a loss I will get an SVM classifier or an SVM regressor. Depending on the combination of kernel, loss type, and penalty I will get an optimizer that makes sense. And it should all work seamlessly even if MLModels gets a new loss implemented that the SVM library does not need to know about.
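
A stripped-down sketch of how dispatch on the loss family could pick the result type without runtime branching. No actual SVM training happens here; `svm_result` and the result types are placeholders invented for the example:

```julia
abstract type SupervisedLoss end
abstract type MarginLoss   <: SupervisedLoss end
abstract type DistanceLoss <: SupervisedLoss end

struct HingeLoss <: MarginLoss end
struct EpsilonInsLoss <: DistanceLoss
    ϵ::Float64
end

# Hypothetical result types; a real SVM would also store its coefficients.
struct SVMClassifier{L<:MarginLoss}
    loss::L
end
struct SVMRegressor{L<:DistanceLoss}
    loss::L
end

# The loss family selects the return type, so each call is type-stable.
svm_result(loss::MarginLoss)   = SVMClassifier(loss)
svm_result(loss::DistanceLoss) = SVMRegressor(loss)

# A loss added later by another package works without changes here.
struct MyNewMarginLoss <: MarginLoss end
```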

Evizero commented 8 years ago

Another comment to ModelLoss vs ParameterLoss: Personally, I don't think that a penalty should be considered a Loss at all. I'd actually prefer to keep closer to the theory behind it by Ingo Steinwart et al. which is pretty clean, clear, and comprehensive (In which a supervised loss is a function L(y, yhat)). A penalty is in my mind just that: "a penalty". It is a function of the coefficient vector and thus very special to the coefficient based model concept and doesn't really need more of a baseclass than Penalty (while the supervised loss has no idea how yhat was produced and is much more general)

ahwillia commented 8 years ago

Each Loss in the tuple would be a term in the objective function. This is a natural way to specify an optimization problem for consensus ADMM -- maybe less so for other optimizers (?). Basically, the optimizer just needs to know grad(...) and prox(...) on each term in the objective function. It doesn't care about whether the term is a penalty on model performance or on complexity. (By the way I like PerformanceLoss instead of ModelLoss.)

Brainstorming here, we could consider overloading +, - to create new losses.

model = LinearPredictor(X,y)
objective = QuadraticLoss() + L1Penalty()
learn!(model,objective)

# ... brainstorming ...
objective <: CompositeLoss
typealias CompositeLoss Array{Loss,1}
+(a::Loss,b::Loss) = [a,b]
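
The same brainstorm in current Julia syntax, as a sketch only. Here every term receives `(w, X, y)` and uses what it needs, which is one assumed way to treat predictive losses and penalties uniformly; the names are not an agreed API:

```julia
abstract type Loss end

struct QuadraticLoss <: Loss end
struct L1Penalty <: Loss
    λ::Float64
end

# A sum of objective terms.
struct CompositeLoss <: Loss
    terms::Vector{Loss}
end

# Flatten nested sums so a + b + c yields a single three-term objective.
terms(l::Loss) = Loss[l]
terms(c::CompositeLoss) = c.terms
Base.:+(a::Loss, b::Loss) = CompositeLoss(vcat(terms(a), terms(b)))

# Every term sees (w, X, y) and ignores what it does not need.
value(::QuadraticLoss, w, X, y) = 0.5 * sum(abs2, X * w - y)
value(p::L1Penalty, w, X, y)    = p.λ * sum(abs, w)
value(c::CompositeLoss, w, X, y) = sum(value(t, w, X, y) for t in c.terms)

objective = QuadraticLoss() + L1Penalty(0.5)
```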

Evizero commented 8 years ago

Oh I see. Yes in terms of thinking about the problem purely as an optimization problem that does make sense.

Well, from a programming perspective, though, this seems like a difficult abstraction to implement at a low level. A ModelLoss is a function f(y, yhat) while a ParameterLoss is a function f(w). How could one treat them equally?

ahwillia commented 8 years ago

What makes it tricky is online/stochastic optimization. For offline optimization, both can be thought of as f(w). From my ProxAlgs prototype package (here A => X and B => Y in our conversation):

type CachedLeastSquares <: Penalty
    loss::Function
    AtA_ρI::Base.LinAlg.LU # LU factorization of AtA + ρ*I
    AtB::Matrix # Cached A'*B
    ρ::Real
    function CachedLeastSquares(A::Matrix,B::Matrix,ρ::Real)
        AtA_ρI_ = lufact(A'*A + (1/ρ)*eye(size(A,2)))
        f(x::Vector) = 0.5 * sum(abs2, A*x - B)  # half the sum of squared residuals
        return new(f,AtA_ρI_,A'*B,ρ)
    end
end

Basically, when you create the loss, you pass the whole dataset to it. This works really well so long as the data isn't astronomically large (which is an important case).

tbreloff commented 8 years ago

A penalty is in my mind just that: "a penalty".

I can't disagree with this, but what I'd like is to abstract that these are just components of an objective function.

I like PerformanceLoss, which gets closer to the truth, but I'm still not sold that we need those branches of a type tree in order to dispatch correctly. If we can dispatch on the trait instead of the type, I think it opens up to cleaner abstractions (like in my shape example).

I want to review the current options for doing traits in julia, and I'll put together a prototype if I come up with something I like.

tbreloff commented 8 years ago

Something I haven't really focused on, but which is important to me: I want to be able to frame reinforcement learning problems in this paradigm as well, which can be thought of like:

Learning to maximize a perpetual sum of discounted future rewards (or minimize the negative rewards)
Transforming (state+input) into (action)
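
For concreteness, the discounted sum over a finite list of observed rewards can be computed as below. Truncating to a finite horizon and the name `discounted_return` are assumptions of this sketch; negating the result gives a quantity to minimize.

```julia
# Sum of γ^(t-1) * r_t over the observed rewards r_1, r_2, ...
function discounted_return(rewards, γ)
    total, discount = 0.0, 1.0
    for r in rewards
        total += discount * r
        discount *= γ
    end
    return total
end

# Negated return, so that "learning" can always mean minimizing.
discounted_loss(rewards, γ) = -discounted_return(rewards, γ)
```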

I think adding this to the mix helps me to expand the abstractions:

Transformations

A Transformation should know what type of input it can receive, what its parameters are, what its outputs are, and how to update its parameters. Some notes:

A static transformation is nothing special... it just has no parameters, and thus updating its parameters is a no-op. (is_static may be a trait)
A transformation may have a generative/stochastic component (is_generative may be a trait)
A transformation may take no inputs (is_source may be a trait)
A transformation may output probabilities, classifications, actions, or something else... (output_type may be a trait, and the value could be subtypes of Transformed)
A transformation may contain sub-transformations. For example, a neural net will contain layers, and a layer will contain an affine transformation and a nonlinear activation function. (is_leaf may be a trait)
...
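
As a rough illustration of the trait idea in these notes, here is a minimal sketch of the "Holy traits" pattern, where a query function returns a singleton type and methods dispatch on it. It is written in current Julia syntax (`abstract type`, `struct`) rather than the syntax in use when this thread was written. All names (`StaticTrait`, `LearnableTrait`, `learnability`, `Affine`, `LogTransform`, `update!`) are hypothetical and not taken from any existing package.

```julia
abstract type Transformation end

# Trait values are singleton types that methods can dispatch on.
struct StaticTrait end
struct LearnableTrait end

# Two toy transformations that both subtype Transformation directly.
struct LogTransform <: Transformation end
mutable struct Affine <: Transformation
    w::Vector{Float64}
    b::Float64
end

# Trait query: transformations are learnable unless they opt out.
learnability(::Transformation) = LearnableTrait()
learnability(::LogTransform) = StaticTrait()
is_static(t::Transformation) = learnability(t) isa StaticTrait

transform(::LogTransform, x) = log.(x)
transform(t::Affine, x::AbstractVector) = sum(t.w .* x) + t.b

# update! dispatches on the trait, not on a branch of the type tree.
update!(t::Transformation, grad, lr) = update!(learnability(t), t, grad, lr)
update!(::StaticTrait, t, grad, lr) = t          # no parameters: no-op
function update!(::LearnableTrait, t::Affine, grad, lr)
    t.w .-= lr .* grad                           # plain gradient step
    return t
end
```

With this layout a static transformation needs no special branch in the type tree: its update is simply a no-op selected by the trait.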

Objectives

This is the goal of a learning routine. One is always minimizing or maximizing something (right?), and we don't lose generality if we assume we can negate the objective and always minimize.

An Optimizer/LearningAlgorithm should combine a directed graph of transformations with an objective function and update the Parameters/LearnableParameters in the transformation
An ObjectiveFunction is a function of a list of ObjectiveComponent/Loss
- for transformation t, parameters w, inputs x, and targets y: L(t, x, y, w) = ||t(x) - y||^2 + λ||w||^2 == g(x,y) + h(w)
- thus we can create an ObjectiveFunction that "contains" DistanceLoss and Penalty
- An ideal implementation of ObjectiveFunction would allow for arbitrary functions of ObjectiveComponents
- As mentioned above, DiscountedFutureRewards could be a potential ObjectiveComponent
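
The decomposition L(t, x, y, w) = g(x,y) + h(w) could be sketched roughly as follows. This is only an illustration in current Julia syntax: `L2Distance`, `L2Penalty`, the `value` signatures, and the convention that a transformation is a callable `t(w, x)` are all assumptions made for the example, and the objective here is restricted to a plain sum of components.

```julia
abstract type ObjectiveComponent end

# g: squared distance between predictions and targets.
struct L2Distance <: ObjectiveComponent end
# h: squared-norm penalty on the parameters, scaled by λ.
struct L2Penalty <: ObjectiveComponent
    λ::Float64
end

# Each component picks out the arguments it cares about.
value(::L2Distance, yhat, y, w) = sum(abs2, yhat .- y)
value(p::L2Penalty, yhat, y, w) = p.λ * sum(abs2, w)

# Here the objective is a plain sum; arbitrary functions of
# components would need a richer representation.
struct ObjectiveFunction{T<:Tuple}
    components::T
end
ObjectiveFunction(cs::ObjectiveComponent...) = ObjectiveFunction(cs)

# Evaluate L(t, x, y, w) for a transformation given as a callable t(w, x).
function value(obj::ObjectiveFunction, t, x, y, w)
    yhat = t(w, x)
    return sum(c -> value(c, yhat, y, w), obj.components)
end
```

For example, with `linear(w, x) = x * w`, calling `value(ObjectiveFunction(L2Distance(), L2Penalty(0.5)), linear, x, y, w)` evaluates the ridge-style objective ||x*w - y||^2 + 0.5||w||^2.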

So if we no longer think of ModelLoss/ParameterLoss as "losses", and instead think of them as "objective function components", then I feel better about the DistanceLoss/MarginLoss/Penalty types. One thing to keep in mind is that the Penalties should probably be closely tied to the transformations, not to the objective function directly. (Imagine the example of a neural net with different penalties on different layers.)

Summary

All transformations would directly subtype abstract Transformation, and would be identified by their traits.
Traits will be used in dispatch, so that we have the same benefits as if we had a fully specified type tree
A LearningAlgorithm knows how to update a Transformation to minimize an ObjectiveFunction
Loss, Penalty, and similar types are subtypes of abstract ObjectiveComponent, and an ObjectiveFunction represents a function of one or more ObjectiveComponents

Evizero commented 8 years ago

@ahwillia I like this very much and I think it should exist, but it should not be the "lowest level" of our implementation for a Loss. A supervised loss itself just cares about y and yhat, nothing else, no data, no prediction model, nothing. I think that is very important.

But it could make a lot of sense to create this CachedLeastSquares which internally utilizes a L2DistLoss, or more generally a CachedLoss which utilizes some Loss. I have actually explored something similar to this here (this was my attempt to work towards something that is easy to use in combination with Optim), but without splitting the penalties in two parts, which I agree would be very powerful and important to implement the new popular algorithms.
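
The layering described here might look roughly like the sketch below: a lowest-level loss that only sees `y` and `yhat`, and a higher-level wrapper that binds it to a dataset. It uses current Julia syntax. `L2DistLoss` and `CachedLoss` are the names from this comment, but the fields, the `(y, yhat)` argument order, the linear prediction `X*w`, and the `value`/`deriv`/`grad` signatures are assumptions made for illustration.

```julia
abstract type Loss end
struct L2DistLoss <: Loss end

# Lowest level: the loss only ever sees a target and a prediction.
value(::L2DistLoss, y::Number, yhat::Number) = abs2(yhat - y)
deriv(::L2DistLoss, y::Number, yhat::Number) = 2 * (yhat - y)

# Higher level: bind a loss to a fixed dataset so it becomes f(w).
struct CachedLoss{L<:Loss}
    loss::L
    X::Matrix{Float64}   # one observation per row
    y::Vector{Float64}
end

# Total loss of the linear prediction X*w over the cached data.
function value(c::CachedLoss, w::AbstractVector)
    yhat = c.X * w
    return sum(value(c.loss, c.y[i], yhat[i]) for i in eachindex(c.y))
end

# Gradient with respect to w via the chain rule: X' * dL/dyhat.
function grad(c::CachedLoss, w::AbstractVector)
    yhat = c.X * w
    return c.X' * [deriv(c.loss, c.y[i], yhat[i]) for i in eachindex(c.y)]
end
```

The point of the split is that `L2DistLoss` stays free of data and models, while `CachedLoss` is the piece an optimizer would consume.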

ahwillia commented 8 years ago

I like a lot of this. A few nitpicks.

A transformation may have a generative/stochastic component (is_generative may be a trait)
A transformation may take no inputs (is_source may be a trait)

This is a bit strange to me. A stochastic transformation taking no inputs does not seem like a transformation to me at all. If we do keep this, then I propose the following tweaks:

To determine whether there are inputs: ~~is_source~~ ... instead use is_generative
To determine whether stochastic: is_stochastic

@Evizero - I agree 100%

Evizero commented 8 years ago

I want to review the current options for doing traits in julia, and I'll put together a prototype if I come up with something I like

@tbreloff +100. Type instability and messy code (both in the package as well as on the user side) are my main concerns, and pretty much the reason why I am so in favour of the Loss type tree. If there is a truly good way to do it differently, I am all ears.

tbreloff commented 8 years ago

A few nitpicks ... A stochastic transformation taking no inputs does not seem like a transformation to me at all.

As I said earlier, I would consider it "transforming nothing into something". That's kinda cheating, but worth it if we can unify everything under a single abstraction.

I propose the following tweaks

The concept I'd like to separate... can we generate output values stochastically? This would include distributions, but also transformations that have a stochastic component (for example an affine transformation plus normally-distributed error). I suppose a bias could be considered generative (no inputs but outputs a 1), but not stochastic. Does stochastic imply generative, but not the other way around?

ahwillia commented 8 years ago

Stochastic does not imply generative to me. Y = A*X + randn(size(Y)) is a stochastic linear transformation of X -> Y, but it is not generative since I need to tell you X.
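
The distinction can be made concrete with a small sketch in current Julia syntax. The function names (`noisy_linear`, `bias`, `standard_normal`) are hypothetical, and passing an explicit random number generator is a choice made here only to keep the example reproducible.

```julia
using Random

# Stochastic but not generative: needs an input X, then adds Gaussian noise.
noisy_linear(rng::AbstractRNG, A, X) = A * X .+ randn(rng, size(A, 1), size(X, 2))

# Generative but not stochastic: no inputs, always outputs a 1.
bias() = 1.0

# Generative and stochastic: a draw that needs no input at all.
standard_normal(rng::AbstractRNG) = randn(rng)
```

Under this reading the two properties are independent: one says whether inputs are required, the other whether repeated calls can give different outputs.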

One thing we are missing, which I think is crucial, is an abstract type for a RandomVariable. Combining this with Transformation would more or less give us a full graphical framework to specify any model (https://en.wikipedia.org/wiki/Graphical_model). The RandomVariables would be the "nodes" of the graph, while the Transformations are the (directed) edges.

I think this would tidy up our confusion about generative vs stochastic transformations. This would be my counter-proposal:

All Transformations would be deterministic
All Transformations take RandomVariable types as both inputs and outputs. All Transformations must have at least some inputs and outputs.
A RandomVariable node can inject noise/stochasticity into the model, which is how we realize a stochastic model/transformation. We could also have constant variables, e.g. hyperparameters.

This is, of course, not a new proposal on my part. http://pgm.stanford.edu/

I can strongly advocate that this is the most general framework we could adopt, if that's what we're going for.

tbreloff commented 8 years ago

Alex, I give a tentative thumbs up. I need to think this through and review PGM theory before I agree that's what we want.

ahwillia commented 8 years ago

I need to think about it too. Maybe we could get @jmxpearson to chime in here, since I think we're starting to converge on something similar to what he implemented in VinDsl.jl

Also see: http://people.csail.mit.edu/dhlin/jubayes/julia_bayes_inference.pdf

And how models are specified in Lora.jl and Mamba.jl: http://mambajl.readthedocs.io/en/latest/tutorial.html#bayesian-linear-regression-model

The goal is to provide a unified API to all of these "backends" right?

jmxpearson commented 8 years ago

As the Stan development team often points out, there are useful probabilistic models that are not graphical. Those models are, however, often defined by objective functions to be optimized.

I think what we've got in VinDsl was overengineered. It's in drastic need of simplifying. Once I realized that the goal of much ML research is to violate the assumptions of previous models in order to improve performance, I gave up on providing a general framework. You can get 80% of the way there in my case with convenience methods.

cstjean commented 8 years ago

@jmxpearson Sorry if this is off-topic everyone, but what are those non-graphical probabilistic models? Non-parametric models?

jmxpearson commented 8 years ago

I'm not sure what they have in mind, but Markov Random Fields are undirected graphs and so lack the normal parent/child structure.

More generally, you could have p(x) = exp(L(x))/Z and there's no guarantee that L(x) could be naturally decomposed into a sum of terms corresponding to links in a PGM. For Stan, and for optimization approaches, all you need is L, which isn't required to have a particular decomposition.
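
To illustrate the last point, the sketch below finds a mode of p(x) proportional to exp(L(x)) using only evaluations of L, so the normalizer Z never appears. It is a toy example in current Julia syntax: the particular quadratic `L`, the finite-difference helper `fd_gradient`, and the plain gradient-ascent loop `find_mode` are all hypothetical choices for illustration, not how Stan or any other package works internally.

```julia
# Toy unnormalized log-density; any function returning a scalar would do.
L(x) = -(x[1] - 1)^2 - 2 * (x[2] + 2)^2 - x[1] * x[2]

# Central finite-difference gradient, so only evaluations of L are needed.
function fd_gradient(f, x; h = 1e-6)
    g = similar(x)
    for i in eachindex(x)
        e = zeros(length(x))
        e[i] = h
        g[i] = (f(x .+ e) - f(x .- e)) / (2h)
    end
    return g
end

# Gradient ascent towards a mode of p(x) ∝ exp(L(x)); Z is never computed.
function find_mode(f, x0; step = 0.05, iters = 2000)
    x = copy(x0)
    for _ in 1:iters
        x .+= step .* fd_gradient(f, x)
    end
    return x
end
```

Calling `find_mode(L, [0.0, 0.0])` treats `L` as a black box, which is the sense in which no particular decomposition is required.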

tbreloff commented 8 years ago

I think there are a lot of conceptual overlaps between what I suggested and PGM, Mamba, and others. That's a good thing, and is encouraging, since it would be great if there's a path to unifying approaches. I'm not convinced on the Transformation vs RandomVariable breakdown though, mainly for implementation/interface reasons.

All Transformations would be deterministic

This means that a neural net, or regression, or decision tree (or pretty much anything useful) would no longer be a Transformation. It also wouldn't be a RandomVariable.

My counter-counter-proposal is that we use the (existing) subtype abstract Mapping to define deterministic functions (activation functions, log/exp, etc) from input(s) to output(s). I like your idea of a random variable as generator, but I think we still want it to be a subtype of a transformation so that it can be part of the graph.

The difference between the approaches is subtle, but I think important. You're thinking about a graph as data/variables/parameters (nodes) connected by deterministic functions (edges), where I'm thinking of transformations (nodes) connected together, where the edges only describe "model flow"... the edges aren't an object of any kind.

These are equivalent theoretically, but it's been my experience with OnlineAI that it's easier and more efficient to be able to "clump together" the full model specification (input + do_something + output) into a logical graph node. Sub-components (sub-transformations) are similarly clumped. This way a "node" can actually represent an entire sub-graph which is entirely valid and specified.
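
A rough sketch of this "clumping" in current Julia syntax: a chain of transformations is itself a transformation, so one node can stand for a whole sub-graph, and the edges are implicit in the order of evaluation. The names `Chain`, `Affine`, `Sigmoid`, `layer`, and `transform` are used hypothetically here and are not the OnlineAI implementation.

```julia
abstract type Transformation end

struct Affine <: Transformation
    W::Matrix{Float64}
    b::Vector{Float64}
end
struct Sigmoid <: Transformation end

# A Chain is itself a Transformation, so a node can hold a whole sub-graph.
struct Chain{T<:Tuple} <: Transformation
    nodes::T
end
Chain(nodes::Transformation...) = Chain(nodes)

transform(t::Affine, x) = t.W * x .+ t.b
transform(::Sigmoid, x) = 1 ./ (1 .+ exp.(-x))
# Edges are implicit: each node's output is fed to the next node.
transform(c::Chain, x) = foldl((h, node) -> transform(node, h), c.nodes; init = x)

# A layer clumps an affine map and its activation into one node.
layer(W, b) = Chain(Affine(W, b), Sigmoid())
```

A network can then be written as `Chain(layer(W1, b1), Affine(W2, b2))`, where the first node is a complete sub-graph in its own right.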

I should note that this "clumping together" is not always how people approach the problem, and in OnlineAI I ended up deriving the full backprop formulas to accommodate. In the end it was much simpler math, and the implementation was also simpler with less "leakage" (i.e. nodes could be computed without much knowledge about the surrounding connections).

cstjean commented 8 years ago

The big challenge with regard to PGMs is that in general they are structured models. The scikit-learn project made an explicit decision not to support them because the training set cannot be treated as one big uniform matrix. It's not insurmountable, but we'd have to think about it carefully.

ahwillia commented 8 years ago

Another problem with the PGM approach -- exactly representing arbitrary random variables undergoing arbitrary transformations often becomes intractable rather quickly. This is why sampling is so widespread.

I suggest that we keep it simple and start playing around with a prototype. For me, the biggest win will be to have a single repository that stores loss functions along with derivatives and prox operators. So I vote that we start building that. (Is this going to be called MLModels.jl?)

cstjean commented 8 years ago

For me, the biggest win will be to have a single repository that stores loss functions along with derivatives and prox operators. So I vote that we start building that. (Is this going to be called MLModels.jl?)

+1. Distributions.jl is a good model for that IMHO. It's a single-purpose library, and I like that I don't have to buy into a wider ecosystem to use it.

Evizero commented 8 years ago

All these discussions make me suspect that we all talk about different layers of a complex system (like the CachedLeastSquares example).

I think we should break the parts down into atomic units as @cstjean also just suggested.

I would like to break MLModels.jl up into a SupervisedLosses.jl package that is just concerned with Losses as a function f(y, yhat), as they are a well-studied concept and the implementation is as low level as it gets. No prediction models, no penalties, just the supervised losses.
Higher level stuff like @tbreloff suggested with ObjectiveComponent (which to me seems to be the same thing as the CachedLeastSquares suggestion) should use the losses as building blocks.
Penalties should live somewhere else, maybe the same place where these CachedLosses live, i.e. a package that is concerned with penalties as a function f(w).
Maybe a package MLLinearModels.jl that builds on SupervisedLosses.jl and connects losses with a linear prediction function (which is currently called EmpiricalRisk in MLModels)

tbreloff commented 8 years ago

Higher level stuff like @tbreloff suggested with ObjectiveComponent (which to me seems to be the same thing as the CachedLeastSquares suggestion)

I don't understand this comment. ObjectiveComponent would be the abstraction which includes many different concepts that might go into an objective function. Losses and Penalties are included in this list. Maybe I'm just confused on what CachedLeastSquares is.

The idea is that an ObjectiveFunction and a LearnableTransformation could be combined with a LearningAlgorithm and some sort of data iterator to learn optimal parameters for the "graph". I'm assuming that a transformation can be a complex graph of connected sub-transformations... a neural net is just one example. If that LearnableTransformation has a generative/stochastic trait, then one could also sample generated outputs from that transformation.

Here's my current thinking on the types (just a sample... should be enough to convey structure):

abstract Transformation
    abstract Mapping (or maybe StaticTransformation)
        type SigmoidActivation
        type LogTransform
    abstract LearnableTransformation
        type ArtificialNeuralNet
        type SVM

abstract ObjectiveComponent
    abstract PredictionLoss (or ModelLoss/PerformanceLoss/SomethingElseLoss)
        abstract DistanceLoss
        abstract MarginLoss
    abstract Penalty
    abstract DiscountedFutureRewards

abstract ObjectiveFunction  (can do value/deriv of some function of components)
    abstract EmpiricalRiskFunction

abstract LearningAlgorithm
    abstract OnlineAlgorithm
        type SGDAlgorithm
    abstract OfflineAlgorithm
        type ConvexMinimizer
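
For reference, part of the tree above could be written in current Julia syntax roughly as follows (the `abstract X` and `type X` forms used in this thread were later replaced by `abstract type X end` and `struct X`). This is only a sketch: the `learning_rate` field and the `is_learnable` and `is_online` query functions are hypothetical additions to show methods acting on the abstract types.

```julia
abstract type Transformation end
abstract type Mapping <: Transformation end
abstract type LearnableTransformation <: Transformation end
struct SigmoidActivation <: Mapping end
struct LogTransform <: Mapping end

abstract type ObjectiveComponent end
abstract type PredictionLoss <: ObjectiveComponent end
abstract type DistanceLoss <: PredictionLoss end
abstract type MarginLoss <: PredictionLoss end
abstract type Penalty <: ObjectiveComponent end

abstract type LearningAlgorithm end
abstract type OnlineAlgorithm <: LearningAlgorithm end
abstract type OfflineAlgorithm <: LearningAlgorithm end
struct SGDAlgorithm <: OnlineAlgorithm
    learning_rate::Float64   # hypothetical field, for illustration only
end

# Methods written against the abstract types apply to every subtype.
is_learnable(::Mapping) = false
is_learnable(::LearnableTransformation) = true
is_online(::OnlineAlgorithm) = true
is_online(::OfflineAlgorithm) = false
```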

The idea is that we put the abstract type tree in LearnBase, along with anything required for a traits implementation (query functions, or something else). I also think some methods which act on the abstract types should live there, though I can be swayed on that point. At a minimum, we should keep the method stubs that are already there.

I would like to break MLModels.jl up

Can we try to keep things compact initially and then split them out later as it makes sense? It takes a lot of extra effort to keep various repos in sync, to manage contributors and issues, etc. Let's make it easy for ourselves and dump this stuff (objective components and some common transformations) in MLModels.

ahwillia commented 8 years ago

I would like to break MLModels.jl up

Can we try to keep things compact initially.

I would support a middle road -- start a repo for both losses and penalties (but no models). I think Distances.jl and Distributions.jl are good models to follow for this. I could contribute to this.

tbreloff commented 8 years ago

Alex, when you say "losses and penalties (but no models)"... What exactly do you mean by a "model" here? If you mean "empirical risk model" and other standard objective formulas, then I agree (see below).

Regarding common transformations (log transform, activation functions, etc) I think it would make life easier to also include them in MLModels. We can bikeshed that name if it helps.

We discussed putting ParameterUpdater (I'm happy to bikeshed that name) and the concrete implementations of OnlineAlgorithm into StochasticOptimization.jl. Are we still happy with that plan?

Maybe we can have a new package (ObjectiveFunctions.jl?) to house the implementation of ObjectiveFunction (but the abstract type should still be defined in LearnBase, just like all the other top-level abstractions). In my view, this would house a general, clean way to define arbitrary functions of ObjectiveComponents, allowing one to call value, deriv, etc. We could consider sticking an empirical risk model here. I could also see putting all this in MLModels.... Happy to defer to the group.

How does everyone feel about my most recent abstract type tree that I proposed?

Also, let's assign "leads" to each package, so there's someone coordinating the design and development. I'm happy to take on the lead role for an ObjectiveFunctions.jl, at a minimum (pun intended), and maybe MLModels. Alex: you wanted to run with StochasticOptimization, right? Josh, did you also want to work on that? Christof, I know you're busy, so we're happy to have you help with whatever you can. Anyone else... Please volunteer yourself for something which you have expertise in.

ahwillia commented 8 years ago

I'd love to collaborate with @joshday on the stochastic optimization stuff. Seeing as that package will be a consumer of ObjectiveFunctions.jl (a name that I like), I will likely keep tabs on that as well.

ahwillia commented 8 years ago

Another quick note. I updated a prototype for Transformations. I haven't yet included types for Mapping and LearnableTransformation -- it seems like we might be able to just use traits to distinguish these?

Here is how I might implement static transformations as invertible pairs: https://github.com/ahwillia/MLTransformations.jl/blob/master/src/static.jl#L8

Here is a sketch of stochastic transformations: https://github.com/ahwillia/MLTransformations.jl/blob/master/src/stochastic.jl

I'm not attached to this code at all, but posting here in the off chance it is helpful. Feel free to integrate into whatever prototypes you have.
