JuliaML / Transformations.jl

Static transforms, activation functions, and other implementations of LearnBase abstractions
Other
31 stars 4 forks source link

What is a Transformation? #2

Open tbreloff opened 8 years ago

tbreloff commented 8 years ago

I was about to start writing code, but I think it's worth having a discussion first. If we're going to realize my eventual vision of autonomous creation of valid "directed graphs of transformations", then we need to approach the problem systematically. Every transformation should have the same type signature, otherwise we're going to have too many special cases for the framework to be useful.

A Transformation should:

Should this info be part of the type signature as parameters? It might have to be, though I'm going to attempt to solve this through "query functions" that produce traits to be dispatched on. If I'm successful, we don't need the wrappers for Distributions at all... we just need the generic:

output_shape(::Distributions.Sampleable{Univariate}) = 0  # scalar
output_shape(::Distributions.Sampleable{Multivariate}) = 1 # vector
# etc

For generative/stochastic transformations (generating distributions):

I wonder if we should think of the input as "randomness". i.e. we take "randomness" as input, and generate rand(t, output_dims). I don't know the best way to represent this... maybe just immutable Randomness end

The transform should, I think, take in nothing, and output the center. Think about if you have a generative process:

T := y = wx + N(mu,sigma)

where T is a Transformation. What does it mean to transform x into y? I think it probably means to give our best estimate of y given x, so: transform(T, x) == w * x + mu. If that's the case, then transform(N) == mu.

What does it mean to generate an output from T? We're walking into the land of Bayesians here... I started to answer this but need more time to think it through.

What does it mean to learn the parameters w? I think: learn(T, x, y) means to learn w, mu, and sigma in parallel.

Do we agree on these definitions? If so then we need some drastic changes to the first PR.

Evizero commented 8 years ago

To give a short reply I can list a few requirements that I would need one way or the other

Some discussion topics:

tbreloff commented 8 years ago

Additionally, what would be really powerful is if we didn't need to subtype from Transformation at all, as long as the query functions were implemented for the type, and anything implied by the traits was implemented. What if we re-think the type tree to be the type tree of traits? Then Transformation is the abstract type of the trait, not the actual transformation object. Let me think out loud:

# I/O are the shapes (num dims) of the input and output
abstract TransformationType{I, O}
    immutable Transformable{I,O} <: TransformationType{I,O} end
    immutable NotTransformable{I,O} <: TransformationType{I,O} end

transform(nt::NotTransformable, args...) = error()

typealias Generative{O} Transformable{Void, O}

# this is the object which will do an affine transformation: yhat = wx + b
immutable Affine{W, B}
    w::W
    b::B
end

transform(aff::Affine, x) = aff.w * x + aff.b  # actual code would be more efficient

# these methods allow us to "plug in" an Affine object in with other "Transformable" objects,
#    even though they can have their own disparate type systems
# note: can't be exactly like this to maintain type stability, but something similar
input_shape(aff::Affine) = (size(w,2), size(b,2))
output_shape(aff::Affine) = size(b)
transformation_type(aff::Affine) = Transformation{1,1}

# we could similarly do `rand` implementations using a different trait.
# hopefully you understand the pattern I'm trying for:

abstract Randomness
    immutable Static <: Randomness end
    immutable Stochastic <: Randomness end

# everything is static, unless we specify
randomness(x) = Static()

generate(t, args...) = generate(randomness(t), t, args...)
generate(::Static, args...) = error()
generate(::Stochastic, t, args...) = rand(t, args...)

randomness(d::Distribution) = Stochastic()

# now we can "plug in" Distribution objects, as long as they implement `rand`
# we don't need wrapper types
tbreloff commented 8 years ago

A way to dispatch on a bias being present. Maybe name the types Linear... vs Affine...

If the type is parameterized, then typealias Linear{W} Affine{W,Void}?

for performance reasons a collumn should denote an observation.

I'm on board with this. I regret the decision in OnlineStats.

@joshday mentioned in his juliacon talk that there might be a efficient way to have rows for observations now?

Yes... Tim Holy's PermutedDimsArray. They are apparently hidden in Base 0.5. I think this should be the method of using a row-based table.

joshday commented 8 years ago

for performance reasons a collumn should denote an observation.

This isn't universally better. It's great for SGD, but bad for coordinate descent. Enforcing one or the other may be enough to keep people from buying in to JuliaML. I would prefer the default match notation (observations in rows), but I could be convinced otherwise.

ahwillia commented 8 years ago

The transform should, I think, take in nothing, and output the center.

I think we need at least three (maybe four) verbs here:

Also, I have a nitpick about this example:

T := y = wx + N(mu,sigma)

Should instead be:

T := y = wx + N(0,sigma)

(unless mu is the intercept in the model). The more important point I want to make is that it is actually better to think about the above model like this:

T := y = N(wx,sigma)

The difference is subtle, but important, for GLMs. Basically you think of the total transformation as a linear predictor wx that specifies the first moment of the noise distribution (after an inverse link function is applied). For Poisson regression there is no second moment:

T := y = Poisson(exp(wx))

From a practical standpoint, if you want a generative model, you need to specify sigma in the first example, but no additional parameters for Poisson.

ahwillia commented 8 years ago

I would prefer the default match notation (observations in rows), but I could be convinced otherwise.

+1, though I could also be convinced otherwise.

ahwillia commented 8 years ago

One more thought. If we are really going the Bayes Net route, we should talk with @sisl and maybe sync up with their BayesNets.jl package. cc: @tawheeler

tbreloff commented 8 years ago

T := y = N(wx,sigma)

I like this, if we can figure out the right way to do it. It should probably be T := y = N(wx + b, sigma)

Evizero commented 8 years ago

For SVMs to be fast i need an observation to be adjacent in memory

tbreloff commented 8 years ago

Yeah I guess we should have our convention be y = x'w + b, which is common in Matlab workflows

tbreloff commented 8 years ago

I pushed up some of my experiments... I think they're pretty promising. Check out the main Transformations.jl in the tom branch.

The gist is that there's a immutable Transformation which wraps an arbitrary object, and holds input/output data dimensions in the parameter signature. Then you implement a few "info methods" and one version of transform!, learn!, etc (whichever methods are appropriate), and then you've "linked" your object into the Transformation abstraction. We can build generic methods which dispatch on the results of queries (like is_learnable(o)).

Here's an example from runtests which wraps an OnlineStats.Means object:

# "link" this object into the transformations world
Transformations.input_size(m::Means) = size(m.value)
Transformations.output_size(m::Means) = size(m.value)
Transformations.is_learnable(m::Means) = true
Transformations.transform!(y, m::Means, x) = (y[:] = x - m.value)
Transformations.learn!(m, x) = OnlineStats.fit!(m, x)

# instantiate an OnlineStats.Means, which computes an online mean of a vector
m = Means(5)

# wrap the object in a Transformation, which stores input/output dimensions in its type,
# and allows for the common functionality to apply to the Transformation object.
t = transformation(m)

# update/learn the parameters for this transformation
#   (at this point, we don't care what the transformation is!)
learn!(t, 1:5)

# center the data by applying the transform. this dispatches generically.
# we only need to define a single `transform!` method
y = transform(t, 10ones(5))

# make sure everything worked
@test y == 10ones(5) - (1:5)
tawheeler commented 8 years ago

Hey! Happy to discuss BayesNets. So my initial understanding is that you want to create a sort of computational graph (without the backprop)?

tbreloff commented 8 years ago

Hi Tim. Yes computational graph is part of it. I do care about backprop, though not everyone will. It's more an attempt to generalize and unify different models and approaches. We'd love your comments on the discussion so far.

On Friday, July 1, 2016, Tim Wheeler notifications@github.com wrote:

Hey! Happy to discuss BayesNets. So my initial understanding is that you want to create a sort of computational graph (without the backprop)?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaML/Transformations.jl/issues/2#issuecomment-230061891, or mute the thread https://github.com/notifications/unsubscribe/AA492nwC0yoGKdsrQqKYvfcGWRmx3KaBks5qRZaLgaJpZM4JDHaw .

ahwillia commented 8 years ago

One thing I'm curious about is whether we can use the same set of abstractions to specify a computation graph as well as a generative model (more or less as a Bayes net). At the moment, I am interested in the latter for my day job, but I might be in the minority.

This has made me think it isn't so crazy to support both: http://edwardlib.org/

P.S. @tawheeler - Are you in the Bay Area for the summer? I'm at Stanford as well, but am in Livermore until October. Would love to meet and chat about BayesNets.jl in person.

tawheeler commented 8 years ago

This sounds like a very neat concept. I would think that forward and backward would be natural function names to use. BayesNets.jl has each node know its own name and the names of its parents in order to allow it to pull assignments from dictionaries. I don't know whether that would necessarily be best for transformations.jl; nodes may be better off directly pulling from their parents. Nevertheless, one would want to make the design decision as to whether each transformation node works as a stand-alone or whether it only makes sense in the context of a type containing the DAG info. Is this package meant to always output distributions? Note that computational graphs like in TensorFlow do not, in general, produce distributions.

@ahwillia I am, but will be heading to Germany from Oct through the end of the year. Happy to connect you with SISL though, or meet thereafter.

Evizero commented 8 years ago

A little package name nitpick that may come up along the way. I discovered that there is an implicit convention to call packages Transforms instead of Transformations

Not sure if there is a subtle difference between a transform and a transformation, but I thought I'd bring it up

Evizero commented 8 years ago

Should we stick with Transformations then? maybe it would be useful to have a term just for the ML side of things.

I shall refactor the stuff I am working on to ImageTransformation then. Or do we always want to go the "wrap" approach Tom suggested? I suspect that offering a common base class would make opt-in a little easier. so I could do ImageTransformation <: AbstractTransformation or something and just implement relevant methods directly

ahwillia commented 8 years ago

I voted for "Transform" over "Transformation" a while back because it is shorter and (to me) is synonymous.

On Aug 6, 2016 10:30 AM, "Christof Stocker" notifications@github.com wrote:

Should we stick with Transformations then? maybe it would be useful to have a term just for the ML side of things.

I shall refactor the stuff I am working on to ImageTransformation then. Or do we always want to go the "wrap" approach Tom suggested? I suspect that offering a common base class would make opt-in a little easier. so I could do ImageTransformation <: AbstractTransformation or something and just implement relevant methods directly

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaML/Transformations.jl/issues/2#issuecomment-238035210, or mute the thread https://github.com/notifications/unsubscribe-auth/AAm20bafmL23lXNlX8jSGnSQ1BRL6Rqoks5qdMTQgaJpZM4JDHaw .

Evizero commented 8 years ago

I am ok with Transform as well. The length of transformation is already a bit of a hassle (I am using it at the moment). Anyway I'll go with Transformation for now to make progress, but I can refactor again later once the consensus is reached.

tbreloff commented 8 years ago

For reference, copying from gitter:

Tom Breloff @tbreloff i pushed up a mostly-working prototype to Transformations... i'd love comments/criticism on the approach so far if we have any macro-sugar using Flow or something else, i'm expecting that it will just construct these sorts of objects under the hood, so we can tackle that later the high level is that each predefined transformation (like Affine, or an activation function) holds references to nodes in a computation graph, and has storage for values in the forward pass and gradients in a backward pass the output node(s) of one transformation can be "linked" to the input node(s) of another, which means they'll share the underlying array storage, and updating the outputs of one transformation automatically shows the updates in the inputs of a subsequent transformation same goes for gradients in reverse the net effect is that each transformation is fully self-contained... it doesn't need to know about its place in a larger network. it has everything it needs to compute values/gradients

Since then I also implemented many Activation, and a Chain which can be used for basic ANNs and similar. I pushed it up to master, since I don't think we cared too much about the existing master branch. Let me know your thoughts.

tbreloff commented 8 years ago

One piece I'm a little unsure about is how best to add penalties to the mix. Are the penalties a component in each learnable transformation? Or maybe they are passed into every grad! method somehow? I really want to avoid the "implementation leakage" involved with passing penalty objects around, especially if that penalty is applied locally and doesn't impact backpropagated gradients. It's super nice that I was able to implement deep neural nets in 2 lines:

transform!(chain::Chain) = (foreach(transform!, chain.ts); chain.output.val)
grad!(chain::Chain) = foreach(grad!, reverse(chain.ts))

Penalties will mess that up :(

tbreloff commented 8 years ago

I'm going to move on to ObjectiveFunctions next week, and try to piece some of this stuff together. I'm thinking that package will be no more than conveniences to connect Transformations, Losses, and Penalties together for use in StochasticOptimization. At some point we can revisit the package organization... lets get some stuff working in the wild first.

ahwillia commented 8 years ago

Thanks for taking a stab at this - I really like what you've done 💯

Adding the contribution of the penalty to the gradient should be taken care of by a different package right? I don't see why it would be too difficult.

using Transformations
using Penalties
using Losses

type PenalizedDeepNet{P<:Penalty}
    layers::Chain
    penalties::Vector{P} # penalty at each layer
    loss::SupervisedLoss
end

function grad!(net::PenalizedDeepNet, target)
    # gradient of output weights w.r.t. loss
    grad!(net.loss, target, net.layers[end])

    # backprop
    grad!(net.layers)

    # the penalty terms are just added to the overall objective,
    # so their contribution can now be added
    for (layer,penalty) in zip(net.layers, net.penalties)
         addgrad!(layer, penalty)
    end
end

I just did that quickly... but is that the gist?

Evizero commented 8 years ago

It looks like there is quite a lot going on in the code. we should benchmark the affine prediction against a minimal handcrafted version that just does the matrix multiplication plus the bias. I like the generality, but we must tread carefully here.

I will give this a more detailed review once 0.5 is released and I have a bit more time on my hands. Thanks for working on this Tom.

tbreloff commented 8 years ago

Please do benchmark. In theory it should be very close in computation to a handcrafted version.

I started ObjectiveFunctions last night with something similar to your code Alex, so we're on the same page.

On Thursday, September 1, 2016, Christof Stocker notifications@github.com wrote:

It looks like there is quite a lot going on in the code. we should benchmark the affine prediction against a minimal handcrafted version that just does the matrix multiplication plus the bias. I like the generality, but we must tread carefully here.

I will give this a more detailed review once 0.5 is released and I have a bit more time on my hands. Thanks for working on this Tom.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaML/Transformations.jl/issues/2#issuecomment-243987289, or mute the thread https://github.com/notifications/unsubscribe-auth/AA492nXuDeV0iQTBb02iSvC6ruDEXlnvks5qlm-xgaJpZM4JDHaw .

tbreloff commented 8 years ago

Thinking out loud... if a Transformation has a params(trans) method defined which returns a Vector/view/CatView of parameters, and similar a grad(trans) for the vector of gradients, then we can avoid thinking about penalties in the grad! methods of transformations, and simply add something like what you had above to an empirical risk object:

type RegularizedObjective{T<:Transformation, L<:LossTransform, P<:Penalty}
    transformation::T
    loss::L
    penalty::P
end

# assumes we've already updated the Node objects through a forward pass
function grad!(objfunc::RegularizedObjective)
    grad!(objfunc.loss)
    grad!(objfunc.transformation)
    addgrad!(grad(objfunc.transformation), objfunc.penalty, params(objfunc.transformation))
end

# and then the penalty would be something like:

function addgrad!(∇θ::AbstractVector, penalty::Penalty, θ::AbstractVector)
    for i in eachindex(θ)
        ∇θ[i] += deriv(penalty, θ[i])
    end
end

So Losses and Transformations would know nothing about Penalties, and vice versa. ObjectiveFunctions would bind them together in a single grad! update. cc: @joshday

joshday commented 8 years ago

Not sure I completely follow, but I think the pieces are there. Is grad!(objfunc::RegularizedObjective) missing data arguments?

tbreloff commented 8 years ago

Is grad!(objfunc::RegularizedObjective) missing data arguments?

Sort of. I forgot to put it <: Transformation, and when I do that it can use the defaults:

# Copy input values into the input node, then transform
function transform!(t::Transformation, input::AbstractVector)
    copy!(input_value(t), input)
    transform!(t)
end

# Copy the gradient into the output node, and propagate it back.
function grad!(t::Transformation, ∇out::AbstractVector)
    copy!(output_grad(t), ∇out)
    grad!(t)
end

So the "no data args" versions assume that the forward pass already has the input value loaded and populates the output value, and the backward pass assumes the output gradient is loaded and populates the input gradient (as well as the gradients of any learnable parameters).

Part of the Transformations API that I'm playing with is that a Transformation holds forward and backward state vectors. This simplifies creating and passing temporary vectors around and lets you operate on a transformation knowing that there are value and gradient vectors pre-computed before implementing the actual value/gradient logic for each transformation type.

This all might seem like a heavy-handed way to do some matrix multiplies and adds, and apply some functions, but in fact the inner loops of learning algos should be pretty lightweight. Storage can be shared between connected inputs/outputs among transformations, and the complexity is greatly reduced. (I hope!)

Right now I'm working on the ObjectiveFunctions design; how to nicely make the forward and backward passes, and when to pass in the target.

datnamer commented 8 years ago

How do tree based learners fit into this? What about classic statistical tests (wald etc).

tbreloff commented 8 years ago

How do tree based learners fit into this?

Well, they may not be differentiable, so the grad! method may have to throw an error? The transform! method would just copy the output of the tree into the output node of the TreeTransform?

What about classic statistical tests (wald etc)

I'm not prepared to give a complete answer, but note that you should always be able to get the params/gradients of all learnable parameters in a transformation, possibly as a CatView of underlying parameter arrays. This lets you treat a transformation as a black-box function f(θ), and presumably you can compute higher order derivatives and statistics however you want. With that said, it might make sense to have a built-in hessian! method that works like grad!, and goes unused when you don't care about it. If you have a compelling reason why any part of this design doesn't make sense, please let me know!

Evizero commented 8 years ago

Sounds all very nice so far (judging from a distance). Does your design allow for special handling of sparse arrays? (in the sense that down the road could we dispatch on that and execute more efficient code for those cases). Not sure how sparse arrays have evolved recently, but I remember that in KSVM there was a huge speed up when I handled them manually

tbreloff commented 8 years ago

I would say that sparse arrays can certainly be included, though they can't be used out of the box yet. I want this too, as things like Echo State Networks can be represented with a single recurrent sparse connection in a directed graph of transformations.

On Tuesday, September 6, 2016, Christof Stocker notifications@github.com wrote:

Sounds all very nice so far (judging from a distance). Does your design allow for special handling of sparse arrays? (in the sense that down the road could we dispatch on that and execute more efficient code for those cases). Not sure how sparse arrays have evolved recently, but I remember that in KSVM there was a huge speed up when I handled them manually

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaML/Transformations.jl/issues/2#issuecomment-245053517, or mute the thread https://github.com/notifications/unsubscribe-auth/AA492l62cpx4XjfYIZ6mZm-UNsK_RFKAks5qnbe7gaJpZM4JDHaw .

datnamer commented 8 years ago

Makes sense to me!

tbreloff commented 8 years ago

I was just reading back through this issue, and I wanted to comment on how we might integrate with a "BayesNet approach". cc: @tawheeler

I think we could have a abstract BayesTransform <: Transformation which has additional structure in that inputs would be connected to the underlying parameters of a distribution. So in the case that a=N(0,1); b=N(2a+3,1) as in the tutorial for BayesNet.jl, then transform!(a) would sample from N(0,1), then there would be an Affine connecting a to the mu of b, then calling transform!(b) would sample from a N(f(a), 1) where f(a) is the output of the Affine transformation.

To expand a little:

type NormalTransformation{T} <: BayesTransform
    mu::Node{:input,T}
    sigma::Node{:input,T}
    output::Node{:output,T}
    ...
end

function transform!(t::NormalTransformation)
    dist = Normal(value(t.mu), value(t.sigma))
    rand!(dist, value(t.output))
end

Once we have an "arbitrary graph" type, we could presumably track the names/labels of each transformation in a similar manner to how you do it in BayesNet.

tbreloff commented 8 years ago

FYI we already have this is LearnBase:

abstract StochasticTransformation <: Transformation
ahwillia commented 8 years ago

I like it, but is sigma an "input" or a fixed parameter of the transformation? Probably the latter right?

If all the input/outputs are Gaussian transformations then you should be able to analytically calculate the full distribution as output rather than just sample from it. Could be nice to support this as well at some point.

tbreloff commented 8 years ago

is sigma an "input" or a fixed parameter of the transformation?

Could be either I suppose. This could be parameterized.

you should be able to analytically calculate the full distribution as output

More generally, it would be good to be able to compute priors and posteriors for a given transformation... either learning from data/sampling or through analytic means when possible. I think it would be easy to do this using Monte Carlo techniques without too much headache, but analytic distributions could be tricky. (I hope others step in to help with this!)

ahwillia commented 8 years ago

think it would be easy to do this using Monte Carlo techniques without too much headache, but analytic distributions could be tricky.

This isn't my area of expertise, but I actually think the opposite! Gaussians are the only analytically tractable case and there is a cookbook we can follow for them. On the other hand, making effective Monte Carlo samplers seems to be quite involved. There are also variational techniques...

But lets start with just implementing forward sampling approach you've outlined. That will still be a useful first step.

tawheeler commented 8 years ago

FWIW I would really like to see more conjugate priors in JuliaStats. That is something I want to add to all learning in BayesNets.jl.

The "BayesNets" approach is hopefully straightforward. Each node is a CPD which supplies functions to obtain its name, parents, evaluation, learning, etc. It sounds like Transformations is a more general framework based on the same principles.