Yeah, that was very similar to what I was thinking.
+1
Since I don't tend to consider fields as part of the public interface, I'd be inclined to have the interface only define required methods rather than fields, but that's just me.
This is probably a minor implementation detail, but if there were a way of making the update functions more composable while remaining relatively performant, that would be ideal. For example, there are several methods for computing sparsity and decay penalties with different parameters, and it would be nice if the update function could take these arbitrary calculations as functions. I suppose that would probably result in less efficient matrix operations though...? Another option could be to provide extra `update!` closures which trade off performance in favour of composability, and document that. Thoughts?
As for the `ConditionalRBM`, this would cover my use cases fine.
> Since I don't tend to consider fields as part of the public interface, I'd be inclined to have the interface only define required methods rather than fields, but that's just me.
I imagine it more like `protected` fields in OO languages, i.e. something that only related types know about. For all outer interaction I tend to use accessor methods as well.
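For instance, a minimal accessor along these lines (a sketch; `weights` and `hidden_bias` are hypothetical names, not existing package functions):

```julia
# Hypothetical accessors: outer code goes through methods, while the
# concrete fields stay an implementation detail shared by related types.
weights(rbm::RBM) = rbm.W
hidden_bias(rbm::RBM) = rbm.hbias
```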
> For example, there are several methods for computing sparsity and decay penalties with different parameters and it would be nice if the update function could take these arbitrary calculations as functions. I suppose that would probably result in less efficient matrix operations though...?
I thought about something like this:
```julia
type RBM
    ...
    update!::Function
end

# Scale W in place by the configured decay rate.
function decay_weight!(W::Matrix{Float64}, opts)
    decay_rate = opts[:weight_decay]
    scal!(length(W), decay_rate, W, 1)  # W <- decay_rate * W, in place
end

function add_sparsity!(W::Matrix{Float64}, opts)
    ...
end

# Compose several (fn, opts) pairs into a single in-place update function.
function compose_updates(fn_opts...)
    function update!(W::Matrix{Float64})
        for (fn, opts) in fn_opts
            fn(W, opts)
        end
    end
    return update!
end

...
rbm.update! = compose_updates((decay_weight!, Dict(:weight_decay => 0.9)),
                              (add_sparsity!, Dict(...)))
```
`compose_updates` composes several functions, each of which updates the weight matrix in place. This way we have zero performance overhead for the matrix operations and only a tiny (say, 0.01%) overhead for calling function variables instead of native dispatch.

(There should be a better syntax for setting the update function, but we can figure that out later.)
Alright, that makes sense. Thanks.
Hi, I just came across this paper and was wondering if the refactoring could include a way of setting the floating point precision used?
I see no fundamental limitations for this, though the API is unclear to me right now.
What about just parameterizing the RBM and adding a precision keyword to the constructor?
This is exactly what I'm experimenting with right now in the `refactoring` branch (see `rbm2.jl` for the current state of affairs). One issue that I've encountered is that BLAS procedures are defined only for `Float32` and `Float64` (as well as their complex equivalents, `Complex64` and `Complex128`, but I believe they are out of scope). One option is to wrap ordinary matrix operations into functions with the same names (e.g. define `scal!(alpha::Float16, a::Array{Float16,2}) = alpha .* a`), but performance may be just terrible:
```julia
julia> @time for i=1:1_000_000 scal!(200, Float32(2), X32, 1) end
  0.079212 seconds

julia> @time for i=1:1_000_000 Float16(2.) * X16 end
  2.765985 seconds (2.00 M allocations: 534.058 MB, 0.75% gc time)
```
I believe that in non-BLAS operations on arrays of floats smaller than `Float64`, these arrays are first converted to `Float64` and then converted back to the smaller representation after the operation, so I'm not sure what we gain here. On the other hand, BLAS can operate on `Float32` directly, as efficiently as on `Float64`, while requiring less space (in exchange for reduced precision). So having the RBM parametrised by element type is a good feature anyway.
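A minimal sketch of what that parametrisation might look like, with precision chosen via a keyword argument on an outer constructor (the constructor signature and initialization here are assumptions, not the actual branch code):

```julia
# Hypothetical: RBM parameterised by element type T; precision is selected
# with a keyword argument and all fields are stored as T.
type RBM{T<:AbstractFloat}
    W::Matrix{T}
    vbias::Vector{T}
    hbias::Vector{T}
end

function RBM(n_vis::Int, n_hid::Int; precision::DataType=Float64)
    T = precision
    RBM{T}(convert(Matrix{T}, randn(n_hid, n_vis) * 0.01),  # small random weights
           zeros(T, n_vis),
           zeros(T, n_hid))
end
```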
Yeah, it still might be useful to give the option of at least using `Float32`. I imagine the only time `Float16` might be useful is for network transfers in distributed learning rules, which is probably outside the scope here.
@Rory-Finnegan, @dfdx: The modifications you both describe seem like a good direction for the repository, especially the application of composable update functions; this can certainly allow for a lot of modularity in the kinds of regularizations that get applied to parameter updates.
Speaking of modularity, perhaps one should consider addressing the possibility of non-binary distributions for the visible units (and perhaps even the hidden units). @dfdx had an initial implementation of Gaussian-distributed visible units, for example. (We're working on some extensions on this point; we'll have something to say about it in the coming months.)
However, using general priors would require a change to the RBM structure to include fields for an arbitrary number of prior parameters at each variable on the hidden and visible sides, e.g.:
```julia
type RBM
    W::Matrix{Float64}
    vParam::Matrix{Float64}
    hParam::Matrix{Float64}
    gradient::Function
    update!::Function
end
```
In the case of binary distributions, these `Param` fields resolve to the normal `hbias` and `vbias`. For Gaussian-distributed units, these become a mean and a variance. Arbitrary discrete distributions might be desired as well. I write the `Param`s as matrices because I don't know the best possible way of composing this arbitrary structure. Of course, initialization by `Distribution` type should properly assign the space for the parameters.
Additionally, the gradient updates on the variable distribution parameters should be handled differently from the updates of the couplings, `W`. Specifically, the parameter updates on `vParam` and `hParam` should be generalized such that a variable number of prior-dependent gradients can be utilized.
> Speaking of modularity, perhaps one should consider addressing the possibility of non-binary distributions for the visible units (and perhaps even the hidden units).
I have already expanded it in the `refactoring` branch to support a degenerate distribution, which makes it possible to always take the exact mean values for specified units during sampling (this was recommended in Hinton's practical guide to training RBMs, though formulating it as a degenerate distribution is mine). Adding support for arbitrary distributions doesn't sound that easy to me, however, because they lead to a change in the log-likelihood and thus the gradient.
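To illustrate the idea (a hedged sketch; these function names are illustrations, not the actual branch code):

```julia
# Bernoulli units are sampled from their means; "degenerate" units simply
# pass the exact means through, as suggested in Hinton's practical guide.
sample_bernoulli(means::Vector{Float64}) = float(rand(length(means)) .< means)
sample_degenerate(means::Vector{Float64}) = means
```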
Also, I don't think we really need to change `hbias` and `vbias`. Essentially, these 2 fields are not parameters of a distribution, but rather weights in the log-linear model used to calculate clique potentials and then the energy. Recall that sampling is done from distribution means calculated (for hidden units) as `logistic(W * v + hbias)`, and these means are essentially one of the parameters of the distribution of the hidden and visible variables. So I think we need to keep `hbias` and `vbias` as just weights.
Instead, for custom distributions we might add 2 more fields (I will follow your notation, except for naming conventions): `vparams` and `hparams`. Since we already have the excellent Distributions.jl package, we can use its types for storing distribution parameters. Thus, we get the following structure:
```julia
type RBM{T,V,H}
    W::Matrix{T}
    hbias::Vector{T}
    vbias::Vector{T}
    hparams::Vector{H}
    vparams::Vector{V}
end
```
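For example, a hypothetical instantiation with Gaussian visible units and Bernoulli hidden units (sizes and parameter values are made up for illustration):

```julia
using Distributions

# 20 visible units with Normal priors, 10 hidden units with Bernoulli priors.
rbm = RBM{Float64, Normal, Bernoulli}(
    randn(10, 20) * 0.01,              # W
    zeros(10),                         # hbias
    zeros(20),                         # vbias
    [Bernoulli(0.5) for i in 1:10],    # hparams
    [Normal(0.0, 1.0) for i in 1:20]   # vparams
)
```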
Note that I removed `gradient` and `update!` from the `RBM` fields (and put them into `fit()` config parameters instead) to separate the structure of the RBM from its training method.
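A hypothetical call might then look like this (`persistent_cd_gradient` and the option names are assumptions based on this thread, not actual package API):

```julia
# Training behaviour configured via fit() rather than stored on the RBM.
fit(rbm, X, Dict(:gradient => persistent_cd_gradient,
                 :update!  => compose_updates((decay_weight!, Dict(:weight_decay => 0.9)))))
```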
Does it sound reasonable?
@Rory-Finnegan I'm near the end of refactoring conditional RBMs now, but I'm not sure how to test if they still behave correctly. For ordinary RBMs I can take, say, MNIST and inspect fitting result visually, as well as check final pseudo-likelihood. Do you have any dataset / example that I can use for testing conditional RBMs?
The dataset I'm currently testing on isn't publicly available, but I believe Graham Taylor used MNIST in one of his papers. The general approach was to train the CRBM to condition the clean image on a noisy version of it. For testing, you can look at the accuracy of predicting the "cleaned" images given their noisy counterparts. If you want, I can write these tests.
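For instance, building the noisy/clean pairs could be as simple as this sketch (the noise level and scheme are assumptions, not necessarily what the tests will use):

```julia
# Corrupt binary MNIST digits with salt-and-pepper noise to obtain
# (noisy, clean) pairs for conditioning; X is pixels x examples in {0, 1}.
noisy = copy(X)
mask = rand(size(X)...) .< 0.1     # flip roughly 10% of the pixels
noisy[mask] = 1.0 .- noisy[mask]
```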
That would be perfect. Then we'll be able to use such tests as a usage example as well.
Sounds good. Is there a branch you'd like me to add that to?
You can add it to `master`, and I will then merge the changes into the `refactoring` branch and compare results.
PR #21 opened with those tests.
Great, thank you! I'm going to create a PR for the refactoring tomorrow so you can check whether the new design matches your thoughts / needs.
I just realized that the conditional RBM could be generalized a bit better; should I make a PR to master or wait until the refactoring gets merged? The issue is that I'm currently just conditioning on previous observations, but there isn't any reason you couldn't condition on arbitrary things.
@Rory-Finnegan Does that mean we could use `ConditionalRBM` for classification? If so, it opens up quite a lot of new possibilities :)
PR #22 is now open. Please check it out, and if it looks mostly good, you can base your changes on it. Otherwise, feel free to make a separate PR to master.
@dfdx In theory, yes, you could use a conditional RBM for classification. You'd just need to softmax the label predictions. I'm not sure how common that is, or whether there are papers comparing it to other common methods, but we could always test it on the MNIST dataset and see how it does.
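For reference, a generic numerically stable softmax might look like this in Julia (a sketch, not code from the package):

```julia
# Turn raw label activations into class probabilities.
function softmax(x::Vector{Float64})
    ex = exp(x - maximum(x))   # subtract the max for numerical stability
    ex / sum(ex)
end
```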
The PR looks good to me, so I'll just add my changes to it.
PR #23 to `refactoring` is open.
So, two things I've thought of while going through the `refactoring` branch:

1. I'm currently randomizing the ordering of patterns in my training matrices, but it would be nice if the context supported a randomize-batches flag to do this when creating batches (a sketch of the idea follows below).
2. The flexible floating-point precision discussed above.

Thoughts?
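A minimal sketch of the batch-randomization idea (the `randomize_batches` flag name is hypothetical):

```julia
# Shuffle example order once per epoch before slicing X (pixels x examples)
# into minibatches.
if get(context, :randomize_batches, false)
    X = X[:, randperm(size(X, 2))]
end
```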
Randomization of batch order is here. I need one more day for flexible precision (it should be trivial to implement, but I'm very short on time right now).
Awesome, thank you! I'm still getting some NaNs and Infs in the pseudo-likelihood calculation, so I'll probably have another PR once I figure out where they're coming from.
Regarding flexible precision, I just pushed a change that converts each batch to an array of the RBM's element type in the `fit()` function. Adding it to other parts of the code is trickier than I thought and raises several design questions; I will open a separate issue to address them.

Meanwhile, I see that this branch is mostly ok, so I'm going to merge it into master in 24 hours and discuss new features separately.
FYI, with the recent changes I'm getting a bunch of warnings like `WARNING: Option 'persistent_chain' is unknown, ignoring`. I'm guessing that `persistent_chain`, `dW_buf` and `dW_prev` just need to be added to `KNOWN_OPTIONS`?
Basically, no: they should be created during the first call to the corresponding method (recall that the context holds both options and buffers, and these 3 are exactly the latter). Do you need to pass them explicitly for some reason?
I wasn't explicitly passing them in, but I've figured out why I was getting the warnings and you haven't been. I was running multiple RBMs in a row with the same context, which resulted in `@get_array` setting the values in the context on the first call to `fit` and producing warnings on subsequent calls.
Ah, that makes sense. I think we can simply copy the input context (i.e. options only) instead of using and populating the original one. I will push the change.
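Presumably something along these lines (a hypothetical sketch of the fix, not the actual commit):

```julia
# Work on a shallow copy so internal buffers (persistent_chain, dW_buf,
# dW_prev) are added to a local copy, not to the caller's context.
function fit(rbm::RBM, X::Matrix, ctx::Dict)
    ctx = copy(ctx)
    # ... training loop populates ctx with buffers as before ...
    return rbm
end
```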
The `refactoring` branch is now merged into master. Please feel free to open separate issues to address further improvements.
As recently discussed in #9, it would be nice to refactor the project to better reflect different kinds of RBMs and different ways to learn them. Currently I can think of the following variations:
- RBM kinds
- gradient calculation
- weight update options
Given these requirements, I imagine a minimal RBM setup to look something like the sketch below (for all but `CLRBM` types), where `gradient()` and `update!()` are closures initialized with all needed options. Then a single method `fit()` can use these closures to learn the RBM and provide monitoring.

Note that this doesn't restrict implementations to a single type, but instead provides a contract, i.e. it specifies the fields required to provide compatibility between different methods for learning and inference.
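A minimal sketch of such a type, consistent with the fields discussed elsewhere in this thread (the exact original snippet isn't reproduced here, so the field list is an assumption):

```julia
# Sketch of the proposed RBM contract: plain data fields plus two closures
# that encapsulate the training configuration.
type RBM
    W::Matrix{Float64}        # couplings between visible and hidden units
    vbias::Vector{Float64}    # visible unit biases
    hbias::Vector{Float64}    # hidden unit biases
    gradient::Function        # closure computing the weight gradient (CD, PCD, ...)
    update!::Function         # closure applying the update (decay, sparsity, momentum, ...)
end
```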
@Rory-Finnegan @eric-tramel Does it make sense to you? Do your models fit into this approach?