Yeah, that was very similar to what I was thinking.
+1
Since I don't tend to consider fields as part of the public interface, I'd be inclined to have the interface only define required methods rather than fields, but that's just me.
This is probably a minor implementation detail, but if there were a way of making the update functions more composable while remaining relatively performant, that would be ideal. For example, there are several methods for computing sparsity and decay penalties with different parameters, and it would be nice if the update function could take these arbitrary calculations as functions. I suppose that would probably result in less efficient matrix operations though...? Another option could be to provide extra `update!` closures which trade off performance in favour of composability, and document that. Thoughts?
As for the `ConditionalRBM`, this would cover my use cases fine.
> Since I don't tend to consider fields as part of the public interface, I'd be inclined to have the interface only define required methods rather than fields, but that's just me.
I imagine it more like `protected` fields in OO languages, i.e. something that only related types know about. For all outer interaction I tend to use accessor methods as well.
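For instance, a minimal accessor along these lines (a sketch; `weights` and `hidden_bias` are hypothetical names, not existing package functions):

```julia
# Hypothetical accessors: outer code goes through methods, while the
# concrete fields stay an implementation detail shared by related types.
weights(rbm::RBM) = rbm.W
hidden_bias(rbm::RBM) = rbm.hbias
```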
> For example, there are several methods for computing sparsity and decay penalties with different parameters and it would be nice if the update function could take these arbitrary calculations as functions. I suppose that would probably result in less efficient matrix operations though...?
I thought about something like this:
```julia
type RBM
    ...
    update!::Function
end

# Scale W in place by the configured decay rate.
function decay_weight!(W::Matrix{Float64}, opts)
    decay_rate = opts[:weight_decay]
    scal!(length(W), decay_rate, W, 1)  # W <- decay_rate * W, in place
end

function add_sparsity!(W::Matrix{Float64}, opts)
    ...
end

# Compose several (fn, opts) pairs into a single in-place update function.
function compose_updates(fn_opts...)
    function update!(W::Matrix{Float64})
        for (fn, opts) in fn_opts
            fn(W, opts)
        end
    end
    return update!
end

...
rbm.update! = compose_updates((decay_weight!, Dict(:weight_decay => 0.9)),
                              (add_sparsity!, Dict(...)))
```
`compose_updates` composes several functions, each of which updates the weight matrix in place. This way we have zero performance overhead for the matrix operations and only a tiny (say, 0.01%) overhead for calling function variables instead of native dispatch.

(There should be a better syntax for setting the update function, but we can figure that out later.)
Alright, that makes sense. Thanks.
Hi, I just came across this paper and was wondering if the refactoring could include a way of setting the floating point precision used?
I see no fundamental limitations for this, though the API is unclear to me right now.
What about just parameterizing the RBM and adding a precision keyword to the constructor?
This is exactly what I'm experimenting with right now in the `refactoring` branch (see `rbm2.jl` for the current state of affairs). One issue that I've encountered is that BLAS procedures are defined only for `Float32` and `Float64` (as well as their complex equivalents, `Complex64` and `Complex128`, but I believe they are out of scope). One option is to wrap ordinary matrix operations into functions with the same names (e.g. define `scal!(alpha::Float16, a::Array{Float16,2}) = alpha .* a`), but performance may be just terrible:
```julia
julia> @time for i=1:1_000_000 scal!(200, Float32(2), X32, 1) end
  0.079212 seconds

julia> @time for i=1:1_000_000 Float16(2.) * X16 end
  2.765985 seconds (2.00 M allocations: 534.058 MB, 0.75% gc time)
```
I believe that in non-BLAS operations on arrays of floats smaller than `Float64`, these arrays are first converted to `Float64` and then converted back to the smaller representation after the operation, so I'm not sure what we gain here. On the other hand, BLAS can operate on `Float32` directly, as efficiently as on `Float64`, while requiring less space (in exchange for reduced precision). So having the RBM parametrised by element type is a good feature anyway.
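A minimal sketch of what that parametrisation might look like, with precision chosen via a keyword argument on an outer constructor (the constructor signature and initialization here are assumptions, not the actual branch code):

```julia
# Hypothetical: RBM parameterised by element type T; precision is selected
# with a keyword argument and all fields are stored as T.
type RBM{T<:AbstractFloat}
    W::Matrix{T}
    vbias::Vector{T}
    hbias::Vector{T}
end

function RBM(n_vis::Int, n_hid::Int; precision::DataType=Float64)
    T = precision
    RBM{T}(convert(Matrix{T}, randn(n_hid, n_vis) * 0.01),  # small random weights
           zeros(T, n_vis),
           zeros(T, n_hid))
end
```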
Yeah, it still might be useful to give the option of at least using `Float32`. I imagine the only time `Float16` might be useful is for network transfers in distributed learning rules, which is probably outside the scope here.
@Rory-Finnegan, @dfdx: The modifications you both describe seem like a good direction for the repository, especially the application of composable update functions; this can certainly allow for a lot of modularity in the kinds of regularizations that get applied to parameter updates.
Speaking of modularity, perhaps one should consider addressing the possibility of non-binary distributions for the visible units (and perhaps even the hidden units). @dfdx had an initial implementation of Gaussian-distributed visible units, for example. (We're working on some extensions on this point; we'll have something to say about it in the coming months.)
However, using general priors would require a change to the RBM structure to include fields for an arbitrary number of prior parameters at each variable on the hidden and visible sides, e.g.:
```julia
type RBM
    W::Matrix{Float64}
    vParam::Matrix{Float64}
    hParam::Matrix{Float64}
    gradient::Function
    update!::Function
end
```
In the case of binary distributions, these `Param` fields resolve to the normal `hbias` and `vbias`. For Gaussian-distributed units, these become a mean and a variance. Arbitrary discrete distributions might be desired as well. I write the `Param`s as matrices because I don't know the best possible way of composing this arbitrary structure. Of course, initialization by `Distribution` type should properly assign the space for the parameters.
Additionally, the gradient updates on the variable distribution parameters should be handled differently from the updates of the couplings, `W`. Specifically, the parameter updates on `vParam` and `hParam` should be generalized such that a variable number of prior-dependent gradients can be utilized.
> Speaking of modularity, perhaps one should consider addressing the possibility of non-binary distributions for the visible units (and perhaps even the hidden units).
I have already expanded it in the `refactoring` branch to support a degenerate distribution, which makes it possible to always take the exact mean values for specified units during sampling (this was recommended in Hinton's practical guide to training RBMs, though formulating it as a degenerate distribution is mine). Adding support for arbitrary distributions doesn't sound that easy to me, however, because they lead to a change in the log-likelihood and thus the gradient.
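To illustrate the idea (a hedged sketch; these function names are illustrations, not the actual branch code):

```julia
# Bernoulli units are sampled from their means; "degenerate" units simply
# pass the exact means through, as suggested in Hinton's practical guide.
sample_bernoulli(means::Vector{Float64}) = float(rand(length(means)) .< means)
sample_degenerate(means::Vector{Float64}) = means
```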
Also, I don't think we really need to change `hbias` and `vbias`. Essentially, these 2 fields are not parameters of a distribution, but rather weights in the log-linear model used to calculate clique potentials and then the energy. Recall that sampling is done from distribution means calculated (for hidden units) as `logistic(W * v + hbias)`, and these means are essentially one of the parameters of the distribution of the hidden and visible variables. So I think we need to keep `hbias` and `vbias` as just weights.
Instead, for custom distributions we might add 2 more fields (I will follow your notation, except for naming conventions): `vparams` and `hparams`. Since we already have the excellent Distributions.jl package, we can use its types for storing distribution parameters. Thus, we get the following structure:
```julia
type RBM{T,V,H}
    W::Matrix{T}
    hbias::Vector{T}
    vbias::Vector{T}
    hparams::Vector{H}
    vparams::Vector{V}
end
```
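For example, a hypothetical instantiation with Gaussian visible units and Bernoulli hidden units (sizes and parameter values are made up for illustration):

```julia
using Distributions

# 20 visible units with Normal priors, 10 hidden units with Bernoulli priors.
rbm = RBM{Float64, Normal, Bernoulli}(
    randn(10, 20) * 0.01,              # W
    zeros(10),                         # hbias
    zeros(20),                         # vbias
    [Bernoulli(0.5) for i in 1:10],    # hparams
    [Normal(0.0, 1.0) for i in 1:20]   # vparams
)
```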
Note that I removed `gradient` and `update!` from the `RBM` fields (and put them into `fit()` config parameters instead) to separate the structure of the RBM from its training method.
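A hypothetical call might then look like this (`persistent_cd_gradient` and the option names are assumptions based on this thread, not actual package API):

```julia
# Training behaviour configured via fit() rather than stored on the RBM.
fit(rbm, X, Dict(:gradient => persistent_cd_gradient,
                 :update!  => compose_updates((decay_weight!, Dict(:weight_decay => 0.9)))))
```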
Does it sound reasonable?
@Rory-Finnegan I'm near the end of refactoring conditional RBMs now, but I'm not sure how to test if they still behave correctly. For ordinary RBMs I can take, say, MNIST and inspect fitting result visually, as well as check final pseudo-likelihood. Do you have any dataset / example that I can use for testing conditional RBMs?
The dataset I'm currently testing on isn't publicly available, but I believe Graham Taylor used MNIST in one of his papers. The general approach was to train the CRBM to condition the clean image on a noisy version of it. For testing, you can look at the accuracy of predicting the "cleaned" images given their noisy counterparts. If you want, I can write these tests.
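For instance, building the noisy/clean pairs could be as simple as this sketch (the noise level and scheme are assumptions, not necessarily what the tests will use):

```julia
# Corrupt binary MNIST digits with salt-and-pepper noise to obtain
# (noisy, clean) pairs for conditioning; X is pixels x examples in {0, 1}.
noisy = copy(X)
mask = rand(size(X)...) .< 0.1     # flip roughly 10% of the pixels
noisy[mask] = 1.0 .- noisy[mask]
```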
That would be perfect. Then we'll be able to use such tests as a usage example as well.
Sounds good. Is there a branch you'd like me to add that to?
You can add it to `master`, and I will then merge the changes into the `refactoring` branch and compare results.
PR #21 opened with those tests.
Great, thank you! I'm going to create a PR for the refactoring tomorrow so you can check whether the new design matches your thoughts / needs.
I just realized that the conditional RBM could be generalized a bit better; should I make a PR to master or wait until the refactoring gets merged? The issue is that I'm currently just conditioning on previous observations, but there isn't any reason you couldn't condition on arbitrary things.
@Rory-Finnegan Does that mean we could use `ConditionalRBM` for classification? If so, it opens up quite a lot of new possibilities :)
PR #22 is now open. Please check it out, and if it looks mostly good, you can base your changes on it. Otherwise, feel free to make a separate PR to master.
@dfdx In theory, yes, you could use a conditional RBM for classification. You'd just need to softmax the label predictions. I'm not sure how common that is, or whether there are papers comparing it to other common methods, but we could always test it on the MNIST dataset and see how it does.
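For reference, a generic numerically stable softmax might look like this in Julia (a sketch, not code from the package):

```julia
# Turn raw label activations into class probabilities.
function softmax(x::Vector{Float64})
    ex = exp(x - maximum(x))   # subtract the max for numerical stability
    ex / sum(ex)
end
```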
The PR looks good to me, so I'll just add my changes to it.
PR #23 to `refactoring` is open.
So, two things I've thought of while going through the `refactoring` branch:

1. I'm currently randomizing the ordering of patterns in my training matrices, but it would be nice if the context supported a randomize-batches flag to do this when creating batches (a sketch of the idea follows below).
2. The flexible floating-point precision discussed above.

Thoughts?
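A minimal sketch of the batch-randomization idea (the `randomize_batches` flag name is hypothetical):

```julia
# Shuffle example order once per epoch before slicing X (pixels x examples)
# into minibatches.
if get(context, :randomize_batches, false)
    X = X[:, randperm(size(X, 2))]
end
```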
Randomization of batch order is here. I need one more day for flexible precision (it should be trivial to implement, but I'm very short on time right now).
Awesome, thank you! I'm still getting some NaNs and Infs in the pseudo-likelihood calculation, so I'll probably have another PR once I figure out where they're coming from.
Regarding flexible precision, I just pushed a change that converts each batch to an array of the RBM's element type in the `fit()` function. Adding it to other parts of the code is trickier than I thought and raises several design questions; I will open a separate issue to address them.

Meanwhile, I see that this branch is mostly ok, so I'm going to merge it into master in 24 hours and discuss new features separately.
FYI, with the recent changes I'm getting a bunch of warnings like `WARNING: Option 'persistent_chain' is unknown, ignoring`. I'm guessing that `persistent_chain`, `dW_buf` and `dW_prev` just need to be added to `KNOWN_OPTIONS`?
Basically, no: they should be created during the first call to the corresponding method (recall that the context holds both options and buffers, and these 3 are exactly the latter). Do you need to pass them explicitly for some reason?
I wasn't explicitly passing them in, but I've figured out why I was getting the warnings and you haven't been. I was running multiple RBMs in a row with the same context, which resulted in `@get_array` setting the values in the context on the first call to `fit` and producing warnings on subsequent calls.
Ah, that makes sense. I think we can simply copy the input context (i.e. options only) instead of using and populating the original one. I will push the change.
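Presumably something along these lines (a hypothetical sketch of the fix, not the actual commit):

```julia
# Work on a shallow copy so internal buffers (persistent_chain, dW_buf,
# dW_prev) are added to a local copy, not to the caller's context.
function fit(rbm::RBM, X::Matrix, ctx::Dict)
    ctx = copy(ctx)
    # ... training loop populates ctx with buffers as before ...
    return rbm
end
```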
The `refactoring` branch is now merged into master. Please feel free to open separate issues to address further improvements.
As recently discussed in #9, it would be nice to refactor the project to better reflect different kinds of RBMs and different ways to learn them. Currently I can think of the following variations:
- RBM kinds
- gradient calculation
- weight update options
Given these requirements, I imagine a minimal RBM setup to look something like the sketch below (for all but `CLRBM` types), where `gradient()` and `update!()` are closures initialized with all needed options. Then a single method `fit()` can use these closures to learn the RBM and provide monitoring.

Note that this doesn't restrict implementations to a single type, but instead provides a contract, i.e. it specifies the fields required to provide compatibility between different methods for learning and inference.
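A minimal sketch of such a type, consistent with the fields discussed elsewhere in this thread (the exact original snippet isn't reproduced here, so the field list is an assumption):

```julia
# Sketch of the proposed RBM contract: plain data fields plus two closures
# that encapsulate the training configuration.
type RBM
    W::Matrix{Float64}        # couplings between visible and hidden units
    vbias::Vector{Float64}    # visible unit biases
    hbias::Vector{Float64}    # hidden unit biases
    gradient::Function        # closure computing the weight gradient (CD, PCD, ...)
    update!::Function         # closure applying the update (decay, sparsity, momentum, ...)
end
```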
@Rory-Finnegan @eric-tramel Does it make sense to you? Do your models fit into this approach?