Open tumble-weed opened 10 years ago
Hi, I will need some more information to be able to help you with this, what was the code that resulted in this error? It would be useful to have a snippet that reproduces this. Where did you put "mb_size=1"?
Sander
Hi, sorry for the late reply. I understood my error and rectified it.
However, I had a query: is the sparsity updater implemented the same as in "Sparse deep belief net model for visual area V2" (http://ai.stanford.edu/~ang/papers/nips07-sparsedeepbeliefnetworkv2.pdf), i.e. can it be used in example_convolutionalRBM to reproduce Honglak Lee's convolutional RBM results?
Thanks for your help
The implementation follows the formulation in "Biasing RBMs to manipulate latent selectivity and sparsity" by Goh et al., 2010. The sparsity penalty is the cross entropy between the activations and the sparsity target, which in my opinion is the most natural way to go about this.
The penalty introduced by Lee et al. is different afaik, they minimise the mean squared error between the activations (averaged across the batch) and the sparsity target instead. So if you want to reproduce their results exactly, you will have to use a different updater. That said, the SparsityUpdater (https://github.com/benanne/morb/blob/master/morb/updaters.py#L59) is 10 lines of code, so it shouldn't be too hard to adapt :) If you encounter any issues, feel free to contact me and I'll take a look.
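To make the distinction concrete, the two penalties can be sketched in plain numpy (illustrative only; these helper names are mine, not morb's, and in morb the penalty would be expressed symbolically in Theano):

```python
import numpy as np

def cross_entropy_sparsity(acts, target):
    # Goh et al.-style penalty: cross entropy between each activation
    # and the scalar sparsity target, averaged over units and examples.
    return -np.mean(target * np.log(acts) + (1 - target) * np.log(1 - acts))

def squared_error_sparsity(acts, target):
    # Lee et al.-style penalty: squared error between the batch-averaged
    # activation of each unit and the sparsity target.
    mean_acts = acts.mean(axis=0)  # average over the minibatch
    return np.sum((target - mean_acts) ** 2)
```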
Hi, I modified the Sparsity updater you had in your code. I would be much reassured of my implementation if you could give your opinion on whether it is correct or not.
This is for the equation in "Sparse deep belief net model for visual area V2" (http://ai.stanford.edu/~ang/papers/nips07-sparsedeepbeliefnetworkv2.pdf) [1]. sparsity_target is now simply a number like 0.02 (like the value used in "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations" [2] by Lee), which is the minibatch sparsity of a unit I am aiming for.
hm0_given_v0 is E[h | v = data], i.e. the expectation of the hidden activities given the data.
sp_u is the regularizer expanded out according to the paper [1].
```python
class SparsityUpdater_Lee(Updater):
    # sparsity_target: the target activations
    # ...
    self.sparsity_target = sparsity_target
    # ...
```

This will be run as (in the portion of the code where updates are defined):

```python
pu = var + 0.000001 * updaters.CDUpdater(rbm, var, s)
su = 0.000001 * sparsity_cost * updaters.SparsityUpdater_Lee(rbm, var, sparsity_target, s)
umap[var] = pu + su
```
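As a sanity check on the maths, the gradient of Lee's penalty with respect to the batch-mean activations can be worked out in plain numpy (the function name is mine, not the Theano expression the updater would actually use). Note that when the square is expanded, the sparsity_target**2 term is constant and contributes nothing to the gradient:

```python
import numpy as np

def lee_sparsity_grad(mean_acts, target):
    # Penalty per unit: (target - m)^2 = target^2 - 2*target*m + m^2.
    # Its derivative w.r.t. m is -2*(target - m); the target^2 term
    # vanishes under differentiation.
    return -2.0 * (target - mean_acts)
```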
Also, I was reading through the code and was having trouble understanding the purpose behind ProxyUnits and functions like complete_unit_list. Is there any place (some forum etc.) where you might have discussed these implementation details, so I can have a look?
Thanks in advance.
That looks like it should work. You can probably drop the term self.sparsity_target**2 since it disappears after taking the gradient anyway.
Documentation is a work in progress, I hope to write an overview of how all the parts fit together soon. ProxyUnits in particular are useful for unit types which generate more than one term in the energy function.
For example, if v are Gaussian units, there will be a term in v and a term in v^2 in the energy function. For standard Gaussian units, the v^2 term has no parameters associated with it, so the only purpose of the ProxyUnits in that case is to make the computed energies correct. For Gaussian units with learnt precision however, there will be parameters associated with both terms, so we need separate Units instances to be able to associate parameters with each one.
Have a look at LearntPrecisionGaussianBinaryRBM to see an example of this. Wm are the weights associated with the v term (mean, as in any normal RBM), and Wp are the weights associated with the v^2 term (precision).
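As a purely illustrative sketch of the two energy terms (plain numpy with assumed sign conventions and shapes; this is not the morb implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=4)                        # visible (Gaussian) state
h = rng.integers(0, 2, size=3).astype(float)  # binary hidden state
Wm = rng.normal(size=(4, 3))                  # parameters on the v term (mean)
Wp = -0.5 * np.ones((4, 3))                   # parameters on the v^2 term (precision)
bvm = rng.normal(size=4)
bvp = -0.5 * np.ones(4)

# One visible state, two energy terms: v enters linearly, and a 'proxy'
# exposes v**2 so separate parameters (Wp, bvp) can attach to the
# quadratic term.
v_proxy = v ** 2
energy = -(v @ (Wm @ h + bvm)) - (v_proxy @ (Wp @ h + bvp))
```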
Hi, sorry to bother you, but I had a query about Gaussian and learnt-precision Gaussian units. I was reading Prof. Hinton's guide to training RBMs (http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf). I was under the impression that (unlearnt) Gaussian units were simply visible units where the sigma and the mean are fixed. In case I am correct in my assumption, what are the fixed values for these parameters, i.e. where can I find them in the code?
In case I am not correct about their function, what exactly are Gaussian units? Could you recommend some publication where I could understand their use?
I think I am basically confused about how to write the energy equation in terms of Wm, Wp, bvm and bvp, and correlate this with the equation presented in Prof. Hinton's paper. Could you help me out here?
Thank you for your time
'Regular' Gaussian units have a mean which is dependent on the input. It is definitely not fixed, else you couldn't really learn much with them :)
Regarding fixed variance, there are two flavours: sometimes the variance is just assumed to be fixed at 1.0 (which is true for the Morb implementation), but of course this is not desirable if you want to sample from the model. So this approach only really works if you're using a mean field approximation.
The other option is to give each unit a learnt, but constant variance (constant in the sense of not depending on the input). These are the sigma_i in section 13.2 of Hinton's guide. This is not implemented in Morb at the moment, because they require a bit of a different treatment than other parameters.
Regarding the second part of your question, writing the energy equation in terms of Wm, Wp, bvm and bvp is only possible for Gaussian units with learnt precision, which are parameterised so that the variance (= 1/precision) of each unit also depends on the input, i.e. it is not constant across the entire dataset. The Gaussian RBM that Hinton talks about has no Wp or bvp parameters.
For more information on Gaussian RBMs with learnt precision, check out Section 2 of Le Roux et al., 2011, where this model was introduced. They describe a number of different RBM models in order of increasing complexity and give energy functions and conditional distributions for each. You can find it here: http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00086
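The connection between the energy terms and the conditional distribution can be sketched by completing the square (plain numpy; the names Wm, Wp, bvm, bvp follow this thread, but the sign conventions are assumptions, so treat this as a sketch rather than the morb implementation):

```python
import numpy as np

def gaussian_conditional(h, Wm, bvm, Wp, bvp):
    # If, for a given h, the energy contributes a*v^2 - b*v per visible
    # unit with a > 0, then p(v | h) ∝ exp(-a*v^2 + b*v) is Gaussian
    # with mean = b / (2a) and variance = 1 / (2a). For learnt-precision
    # units both a and b depend on h; in Hinton's Gaussian RBM, a is fixed.
    b = Wm @ h + bvm      # coefficient of the linear v term
    a = -(Wp @ h + bvp)   # coefficient of the quadratic term (assumed > 0)
    return b / (2.0 * a), 1.0 / (2.0 * a)
```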
Hi,
Thanks for your prompt reply, saved me a lot of time!
Also, I have been trying the convolutional example on MNIST. While it seems to work well when the receptive field (filter dimensions) is 28x28 (the size of the full MNIST image), it gives strange reconstructions when I change the dimensions to 8x8.
In fact, I have to reduce the learning rate greatly, to a value of 0.00001, for the reconstructions to look reasonable (like digits) at the start. May I ask if you also obtain similar results?
At the bottom of the mail is one of my initial reconstructions for the digit 8. All the RBM parameters are as in the example, except the filter dimensions being set to 8x8 and mb_size to 1. It doesn't look anything like digits by the 10th epoch.
Also I thought I should ask. I recently came across Prof. Lee's CRBM code in Matlab. It is quite slow, but upon reading it I found something I haven't seen used before.
Before calculating the probabilities of the hidden layer, he multiplies their input activities by a factor between 50 and 100. (He varies this factor from 50 to 100 over about 70 epochs, and it is constant from that point.) The strangest thing was that when I removed this multiplying factor, I was unable to obtain any sort of sharp filters. This was true for the Kyoto natural images he trained on, as well as for MNIST. I am wondering about the theoretical justification behind this. What would be your thoughts on this?
In case you want to have a look, I have attached it herewith. The relevant function in which this happens is tirbm_inference, in the line `poshidexp2(:,:,b) = 1/(pars.std_gaussian^2).*(poshidexp2(:,:,b) + hbias_vec(b));`
Thanks [image: Inline image 1]
> In fact, I have to reduce the learning rate greatly, to a value of 0.00001, for the reconstructions to look reasonable (like digits) at the start. May I ask if you also obtain similar results?
I honestly can't say, I haven't toyed with this code (or in fact MNIST) in quite a while. Watch out with the 'reasonable reconstructions' criterion though, this also depends on the mixing rate of the alternating Gibbs chain used to generate reconstructions, and this can be completely unrelated to how well the data is modeled. I discussed this in more detail here: http://metaoptimize.com/qa/questions/12592/monitoring-the-training-of-restricted-boltzmann-machines
> Before calculating the probabilities of the hidden layer, he multiplies their input activities by a factor between 50 and 100. The strangest thing was that when I removed this multiplying factor, I was unable to obtain any sort of sharp filters. This was true for the Kyoto natural images he trained on, as well as for MNIST. I am wondering about the theoretical justification behind this. What would be your thoughts on this?
I'm not sure if there is any theoretical justification, this is the first time I hear of it at any rate. This is essentially just a reparameterisation, so the model stays the same, the parameters are just different. It does affect optimization though, the relative magnitude of the gradients will be different. So that might be the motivation behind it.
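The reparameterisation point can be illustrated with a one-parameter toy example (plain numpy; names are hypothetical): scaling the pre-sigmoid input by a factor c while dividing the stored weight by c leaves the unit's activation unchanged, but the gradient with respect to the stored weight picks up a factor of c.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, w, c = 0.3, 0.8, 50.0
p1 = sigmoid(w * x)
p2 = sigmoid((c * x) * (w / c))  # same probability, reparameterised

# Gradient of the activation w.r.t. the stored parameter:
g1 = x * p1 * (1 - p1)           # d p1 / d w
g2 = (c * x) * p2 * (1 - p2)     # d p2 / d (w/c): c times larger
```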
Thank you for your helpful reply. Much appreciated, it has given me some food for thought.
Regarding Prof. Lee's implementation, I ended up making a Theano-based convolutional RBM, which seems to work without his reparameterization. But your interpretation that it effectively changes the relative learning rates is something I'll try out.
thanks
I can't figure out how to correct this error:
```
ValueError: the batch size in the image (2) at run time is different than at build time (10) for the ConvOp. Apply node that caused the error: ConvOp{('imshp', (1, 28, 28)),('kshp', (28, 28)),('nkern', 50),('bsize', 10),('dx', 1),('dy', 1),('out_mode', 'valid'),('unroll_batch', 5),('unroll_kern', 2),('unroll_patch', False),('imshp_logical', (1, 28, 28)),('kshp_logical', (28, 28)),('kshp_logical_top_aligned', True)}(Subtensor{int64:int64:}.0, Subtensor{::, ::, ::-1, ::-1}.0) Inputs shapes: [(2, 1, 28, 28), (50, 1, 28, 28)] Inputs strides: [(6272, 6272, 224, 8), (6272, 6272, -224, -8)]
```
I made mb_size=1 and it works, but I'd like to work with larger minibatches, so can you please help me out here?
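It looks like the ConvOp was compiled for a fixed batch size of 10 while the last minibatch only contained 2 examples, suggesting the dataset size is not divisible by mb_size. A common workaround (a generic sketch, not morb-specific code) is to drop the trailing examples so every minibatch is full:

```python
import numpy as np

def trim_to_full_minibatches(data, mb_size):
    # Drop trailing examples so every minibatch has exactly mb_size rows;
    # useful when an op is compiled for a fixed batch size.
    n = (data.shape[0] // mb_size) * mb_size
    return data[:n]
```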