Improve / Fix Weight Sharing - Githubissues

BVLC / caffe

Caffe: a fast open framework for deep learning.

http://caffe.berkeleyvision.org/

Other

34.12k stars 18.68k forks source link

Improve / Fix Weight Sharing #1211

Open shelhamer opened 10 years ago

shelhamer commented 10 years ago

Weight sharing as-is relies on a weight owner with which shared layers share their parameter blobs. This poses a few problems in relation to loss, loading and saving parameters, and weight initialization that are listed here for addressing.

[x] Fix incorrect momentum and history due to separation of shared weights #2866
[x] Fix the resuming / fine-tuning issue for shared weights; see https://github.com/BVLC/caffe/pull/959#issuecomment-55500404. Done in #594 as it turns out.
[x] Determine if there is actually a loss / weight ownership issue as asked at https://github.com/BVLC/caffe/pull/546/files#r16817721 by @ashafaei. [No, there is not –shelhamer]
[x] Save memory through accumulation #1977 by sharing diffs #2866
[ ] Load and save only the owned weights and not shared duplicates #2836 for hdf5
[ ] Figure out how snapshot / restore should resolve by layer or param name and fallback as needed
[ ] Only the owner should initialize weights. Currently unnecessary work and memory is expended filling all weights, and then these are discarded to share with the weight owners.
[ ] Die if weight fillers are defined in layers that don't own their parameters (the weights are properly initialized in this case, but only by ignoring the incorrect specification as written).

@jeffdonahue @longjon

jeffdonahue commented 10 years ago

Fix the resuming / fine-tuning issue for shared weights; see #959 (comment). Done in #594 as it turns out.

I just pushed a unit test for resuming from saved weights (4dc5bd0). It passes as expected, but fails when cherry-picked from 8dac339, before #594 was merged. Glad this was magically fixed, thanks @longjon!

ducha-aiki commented 10 years ago

Would you consider the tied weights also? i have tried to implement them by myself, but with current weight sharing scheme it seemed too complicated.

rodrigob commented 10 years ago

@ducha-aiki what is the difference between tied weights and shared weights ?

@shelhamer I can look into dying if fillers are defined where parameters are shared; if you tell me what is the "caffe way of dying" (LOG(FATAL) and then ?). Also, as example, for InnerProductLayer, can you share bias without sharing the product weights ?

ducha-aiki commented 10 years ago

@rodrigob Tied weights are used in autoencoders. If encoder weights = W, then decoder weights = W^T, i.e transposed ones. https://groups.google.com/forum/#!topic/theano-users/QilEmkFvDoE

shelhamer commented 10 years ago

@ducha-aiki @rodrigob autoencoder-style shared weights are already possible by Caffe weight sharing if the blobs are shared with PERMISSIVE dimensionality checking:https://github.com/BVLC/caffe/blob/master/src/caffe/proto/caffe.proto#L273-L281 and the transpose shape is defined in the deconv layers.

While blobs can be shared permissively so that they have the same total cardinality but different dimensions this doesn't cover everything for W, W^T pairs since the input-output swapped inner product weights aren't the transpose.

ducha-aiki commented 10 years ago

@shelhamer but the weights have different order in transposed matrix. I will check again, but when I have tried, that did not worked.

jeffdonahue commented 10 years ago

Yeah, it would not work for pairs of inner product layers where the weights are transposed (using permissive would probably give very bad results). It would require a little bit of additional implementation -- probably the easiest would be to add a "transposed weights" option to the inner product layer so that the layer pair could use the same weight matrix.

ducha-aiki commented 10 years ago

@jeffdonahue This is easy. The real problem are diffs, since they have not only different shape, but number of elements.

jeffdonahue commented 10 years ago

What? Why would the diffs be a different number of elements? I think I'm missing something...

ducha-aiki commented 10 years ago

@jeffdonahue Because size of diff == size of output. An example from MNIST autoencoder: name: "MNISTAutoencoder" input: "data" input_dim: 1 input_dim: 1 input_dim: 28 input_dim: 28 layers { bottom: "data" top: "encode1" name: "encode1" type: INNER_PRODUCT inner_product_param { num_output: 1000 } } layers { bottom: "encode1" top: "decode1" name: "decode1" type: INNER_PRODUCT inner_product_param { num_output: 784 } }

jeffdonahue commented 10 years ago

Right, the encode1 weights are 1000x784 (producing 1000D outputs from 784D inputs) and the decode1 weights have the transposed dimension, 784x1000 (producing 784D outputs from 1000D inputs). The weight gradients are the same dimension by definition.

shelhamer commented 9 years ago

We should keep #1659 in mind too.

yosipk commented 9 years ago

Mocha has TiedInnerProductLayer [http://mochajl.readthedocs.org/en/latest/user-guide/layers/computation-layer.html#TiedInnerProductLayer, source: https://github.com/pluskid/Mocha.jl/blob/master/src/layers/tied-inner-product.jl], I guess Caffe could be similar, along the lines of @jeffdonahue suggestion to add a "transposed weights" option to the inner product layer.

raingo commented 9 years ago

Do we have an update on these?

Shared weights are very important for recurrent nets.

Jim61C commented 7 years ago

Hi, Do we have an update on the 7th problem mentioned above?

"Only the owner should initialize weights. Currently unnecessary work and memory is expended filling all weights, and then these are discarded to share with the weight owners."

I am currently facing a problem having multiple FC layers sharing weights due to memory issue and I believe that it is due to the fact that even if I share weights between those FC layers, they are still being initialized and take extra memory at the creation of the network, any idea on workaround of this will be greatly appreciated!

Thanks!