lucasb-eyer closed this 9 years ago
Actually, there may be. Here's a suggestion, let me know what you think:

1. Add a `params` attribute to `Module`, which is a list of learnable parameters (and their gradients), usually containing just `weight` and possibly `bias`. The `parameters` function will use the `params` list.
2. Add an `_addparam(name, shape, init)` private method to `Module` which:
    - creates the parameter according to `init`, which could be a string like `"Xavier"` or `"PReLU"`, a number like `0.01` which will be multiplied by `np.random.standard_normal()`, or an array which will be used as-is;
    - appends it to the `params` list.
3. Add a `reset` method in `Module`, which just resets each param with its initialization. `Linear` and `Conv` layers will not need to implement `reset`, and `BatchNormalization` will implement it for zeroing stats but use `Module.reset` for its weight/bias parameter.

If you like the general idea, I can come up with a first implementation and we can further iterate on it, as I'm pretty sure I'm still missing some corner cases. Alternatively, we can stop and keep it at what this PR does for now.
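To make the proposal concrete, here is a minimal sketch of what such a base class could look like. Plain NumPy arrays stand in for what would really be Theano shared variables, the string-valued inits (`"Xavier"`, `"PReLU"`) are omitted, and all names simply follow the proposal above — this is an illustration, not the actual implementation:

```python
import numpy as np

class Module:
    """Sketch of the proposed params/_addparam/reset design
    (NumPy stand-ins for Theano shared variables)."""

    def __init__(self):
        self.params = []  # list of parameter records

    def _addparam(self, name, shape, init):
        # `init` may be an array (used as-is) or a number (scale for
        # standard-normal noise); the string variants are omitted here.
        if isinstance(init, np.ndarray):
            value = init.copy()
        elif isinstance(init, (int, float)):
            value = init * np.random.standard_normal(shape)
        else:
            raise NotImplementedError("string inits like 'Xavier' omitted in this sketch")
        param = {"name": name, "shape": shape, "init": init, "value": value}
        self.params.append(param)
        return param

    def reset(self):
        # Re-apply each parameter's stored initialization.
        for p in self.params:
            if isinstance(p["init"], np.ndarray):
                p["value"] = p["init"].copy()
            else:
                p["value"] = p["init"] * np.random.standard_normal(p["shape"])
```

A `Linear` layer would then just call `self._addparam("weight", (nin, nout), 0.01)` in its constructor and inherit `reset` for free.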
One thing I missed is that to truly reset the model, we would also need to delete (or reset) the optimizer's stored state. Deleting it induces recompiling, which defeats the point of a reset altogether; resetting it instead means the reset needs to know about the optimizer, since only the optimizer itself knows how to reset its state.
I thought more about it, and there's really only one true advantage of having `reset` that I could think of: avoiding recompilation across multiple runs of the same model+optimizer combo. And recompilation is not even that bad, since Theano caches compilation results; it only costs ~3–5 s.

So, for my part, we could also get rid of `reset` to reduce complexity (and thus the potential for bugs). What's your opinion?
Everything is fine, but I would suggest using functions/classes instead of hard-coded names, so we can reduce code repetition. For example:
```python
def xavier(fan_in, fan_out, w=1, d=1): return ...  # don't remember the exact formula

model.add(bb8.Linear(..., ..., initialize_with=bb8.inits.xavier))
```
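For reference, the formula left open above is usually the Glorot & Bengio (2010) uniform variant, sampling from ±√(6/(fan_in+fan_out)). A sketch — note that my reading of the extra `w`/`d` arguments as additional (e.g. spatial) dimensions is a guess, not something from the comment:

```python
import numpy as np

def xavier(fan_in, fan_out, w=1, d=1):
    # Glorot-uniform: bound chosen so layer activations keep roughly
    # unit variance in both the forward and backward pass.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    # Fold the (assumed) extra dimensions into the shape and drop
    # the singleton axes for the common w=1, d=1 case.
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out, w, d)).squeeze()
```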
Here is my proposal. I tested it with all current layers that have weights, i.e. `Linear`, `BatchNormalization`, and `SpatialConvolution`/`CUDNN`, using the MNIST example, but as usual I wasn't able to test on Py2.
The main changes are:

1. Removed `reset`, due to the aforementioned reasons.
2. Added `_add_param` to `Module` and made use of it, including your suggestion of a function for specifying initialization.
    - Layers don't use the `weight` and `bias` names anymore; learnable parameters need to be created using `_add_param`. There's a leading underscore in the name because it's only intended to be used by subclasses.
3. Renamed `parameters()` to `learnable_parameters()`.

I'm open to comments/feedback.
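Roughly, a layer would then look something like the following. This is only an illustrative sketch in plain NumPy (the real code would create Theano shared variables), and the `const` initializer factory is an invented example, not part of the PR:

```python
import numpy as np

def const(value):
    # Example initializer factory: maps a parameter shape to an array.
    return lambda shape: np.full(shape, float(value))

class Linear:
    """Sketch of a layer creating its parameters via _add_param
    instead of magic `weight`/`bias` attribute names."""

    def __init__(self, nin, nout, init_w, init_b=const(0.0)):
        self._params = {}
        self._add_param("weight", (nin, nout), init_w)
        self._add_param("bias", (nout,), init_b)

    def _add_param(self, name, shape, init):
        # `init` is a function of the shape, as suggested above.
        self._params[name] = init(shape)

    def learnable_parameters(self):
        return list(self._params.values())

    def forward(self, x):
        return x @ self._params["weight"] + self._params["bias"]
```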
Only now do I remember why you originally wanted them named `weight` and `bias`: they often need to be treated separately, e.g. for L1, L2, and max-norm regularization, distillation, and probably many more. So it was too naive of me to remove this distinction (in the sub-point of 2.), and I'll add it back in after some quality sleep.
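As a small illustration of why the distinction matters (hypothetical helper, not code from this PR): weight decay is conventionally applied to weights only, so the regularizer has to be able to tell the two apart.

```python
import numpy as np

def l2_penalty(named_params, strength=1e-4):
    # Sum of squares over `weight` parameters only; biases are
    # conventionally left unregularized.
    return strength * sum(
        np.sum(np.square(value))
        for name, value in named_params
        if name == "weight"
    )
```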
The changes with parameters make the code much less transparent without giving any obvious improvement. Please roll back the changes regarding them and I will merge.
The improvements I saw were: 1. layers don't need to remember to create the shared variables, and 2. no "magic" attribute names. 2 just bothers me, but in practice it's not even a real problem. In retrospect, you're right that it's not worth the increase in complexity just for 1, as we can just as well add an assertion with a nice message at compile-time. I'll revert these changes soon and let you know.
But 1 and 2 are not real problems. The change, however, increases the complexity of the design and makes it much harder for potential users to create new modules: now they need to look up what these functions do, and so on. So they need a much deeper understanding of the design than before.

Also, it makes this library different from Torch, which is also potentially confusing.

It's fine if you want to add some helper functions for creating shared variables, but we should avoid replacing Torch's design for a module's parameters.
Yep, yep, I already agreed in the previous comment :) I'll roll back these parts of the commit, hopefully tomorrow.
There we go. I hope GitHub didn't send out an email for every single one of my force-pushes, or your mailbox is flooded by now :laughing:
So now we're back to the original implementation of parameters, with two added utility functions to make the creation and initialization of parameters easier and more uniform.

Initializations are done by functions, as you suggested. More initializers are queued for another PR as soon as this one is through. I'm quite satisfied now; any more feedback from your side?
Looks great!
While writing the optimizers example, I found out that `reset` doesn't work as we intended, for two reasons. This PR fixes both, and I tested that it works for all but the CUDNN layer, since I can't test that on my laptop; it should be exactly the same as the non-CUDNN conv layer, though.
I reckon the code is somewhat verbose and maybe even non-obvious, but I couldn't think of a better, non-convoluted way. I'd really like to keep `reset()`, as I think it's very useful for multiple-run experiments, which I always do in order to see whether an improvement is significant or not, because with `reset()` there's no need to recompile anything between runs.
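The multiple-run pattern in question looks roughly like this. `ToyModel` is a stand-in invented for illustration; in the real library the point is that the (compiled) model and optimizer objects are built once and only their state is re-initialized between runs:

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in for a compiled model with a reset()
    like the one discussed above."""

    def __init__(self, shape, scale=0.01):
        self.shape, self.scale = shape, scale
        self.reset()

    def reset(self):
        # Re-draw the parameters from their initialization in place,
        # so nothing needs to be rebuilt (or, in Theano, recompiled).
        self.weight = self.scale * np.random.standard_normal(self.shape)

# Build (and, in the real library, compile) once...
model = ToyModel((4, 2))
scores = []
for run in range(3):
    model.reset()  # ...then only re-initialize between runs.
    scores.append(float(np.sum(model.weight ** 2)))
```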