VisualComputingInstitute / Beacon8

A Torch-inspired library for high-level deep learning with Theano.
MIT License

[Renamed] Remove `reset` and rework inits. #15

Closed lucasb-eyer closed 9 years ago

lucasb-eyer commented 9 years ago

While writing the optimizers example, I found out that `reset` doesn't work as we intended, for two reasons:

  1. It didn't get forwarded by containers
  2. It always created a new shared variable instead of resetting its value if one already exists.

This PR fixes both. I tested that it works for all layers except the CUDNN one, since I can't test that on my laptop, but it should behave exactly like the non-CUDNN conv layer.
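For illustration, the two fixes can be sketched like this (a minimal stand-in class replaces Theano's shared variable, and the `Linear`/`Sequential` stubs are simplified sketches, not the actual Beacon8 code):

```python
import numpy as np

class Shared:
    """Minimal stand-in for a Theano shared variable (illustration only)."""
    def __init__(self, value):
        self._value = np.asarray(value)
    def set_value(self, value):
        self._value = np.asarray(value)
    def get_value(self):
        return self._value

class Linear:
    def __init__(self, nin, nout):
        self.nin, self.nout = nin, nout
        self.weight = None
    def reset(self):
        init = 0.01 * np.random.standard_normal((self.nin, self.nout))
        if self.weight is None:
            self.weight = Shared(init)    # first call: create the variable
        else:
            # Fix 2: reset the value of the existing variable in place,
            # so compiled functions referencing it stay valid.
            self.weight.set_value(init)

class Sequential:
    def __init__(self, *modules):
        self.modules = list(modules)
    def reset(self):
        # Fix 1: containers forward reset to their children.
        for m in self.modules:
            m.reset()
```

The point of `set_value` over re-creation is that the shared variable's identity is preserved across resets.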

I reckon the code is somewhat verbose and maybe even non-obvious, but I couldn't think of a better, less convoluted way. I'd really like to keep `reset()`, as I think it's very useful for multiple-run experiments, which I always do in order to see whether an improvement is significant; with `reset()` there's no need to recompile anything between runs.

lucasb-eyer commented 9 years ago

Actually, there may be a better way. Here's a suggestion; let me know what you think:

  1. Add a `params` attribute to `Module` which is a list of learnable parameters (and their gradients), usually containing just `weight` and possibly `bias`. The `parameters` function will use the `params` list.
  2. Add an `_addparam(name, shape, init)` private method to `Module` which:
    • creates the shared variable for the parameter of the given shape,
    • initializes it with `init`, which could be a string like `"Xavier"` or `"PReLU"`, a number like `0.01` which will be multiplied by `np.random.standard_normal()`, or an array which will be used as-is,
    • creates the shared variable for the parameter's gradient of the same shape, initialized to 0,
    • adds the pair to the `params` list.
  3. Then we can have a generic `reset` method in `Module` which simply re-initializes each param. `Linear` and `Conv` layers will not need to implement `reset`, and `BatchNormalization` will implement it to zero its stats but use `Module.reset` for its weight/bias parameters.
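A rough sketch of the three points above, with plain numpy arrays standing in for Theano shared variables (the init-dispatch details, e.g. the exact PReLU scale, are my assumptions, not settled API):

```python
import numpy as np

class Module:
    def __init__(self):
        self.params = []  # list of (name, param, grad, init) entries

    def _add_param(self, name, shape, init):
        value = self._make_init(shape, init)
        grad = np.zeros(shape)                       # gradient, initialized to 0
        self.params.append((name, value, grad, init))
        return value

    @staticmethod
    def _make_init(shape, init):
        if isinstance(init, str):                    # "Xavier", "PReLU", ...
            fan_in, fan_out = shape[0], shape[-1]
            if init == "Xavier":
                scale = np.sqrt(2.0 / (fan_in + fan_out))
            elif init == "PReLU":
                scale = np.sqrt(2.0 / fan_in)        # He-style scale, an assumption
            else:
                raise ValueError("unknown init: " + init)
            return scale * np.random.standard_normal(shape)
        if np.isscalar(init):                        # e.g. 0.01
            return init * np.random.standard_normal(shape)
        return np.array(init)                        # array: used as-is

    def reset(self):
        # Point 3: generic reset, re-initializing each param in place.
        for name, value, grad, init in self.params:
            value[...] = self._make_init(value.shape, init)
            grad[...] = 0
```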

If you like the general idea, I can come up with a first implementation and we can further iterate on it, as I'm pretty sure that I'm still missing some culprits. Alternatively, we can stop and keep it at what this PR does for now.

lucasb-eyer commented 9 years ago

One thing I missed is that to truly reset the model, we would also need to delete (or reset) the optimizer's stored state. Deleting it forces recompilation, which defeats the point of `reset` altogether; resetting it means `reset` needs to know about the optimizer, since only the optimizer itself knows how to reset its state.

I thought more about it, and there's only one real advantage of `reset` that I could think of: avoiding recompilation across multiple runs of the same model+optimizer combo. And recompilation is not even that bad, since Theano caches compilation results; it only takes ~3-5s.

So, for me, we could also get rid of `reset` to reduce complexity (and thus the potential for bugs). What's your opinion?

ikostrikov commented 9 years ago

Everything is fine, but I would suggest using functions/classes instead of hard-coded names, so we can reduce code repetition.

For example,

def xavier(fan_in, fan_out, w=1, d=1): return ...  # don't remember the exact formula

model.add(bb8.Linear(..., ..., initialize_with=bb8.inits.xavier))

lucasb-eyer commented 9 years ago

Here is my proposal. I tested it with all current layers that have weights, i.e. Linear, BatchNormalization, and SpatialConvolution/CUDNN, using the MNIST example; but as usual, I wasn't able to test on Py2.

The main changes are:

  1. Remove `reset` for the aforementioned reasons.
  2. Add `_add_param` to `Module` and make use of it, including your suggestion of a function for specifying initialization.
    • This means there are no special `weight` and `bias` names anymore; learnable parameters need to be created using `_add_param`. There's a leading underscore in the name because it's only intended to be used by subclasses.
  3. Distinguish between "learnable" parameters, i.e. parameters with gradients (such as weight and bias), and parameters without gradients (such as BN's inference weight and bias). The reasons are 1) to share code and 2) that when saving a model we'll also want to save the parameters without gradients, and this makes that possible in a generic way. To clarify the difference in the code, I renamed `parameters()` to `learnable_parameters()`.
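The learnable/non-learnable split of point 3 could look roughly like this (a sketch with numpy arrays in place of shared variables; the signatures are illustrative, not the actual Beacon8 code):

```python
import numpy as np

class Module:
    def __init__(self):
        self._params = []  # (param, grad-or-None) pairs

    def _add_param(self, shape, init=0.01, learnable=True):
        if np.isscalar(init):
            param = init * np.random.standard_normal(shape)
        else:
            param = np.array(init)        # array: used as-is
        # Only learnable parameters get a gradient of the same shape.
        grad = np.zeros(shape) if learnable else None
        self._params.append((param, grad))
        return param

    def learnable_parameters(self):
        # Renamed from parameters(): only params that receive gradients.
        return [(p, g) for p, g in self._params if g is not None]

    def all_parameters(self):
        # For saving a model we also want the params without gradients,
        # e.g. BatchNormalization's inference statistics.
        return [p for p, _ in self._params]
```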

I'm open to comments/feedback.

lucasb-eyer commented 9 years ago

Only now do I remember why you originally wanted to name them `weight` and `bias`: they often need to be treated separately, e.g. for L1, L2, and max-norm regularization, distillation, and probably many more. So it was too naive of me to remove this distinction (in the sub-point of 2), and I'll add it back after some quality sleep.
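As an illustration of why the names matter, a typical L2 penalty touches only the `weight` attributes and skips biases (a hypothetical helper, not Beacon8 code):

```python
import numpy as np

def l2_penalty(modules, lam=1e-4):
    # Regularize weights only; biases are conventionally left unpenalized,
    # which requires being able to tell the two apart by name.
    total = 0.0
    for m in modules:
        if hasattr(m, "weight"):
            total += (np.asarray(m.weight) ** 2).sum()
    return lam * total
```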

ikostrikov commented 9 years ago

The changes to parameters make the code much less transparent without giving any obvious improvement. Please roll back the changes regarding them and I will merge.

lucasb-eyer commented 9 years ago

The improvements I saw were 1) layers don't need to remember to create the shared variables and 2) no "magic" attribute names. 2 just bothers me, but in practice it's not even a real problem. In retrospect, you're right that it's not worth the increase in complexity just for 1, as we can just as well add an assertion with a nice message at compile time. I'll revert these changes soon and let you know.

ikostrikov commented 9 years ago

But 1 and 2 are not real problems. The change, however, increases the complexity of the design and makes it much harder for potential users to create new modules: now they need to look up what these functions do, and so on. So they need a much deeper understanding of the design than before.

It also makes the library different from Torch, which is potentially confusing too.

It's OK if you want to add some helper functions for creating shared variables, but we should avoid replacing Torch's design for module parameters.

lucasb-eyer commented 9 years ago

Yep, yep, I already agreed in the previous comment :) I'll roll back these parts of the commit, hopefully tomorrow.

lucasb-eyer commented 9 years ago

There we go. I hope GitHub didn't send out an email for every single one of my force-pushes, or your mailbox is flooded by now :laughing:

So now we're back to the original implementation of parameters, with two added utility functions to make creation and initialization of parameters easier and more uniform.

Initializations are done by functions, as you suggested. More initializers are queued for another PR as soon as this one is through. I'm quite satisfied now; any more feedback from your side?
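Conceptually, the final design boils down to initializers being plain functions from a shape to an array, plus a utility that creates a parameter together with its zeroed gradient (a sketch with numpy arrays standing in for Theano shared variables; the names are illustrative, not the actual Beacon8 API):

```python
import numpy as np

def xavier(shape):
    # Glorot/Xavier initialization: variance 2 / (fan_in + fan_out).
    fan_in, fan_out = shape[0], shape[-1]
    return np.sqrt(2.0 / (fan_in + fan_out)) * np.random.standard_normal(shape)

def create_param_and_grad(shape, init=xavier, name=None):
    # One call creates the parameter and its gradient accumulator,
    # so layers can't forget to create either of the two.
    param = np.asarray(init(shape))
    grad = np.zeros_like(param)   # same shape, zeroed
    return param, grad
```

A layer would then call something like `create_param_and_grad((nin, nout), init=xavier, name="weight")`.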

ikostrikov commented 9 years ago

Looks great!