IDSIA / brainstorm

Fast, flexible and fun neural networks.

Handler-specific storage requirements in layers #29

Closed · untom closed this issue 9 years ago

untom commented 9 years ago

While implementing convolutions/pooling, I've stumbled upon the following problem: depending on how you implement the operations, you might need more or less storage. Specifically, the GPU and CPU implementations might need different amounts of it, or might not need it at all. Currently I have two examples for this:

- the argmax of max-pooling, which the CPU implementation has to record in the forward pass and reuse in the backward pass, while the cuDNN implementation doesn't need it at all
- the cuDNN descriptors (tensor/pooling/convolution descriptors), which only the GPU implementation needs

In both cases, one of the two handlers needs additional storage, while the other doesn't. What's even weirder: the argmax can be seen as a buffer, and could be handled by the buffer manager. However, that'd lead to wasting memory on the GPU, where we'd allocate the buffer but never use it (which might also be confusing to users who inspect these buffers expecting them to mean something). The descriptors OTOH are cudnn-specific structures and probably not meant to be stored in buffers.
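To make the asymmetry concrete, here is a minimal 1-D sketch of the CPU case (the function names and signatures are made up for illustration, not brainstorm's actual handler API); a cuDNN-based handler would need none of this argmax storage, but would instead have to create and keep alive pooling/tensor descriptors that the CPU side has no use for:

```python
import numpy as np

# Hypothetical CPU-side 1-D max pooling with non-overlapping windows:
# the backward pass reuses the argmax positions recorded in the forward
# pass, so the CPU handler needs an extra index buffer that a
# cuDNN-based handler would never touch.
def maxpool_forward_cpu(inputs, window, outputs, argmax):
    for i in range(outputs.shape[0]):
        patch = inputs[i * window:(i + 1) * window]
        argmax[i] = i * window + np.argmax(patch)   # handler-specific storage
        outputs[i] = patch.max()

def maxpool_backward_cpu(out_deltas, argmax, in_deltas):
    in_deltas[:] = 0
    for i in range(out_deltas.shape[0]):
        in_deltas[int(argmax[i])] += out_deltas[i]  # consume the saved indices
```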

I can think of two solutions:

1. Add something like `handler.allocate_pool/conv_specific_memory(...)` that returns some sort of opaque data structure (maybe a list of descriptors/allocations), which is then stored within each layer and always passed to the conv/pooling methods:

   ```python
   # in layer ctor:
   self._pooling_data = self.handler.allocate_pool_specific_memory()

   # in forward pass
   def forward_pass(...):
       # each handler implementation is free to ignore the last argument if it doesn't need it
       self.handler.conv2d_forward_batch(inputs, window, outputs, pad, stride, self._pooling_data)
   ```

2. Allocate/deallocate the cuDNN-specific stuff in each call, and make `argmax` an internal buffer of the pooling layer (see the sketch below).
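A rough sketch of what option 2 could look like on the layer side (all names, the shape template, and the buffer layout are assumptions for illustration, not the actual brainstorm API):

```python
# Illustrative sketch of option 2: the pooling layer declares "argmax"
# as an ordinary internal buffer, the CPU handler fills and reuses it,
# and the GPU handler ignores it and (re)creates its cuDNN descriptors
# inside every call instead of storing them in the layer.
class Pooling2DLayerSketch:
    def get_internal_structure(self):
        # extra per-timestep/per-sample storage requested from the buffer manager
        # (the shape spec below is purely illustrative)
        return {'argmax': ('T', 'B', self.output_size)}

    def forward_pass(self, buffers):
        self.handler.pool2d_forward_batch(
            buffers.inputs.default, self.window, buffers.outputs.default,
            self.stride, buffers.internals.argmax)  # ignored by the GPU handler
```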

I'm not super happy with either solution, since both are slightly ugly. I like solution 1 a bit more, but it has the additional drawback of making the API more complicated. What do you guys think?

flukeskywalker commented 9 years ago

How about this strategy:

untom commented 9 years ago

One problem with that: argmax contains integers. However, all of our internals are assumed to be floating point numbers, and there currently is no way to request a different dtype.

flukeskywalker commented 9 years ago

Yes, these and dropout masks, for example, will have to be float for now. We can discuss plans to work around this in the future, but it would involve additional kernels.
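For what it's worth, a minimal standalone sketch of that workaround (purely illustrative, plain NumPy): the integer indices are written into a float buffer and cast back to int when they are consumed. Single-precision floats represent integers exactly up to 2^24, so the round trip is lossless for realistic buffer sizes.

```python
import numpy as np

# Integer indices round-trip losslessly through a float32 buffer as long
# as they stay below 2**24, which is what "argmax will have to be float
# for now" amounts to in practice.
indices = np.array([0, 7, 1023, 2**20], dtype=np.int64)
float_buffer = np.zeros(indices.shape, dtype=np.float32)   # what the buffer manager hands out
float_buffer[:] = indices                                   # forward pass writes indices as floats
recovered = float_buffer.astype(np.int64)                   # backward pass casts them back
assert np.array_equal(recovered, indices)
```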