IDSIA / brainstorm

Fast, flexible and fun neural networks.

Changing the conv memory layout #33

Closed Qwlouse closed 8 years ago

Qwlouse commented 9 years ago

We just briefly discussed this, and I think it would be good to change the buffer layout for our convolutional layers from format 1 to format 2:

  1. (time, batch_size, channels, height, width)
  2. (time, batch_size, height, width, channels)

Pros

    • Batch normalization for conv layers would be easier, because there you want to estimate the mean/std over all of time, batch_size, height, and width, but not channels, which you could then do with a single flatten operation (see the sketch below).
    • When writing a 2D-RNN or 2D-LSTM implementation, it would probably be possible to do the matrix multiplication for the recurrent matrices with a single BLAS call over all images in the batch, because only a single stride is needed instead of two.
    • It makes more sense from a memory-layout perspective, because each filter of a convolution uses all channels, but not the entire width/height.
    • OpenCV and apparently caffe2 use that format.

Cons

    • Caffe uses format 1.
    • We need to change our code :-(
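For illustration, here is a minimal numpy sketch (names and shapes are my own, not brainstorm code) of the batch-normalization point: with channels last, the per-channel statistics follow from a single reshape.

```python
import numpy as np

# (time, batch_size, height, width, channels), i.e. format 2
t, b, h, w, c = 5, 4, 8, 8, 16
x = np.random.randn(t, b, h, w, c)

flat = x.reshape(-1, c)                 # one flatten: (t*b*h*w, c)
mean = flat.mean(axis=0)                # per-channel mean over time, batch, h, w
std = flat.std(axis=0)                  # per-channel std
x_norm = (x - mean) / (std + 1e-5)      # broadcasts over the trailing channel axis
```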
flukeskywalker commented 9 years ago

I think nervanagpu also uses NHWC. I vote for switching.

untom commented 9 years ago

Sounds like changing has more advantages than disadvantages, so the change is fine by me.

flukeskywalker commented 9 years ago

@untom Can we replace the Numpy conv implementation with a Cython one too then (probably forked from the same place)? It's probably not worth rewriting the version with fancy indexing for now.

flukeskywalker commented 9 years ago

It looks like the following changes are needed:

  1. Modify CPU ops for conv and pooling, both forward and backward.
  2. Switch tensor format for cuDNN to CUDNN_TENSOR_NHWC from CUDNN_TENSOR_NCHW.
  3. Modify the data preparation script for CIFAR-10.

Is there anything else?
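For reference, here is a rough numpy sketch of what point 3 amounts to (the function name is just illustrative, not the actual script): CIFAR-10 batches come as flat rows of 3072 values in channel-major order, so only the final transpose changes.

```python
import numpy as np

def to_nhwc(raw_batch):
    """raw_batch: (N, 3072) uint8 rows from the CIFAR-10 pickle files."""
    nchw = raw_batch.reshape(-1, 3, 32, 32)   # old layout: (N, C, H, W)
    return nchw.transpose(0, 2, 3, 1)         # new layout: (N, H, W, C)
```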

untom commented 9 years ago

I tried replacing the conv implementation with a Cython one (the one used in cudarray), but it was much slower than the current im2col-based one.

flukeskywalker commented 9 years ago

I started to make this change, but it looks like cuDNN v3 still does not fully support NHWC. In particular, the docs for cudnnPoolingForward list CUDNN_STATUS_NOT_SUPPORTED for the case when the wStride of the input or output tensor is not 1.

There is a forum comment saying that cuDNN v4 will support it here: https://devtalk.nvidia.com/default/topic/875570/?comment=4680667. I'm not sure if this is official.

Our choices are:

  1. Stick with NCHW.
  2. Support both NCHW and NHWC.
  3. Switch to NHWC and use our own CUDA kernels for pooling (see the sketch below).
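For option 3, here is a reference-only numpy sketch (my own, not the proposed CUDA kernels) of what an NHWC max-pooling forward pass would have to compute:

```python
import numpy as np

def max_pool_nhwc(x, size=2, stride=2):
    """x: (batch, height, width, channels); non-overlapping windows assumed."""
    n, h, w, c = x.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.empty((n, oh, ow, c), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            window = x[:, i * stride:i * stride + size,
                          j * stride:j * stride + size, :]
            out[:, i, j, :] = window.max(axis=(1, 2))   # max over each window
    return out
```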
Qwlouse commented 9 years ago

I don't like NCHW. IMHO it's just a poor choice, so my vote is still for switching to NHWC. The forum post indicates that the conv backward pass might also not work with NHWC, which might be a show-stopper. In that case we can maybe wait for the cuDNN v4 release. Other than that, I suggest option 3.

untom commented 8 years ago

Sooo.... I guess this is not actionable until cuDNN v4 is out? Do they have any release date? (in the linked devtalk thread it just says "soon" o_0 ).

flukeskywalker commented 8 years ago

If it's just about pooling, we can use our own CUDA kernels for it. If some aspects of conv are also not supported, then we have to wait until cuDNN v4.

BTW, there seem to be some cuDNN-specific things like conv mode, pref, workspace, etc. which we're not handling and probably should. Also, do you know where the latest cuDNN wrapper code is? GitHub appears out-of-date.

untom commented 8 years ago

As far as I understood from the linked threads, convolution with NHWC is not supported in cuDNN right now.

Wrapper code: it seems to be stuck at cuDNN v2, so we'll have to update it ourselves (or roll our own) -- in which case I'd rather wait for cuDNN v4 instead of doing the same work twice.

flukeskywalker commented 8 years ago

The threads don't really appear to be official. As far as I can see in the cuDNN docs, only pooling does not support NHWC. For conv, I know that at least the forward pass supports it (I tried to migrate to NHWC in a private branch), although I am not sure whether all conv algos are supported. Also, I did not try to run the conv backward pass. The conv docs say nothing about NHWC being unsupported, though.

I'll try to find out when cuDNN v4 is expected to be out.

flukeskywalker commented 8 years ago

Any issues with switching the convention for convolution layer weights to NHWC as well?

flukeskywalker commented 8 years ago

I have finished this migration in: https://github.com/Qwlouse/brainstorm/tree/NHWC_format

Firstly, the CIFAR-10 example is now much slower on the GPU. I think this is expected due to the small network and image sizes. We should use prepared calls to the kernels for a small speed-up.
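To sketch what I mean by prepared calls (the kernel below is a placeholder, not one of our actual kernels): with PyCUDA, declaring the argument types once via prepare() lets each launch skip the per-call argument processing.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *x, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= alpha;
}
""")
scale = mod.get_function("scale")
scale.prepare("Pfi")                      # pointer, float, int -- declared once

x_gpu = gpuarray.to_gpu(np.ones(1024, dtype=np.float32))
n = int(x_gpu.size)
block = (256, 1, 1)
grid = ((n + block[0] - 1) // block[0], 1)
# in the hot loop, prepared_call avoids re-packing the arguments every time
scale.prepared_call(grid, block, x_gpu.gpudata, 0.5, n)
```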

Secondly, the CIFAR-10 network learns more slowly now, and I am not sure why. I get to a loss of ~1.25 in 10 epochs, whereas with cuDNN it was around ~1.0 after 10 epochs. Learning remains slow after that, with about 63% validation accuracy after 20 epochs (vs. 71% for cuDNN+NCHW).

flukeskywalker commented 8 years ago

BTW, this branch also contains the SoftmaxCE layer, which also replaces the Classification layer in the helper tool for building classifiers.

Update: It appears that using Classification instead of a fully connected layer + the new SoftmaxCE gives the same performance as cuDNN, which implies that the issue is somewhere in the usage of SoftmaxCE. This should be easy to fix.

flukeskywalker commented 8 years ago

Alright, I fixed the issue that was causing the worse performance. This issue is basically solved now.

Summary

  1. We now use im2col and col2im from caffe2 to do convolutions (see the sketch after this list).
  2. cuDNN is no longer required for convolution/pooling.
  3. cuDNN ops are in a separate handler, BUT cuDNN v3 does not currently support NHWC, so this handler cannot be used yet. It will become operational with the release of cuDNN v4.
  4. Small CNNs like the CIFAR-10 example run much slower compared to cuDNN (20x to 25x). However, with larger networks and images (such as AlexNet), speed will be much better (similar to Caffe/Caffe2).
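For reference, an illustrative numpy sketch (not the caffe2 kernels, and the weight layout here is only my assumption) of the im2col idea from point 1: patches are laid out so that the whole convolution becomes a single matrix multiply.

```python
import numpy as np

def conv2d_nhwc_im2col(x, w):
    """x: (batch, h, w, in_c); w: (kh, kw, in_c, out_c); stride 1, no padding."""
    n, h, wd, in_c = x.shape
    kh, kw, _, out_c = w.shape
    oh, ow = h - kh + 1, wd - kw + 1
    # im2col: each output position gets one row holding its (kh*kw*in_c) patch
    cols = np.empty((n, oh, ow, kh * kw * in_c), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[:, i, j, :] = x[:, i:i + kh, j:j + kw, :].reshape(n, -1)
    # one GEMM: (n*oh*ow, kh*kw*in_c) x (kh*kw*in_c, out_c)
    out = cols.reshape(-1, kh * kw * in_c).dot(w.reshape(-1, out_c))
    return out.reshape(n, oh, ow, out_c)
```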
flukeskywalker commented 8 years ago

We would like to make NHWC the standard going forward, but I'd also like to keep support for NCHW+cuDNN v3 around in a separate branch until cuDNN v4 comes out. What should the strategy be?

Qwlouse commented 8 years ago

I see two options:

  1. Create a legacy branch from the current master, and only then merge in the NHWC changes. We can keep that branch around, but keeping it up-to-date will be tricky and probably involve some cherry-picking.
  2. Create a release by tagging master. We won't even try to keep that up-to-date, but you can always go back and use this (pre-)release.

I favor the 2nd.

flukeskywalker commented 8 years ago

I'm currently keeping the NHWC branch up to date by merging master into it. Doing an NHWC pre-release is fine, but it should include most API changes. One thing that's missing from master but is in NHWC_format is the SoftmaxCE layer.

Let's move the SoftmaxCE layer and associated changes to master, and finish any planned API changes soon (in a week). Then, as you said, we tag a release and merge the branch in. Okay?

Qwlouse commented 8 years ago

Fine with me

Qwlouse commented 8 years ago

I've cherry-picked the SoftmaxCE layer from the NHWC branch.

flukeskywalker commented 8 years ago

Caution: the fix I later made to get_in_out_layers_for_classification has not been cherry-picked yet. The activation for the fc_layer needs to be set to 'linear' (it's relu by default). Also, I suggest renaming this tool to get_classifier_layers (or something like that).

Qwlouse commented 8 years ago

Yay! Issue hijacking! :tada:

Ok, I'll fix that. I've also just added a SigmoidCE layer and did lots of copy & pasting for the tool. Maybe it's time to unify the three tools we have...

flukeskywalker commented 8 years ago

Transition to NHWC has been finished and merged into master. NCHW is unsupported after the 0.4 release.

Support for cuDNN will return in a future release after cuDNN v4 is released with full support for NHWC.

The 0.4 release can be used if one wishes to use cuDNN v3 with brainstorm in the NCHW format.