NervanaSystems / neon

Intel® Nervana™ reference deep learning framework committed to best performance on all hardware
http://neon.nervanasys.com/docs/latest
Apache License 2.0

Grouped convolutions #296

Open ibmua opened 8 years ago

ibmua commented 8 years ago

Would you consider adding grouped-convolution capability to your convs? Structures like ResNets can benefit hugely from it GPU-computation-wise: by moving some of their depth into grouped width they become much more parallelizable without any serious loss in learning potential. As far as I understand, it would be a very slight and easy change for a person familiar with the code. cuDNN in Torch and Caffe already have this feature, though it's implemented relatively poorly, as far as my performance checks go.

thestew42 commented 8 years ago

Are you referring to the Caffe feature described here (http://caffe.berkeleyvision.org/doxygen/classcaffe_1_1ConvolutionLayer.html) under the 'group' parameter?

My understanding is that the input channels, filters, and output channels are divided into groups that are not connected to each other. This would probably involve some change in the kernel parameters and in the memory indexing for inputs/filters/outputs. While these sound like minor changes, they could require more significant work: since most of our convolution kernels are written in assembly, we may have to change register allocation and instruction ordering.
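In other words, something like this naive reference loop, if I understand the semantics correctly (purely illustrative host code with made-up names and an NCHW / stride-1 / no-padding assumption, not how our kernels are actually written):

```
#include <cstddef>

// Naive reference for a grouped convolution forward pass (NCHW, stride 1, no padding).
// C input channels and K output channels are split into `groups` slices; output
// channel k only reads the C/groups input channels belonging to its own group.
void grouped_conv_reference(const float* in,  // [N][C][H][W]
                            const float* flt, // [K][C/groups][R][S]
                            float* out,       // [N][K][P][Q] with P = H-R+1, Q = W-S+1
                            int N, int C, int H, int W,
                            int K, int R, int S, int groups)
{
    const int Cg = C / groups, Kg = K / groups;
    const int P = H - R + 1, Q = W - S + 1;
    for (int n = 0; n < N; ++n)
    for (int k = 0; k < K; ++k) {
        const int g = k / Kg;                  // group this output channel belongs to
        for (int p = 0; p < P; ++p)
        for (int q = 0; q < Q; ++q) {
            float acc = 0.f;
            for (int c = 0; c < Cg; ++c)       // only this group's input channels
            for (int r = 0; r < R; ++r)
            for (int s = 0; s < S; ++s) {
                const float x = in [((size_t)n * C + g * Cg + c) * H * W + (size_t)(p + r) * W + (q + s)];
                const float w = flt[((size_t)k * Cg + c) * R * S + (size_t)r * S + s];
                acc += x * w;
            }
            out[((size_t)n * K + k) * P * Q + (size_t)p * Q + q] = acc;
        }
    }
}
```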

I would expect this to improve performance in cases where we don't have enough CUDA blocks to fill all SMs. This would be likely in layers where the size of the feature map is very small. In cases with medium-large feature maps, we will almost certainly have enough CUDA blocks to fill the SMs at any given time. This kind of optimization (at least the simple route I explained) will not improve performance in cases with a small number of feature maps. If you could provide profiling data and layer parameters to show the potential performance improvements, it would make it easier for us to make a decision on this feature.

Thanks, -Stewart

ibmua commented 8 years ago

Yes, it's that.

Something like

[input, output] += blockIdx.z * [input, output]_STRIDE (<---- "define" compiled into the kernel)

if compiled with groups > 1 is what comes to my mind. But that's naive C, and it also assumes that blockIdx.z is not already used, which I guess may be wrong. So it would be something more like blockIdx.z % max_already_used, though not naively so, since % is generally not the fastest operation on CUDA, as far as I know, when max_already_used != 2^n. But generally speaking it should be just as simple even in SASS, I can't imagine why not. Just stride the input and output pointers by some value derived from blockIdx; that's the whole trick.
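Roughly, as a minimal CUDA sketch of what I mean (assuming blockIdx.z is otherwise free; GROUPS and the *_GROUP_STRIDE defines are hypothetical values baked in at compile time, not anything neon actually has):

```
// Hypothetical compile-time constants for a kernel built with groups > 1.
#define GROUPS 4
#define INPUT_GROUP_STRIDE  (64 * 32 * 32)   // e.g. (C / GROUPS) * H * W elements per input group
#define OUTPUT_GROUP_STRIDE (64 * 30 * 30)   // e.g. (K / GROUPS) * P * Q elements per output group

__global__ void conv_fprop(const float* input, float* output /*, ...filter args... */)
{
#if GROUPS > 1
    // gridDim.z indexes the group: each block slides its input/output pointers
    // onto its own group's slice, and the rest of the kernel stays unchanged.
    input  += (size_t)blockIdx.z * INPUT_GROUP_STRIDE;
    output += (size_t)blockIdx.z * OUTPUT_GROUP_STRIDE;
#endif
    // ... existing ungrouped convolution code would follow here ...
    (void)input; (void)output;
}
```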

I don't know if I'm correct, but as far as I understand, neon's layers get compiled with several versions of the parameters (e.g. how to partition the threads), the versions are benchmarked, and the fastest one is taken, right? And while you can fill all, or at least a large part of, the SMs with threads and compute faster than without such smart partitioning, most likely this speed boost comes with efficiency sacrifices.

If you can elaborate on what kind of data I should provide, I'll be very happy to help you help me. =) If it's about examples of the kinds of convolutions one would compute with this: I think a fast implementation of grouped convolutions where one could partition convs into something like 100-10,000 blocks (this is extremely slow in all current implementations) could lead to discoveries of many more places where this can be used.

But to state some relatively obvious cases without going far into research space: one very obvious use case is an ensemble-style microstructure where one uses a stack of grouped convs and then merges their results; one obviously workable type of merge is Max, or Avg, over the same feature index in each group. That's just a drop-in replacement for ordinary stacks of layers, like a 3x3 conv following a 3x3 conv: you make the first one N times wider and the next one N-grouped, then merge the results of the groups and, presumably, get a better result and much faster training (yeah, I know, "why the faster training?", but it actually works like that). Also, one could change the structure of ResNet-type residuals to be more group-wide like that and less serial if that optimized calculation speed, without much loss in accuracy. I'm currently running some tests on that using WideResNet https://github.com/szagoruyko/wide-residual-networks (Torch+cuDNN) and will let you know how good of an idea the ResNet one actually was.

Torch's cuDNN binding has groups, but it handles them very poorly, especially with larger numbers of groups. For example, without any fancy mathematical optimizations I was able to implement fully-grouped convolutions in CUDA that are 20x faster than cuDNN v5.1's, and for many cases my CPU implementation is as fast as cuDNN's. Caffe's groups, on the other hand, performed extremely poorly last time I checked. So my hope lies with neon.

Also, one other meaningful parameter for grouped convs I can imagine is whether to output the resulting feature maps as 1111222233334444 (the usual layout) or as 1234123412341234. That would make it possible to build less isolated stacks of grouped layers. How useful that is remains to be discovered; it's something one has to implement first to try. But another place where it should definitely be useful is in letting conventional volumetric pooling, like Torch's https://github.com/torch/nn/blob/master/doc/convolution.md#nn.VolumetricAveragePooling , pool the results of the groups together. Provided someone also adds such a layer to neon, of course. =)
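For concreteness, the two orderings only differ in how the absolute output-channel index is computed from the group g and the within-group channel c (a tiny sketch with made-up helper names):

```
// G groups of Kg output channels each:
//   blocked     1111 2222 3333 4444 : channel index = g * Kg + c
//   interleaved 1234 1234 1234 1234 : channel index = c * G  + g
__host__ __device__ inline int blocked_channel(int g, int c, int Kg)    { return g * Kg + c; }
__host__ __device__ inline int interleaved_channel(int g, int c, int G) { return c * G + g; }
```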

ibmua commented 8 years ago

So I ran a test with the original WideResNet 2-40 and a modified version that has 3 residuals instead of 6, but has them in 2x groups combined with a 1x1 layer, because, sadly, I couldn't find an easy way to combine groups in Torch via Avg or Max. The results were actually better for the grouped one. But there was a serious problem: I didn't use weight decay, so I'm now retesting to get better results out of both. The expected results for the original are reported in their publication, and I want to first squeeze such results out of the trainer and then train the grouped variant with the same hyper-parameters. Also, the lack of weight decay may have been the main driver of the aforementioned "speedup in training" in terms of epochs until convergence.. =/

thestew42 commented 8 years ago

Yes, the changes are algorithmically simple, and representing the logic in SASS is likely not a big deal. However, our kernels use the available registers very carefully, and in many cases we don't have extra registers to spare without changing lots of code. I would expect support for this, with full testing, to require about a week. To give you an idea: we would need to either allocate a register to store this group offset or compute it on the fly each time an address is needed (which requires a temporary register and preferably a shift operation, since mod is somewhat complicated in SASS). Additionally, we have numerous convolution kernels which would each need to be updated, taking their individual properties into consideration.

In terms of profiling data, it would be good if you have a neon network that you can run with NVIDIA's profiler (nvprof) for several iterations to show that the SMs are indeed not full for certain layers. This would give us a good idea of what kind of workload is problematic and how these kernel changes would improve efficiency before we invest the effort in changing the kernels.

This sounds like an interesting feature that I would like to work on at some point, but my plate is full right now with our upcoming graph backend. I probably won't have time to work on this any time soon, but we welcome contributions from the community if you would like to work on these changes. I would be happy to address any questions you have along the way.

-Stewart

ibmua commented 8 years ago

So I ran my experiment and got to ~25% error so far on that test (epoch 390, LR ~0.002 at the moment), which is better than the published 26% for the usual network. Though I'm using an [LR decay 0.98 / 2 epochs] approach and not steps like the original authors.

Okay, so I had no idea you guys pushed optimizations THAT far. I'll try exploring how to profile stuff with nvprof; so far I've only profiled in Nsight. =) Will write back when I get some data.

ibmua commented 8 years ago

Regarding contributions: IMHO, neon could become the framework of choice for people exploring new concepts that require writing custom CUDA code, thanks to PyCUDA and its overall cleanliness compared to Torch/TensorFlow/Caffe. But you guys need to work a bit on clearly explaining how to extend its different parts, with some rather simple, non-cryptic, non-SASS examples.

ibmua commented 8 years ago

I may be wrong, because I'm not into SASS, but as far as I understand, all you have to do is a 64-bit add to the register that holds the pointer to the output, and likewise for the input, and you're done, no? Probably something like a fused multiply-add of a constant with blockIdx.z, which is held in some special register as far as I know. Something like that. All done at the very beginning, so any registers can be fully reused afterwards, and it shouldn't really touch any code further down the line. Unless there's some trouble with the register for blockIdx.z, e.g. if it's used later in the code and is read-only, in which case you'd have to allocate another register to hold the value currently in blockIdx.z.

ibmua commented 8 years ago

If it's far from being that easy, I imagine one could just launch Gnum kernels consecutively to get Gnum groups and only sync after they're all completed, though that comes at a price: from what I've read, each kernel launch takes around 10 microseconds, and probably more if the launch code is written in Python.

thestew42 commented 8 years ago

> I may be wrong, because I'm not into SASS, but as far as I understand, all you have to do is a 64-bit add to the register that holds the pointer to the output, and likewise for the input, and you're done, no? Probably something like a fused multiply-add of a constant with blockIdx.z, which is held in some special register as far as I know. Something like that. All done at the very beginning, so any registers can be fully reused afterwards.

Yes, that would be the idea; however, we don't store the pointers in registers. One trick we use is computing pointers on the fly from constants. After a cursory glance, though, it looks like our direct convolution kernels do store an offset for inputs/filters in a 32-bit register, which could likely be used for this group offset. Also, we already use blockIdx.z, so this would require some kind of packing. Currently most of our packing is done with magic-number division (where we have constants for magic and shift and the formula is something like blockIdx.z * magic >> shift), so we would need to add these parameters to the kernel arguments and the generation code to the Python side. As for the Winograd kernels, I'm not sure something this simple is possible; I would need to spend some time looking at them.
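To make the packing idea concrete, a rough sketch of what I have in mind (this is not our actual kernel or generation code; the names are made up):

```
#include <cstdint>

// Host side: find (magic, shift) so that n / d == (n * magic) >> shift for all
// 0 <= n <= nmax, so the kernel never executes an integer divide (assumes d <= nmax + 1).
static bool find_magic_u32(uint32_t nmax, uint32_t d, uint32_t* magic, uint32_t* shift)
{
    const uint64_t nc = ((uint64_t(nmax) + 1) / d) * d - 1;
    for (uint32_t p = 0; p < 64; ++p) {
        const uint64_t two_p = 1ull << p;
        if (two_p > nc * (d - 1 - (two_p - 1) % d)) {
            *magic = uint32_t((two_p + d - 1 - (two_p - 1) % d) / d);
            *shift = p;
            return true;
        }
    }
    return false;
}

// Device side: blockIdx.z carries packed = group * zdim + z. One wide multiply and
// a shift recover the group, which then offsets the input/output pointers.
__global__ void grouped_conv_stub(const float* input, float* output,
                                  uint32_t magic, uint32_t shift, uint32_t zdim,
                                  size_t in_group_stride, size_t out_group_stride)
{
    const uint32_t packed = blockIdx.z;
    const uint32_t group  = uint32_t((uint64_t(packed) * magic) >> shift);  // packed / zdim
    const uint32_t z      = packed - group * zdim;                          // packed % zdim
    input  += (size_t)group * in_group_stride;
    output += (size_t)group * out_group_stride;
    // ... the existing body would use `z` wherever it used blockIdx.z before ...
    (void)input; (void)output; (void)z;
}
```

The (magic, shift) pair would be computed once on the Python side and passed in as kernel arguments, as described above.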

> If it's far from being that easy, I imagine one could just launch Gnum kernels consecutively to get Gnum groups and only sync after they're all completed, though that comes at a price: from what I've read, each kernel launch takes around 10 microseconds, and probably more if the launch code is written in Python.

Yes, you could do this as long as the kernels were launched in separate CUDA streams. The launch time is a fixed overhead, so this could be nearly as efficient as the proposed kernel changes for longer-running cases. However, for small layers, this overhead may erase any performance gains from using multiple streams/kernels. This is one reason why profiling data would be helpful: if 'x' microseconds is negligible for the cases where this optimization would help, it may be simpler to just change how the kernels are launched.
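Roughly, the multi-stream route would look something like this on the host (launch_conv_for_group is a stand-in for whatever per-group launch already exists, not a real neon call):

```
#include <cuda_runtime.h>
#include <vector>

// Run one convolution kernel per group in its own stream; synchronize once at the end.
void grouped_conv_via_streams(int num_groups,
                              void (*launch_conv_for_group)(int group, cudaStream_t stream))
{
    std::vector<cudaStream_t> streams(num_groups);
    for (int g = 0; g < num_groups; ++g)
        cudaStreamCreate(&streams[g]);

    // Launches are asynchronous, so groups can overlap on the GPU whenever a
    // single group's kernel leaves SMs idle.
    for (int g = 0; g < num_groups; ++g)
        launch_conv_for_group(g, streams[g]);

    for (int g = 0; g < num_groups; ++g) {
        cudaStreamSynchronize(streams[g]);
        cudaStreamDestroy(streams[g]);
    }
}
```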

oleg-trott commented 8 years ago

https://arxiv.org/abs/1605.06489 seems to be the reference for this.

The fully-connected version of such grouping would use gemmBatched http://docs.nvidia.com/cuda/cublas/#cublas-lt-t-gt-gemmbatched

NervanaGPU also has batched_dot, but unlike the cuBLAS version, it requires one of its operands to be shared by all the dot products. So as far as I can tell, batched_dot cannot be used here.
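For illustration, a grouped fully-connected forward pass via gemmBatched could look roughly like this (column-major layout, contiguous per-group blocks, and all names here are assumptions, not an existing wrapper):

```
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Each group g computes Y_g = W_g * X_g as one entry of the batched GEMM.
void grouped_fc_forward(cublasHandle_t handle,
                        const float* d_W,   // groups blocks of [out_per_group x in_per_group]
                        const float* d_X,   // groups blocks of [in_per_group  x batch]
                        float*       d_Y,   // groups blocks of [out_per_group x batch]
                        int groups, int out_per_group, int in_per_group, int batch)
{
    std::vector<const float*> hA(groups), hB(groups);
    std::vector<float*>       hC(groups);
    for (int g = 0; g < groups; ++g) {
        hA[g] = d_W + (size_t)g * out_per_group * in_per_group;
        hB[g] = d_X + (size_t)g * in_per_group  * batch;
        hC[g] = d_Y + (size_t)g * out_per_group * batch;
    }

    // cublasSgemmBatched expects the arrays of per-matrix pointers to live in device memory.
    const float **dA, **dB; float **dC;
    cudaMalloc((void**)&dA, groups * sizeof(float*));
    cudaMalloc((void**)&dB, groups * sizeof(float*));
    cudaMalloc((void**)&dC, groups * sizeof(float*));
    cudaMemcpy(dA, hA.data(), groups * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), groups * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), groups * sizeof(float*), cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    // Per group (column-major): C(m x n) = A(m x k) * B(k x n)
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       out_per_group, batch, in_per_group,
                       &alpha, dA, out_per_group,
                               dB, in_per_group,
                       &beta,  dC, out_per_group,
                       groups);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```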

(By the way, I'm very surprised that Intel seems to be continuing to invest in open-source Nvidia kernels. What's in it for them?)

ibmua commented 8 years ago

@oleg-trott OMFG! That's exactly what I've been researching for the last month! I looked for similar papers a month ago but couldn't find any. I used these things half a year ago in my project, with the poor Caffe implementation of groups, and they gave a pretty large accuracy increase, although speed suffered (due to how badly groups are implemented in Caffe), so I recently decided to experiment with them on benchmark datasets. Because, obviously, they would also be useful to others. And to try them on engines that compute them more efficiently, so I can try larger groupings and wider structures. Thanks a lot! Now I know my ideas were correct. =)

One of the reasons why I tried running with a very small weight decay, and why I said that they learn faster (most likely because of the lack of said decay), is that the highly-grouped networks I was previously running didn't seem to benefit much from weight decay.

ibmua commented 8 years ago

So now that it's clear how important groups are, I hope Neon devs will prioritize this higher.

ibmua commented 8 years ago

https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua So the way Torch's cudnn module handles groups is actually the same as I described: launching a few kernels consecutively. So it turns out that wasn't a very good idea after all; it works well only for small numbers of groups, and we need fast 64+ groups. If you write that in Python it will work even worse than in Torch, because CPython is an extremely slow interpreter while Torch's LuaJIT is a fast JIT.

ibmua commented 8 years ago

@oleg-trott To my understanding, neither gemm nor dot is any good for convolutions, especially on GPU (though I may be wrong, in that the gap may be even worse on CPU versus more optimal approaches than it is on GPU), and they're only used by people who know too much linear algebra and are too lazy to learn CUDA. Luckily, neon's implementations are built purely and optimally for convs on the GPU.

oleg-trott commented 8 years ago

cuDNN includes "im2col" among its many algorithm choices, but I was just talking about fully-connected layers, which are an important special case of conv ones.

Channel grouping such as https://arxiv.org/abs/1605.06489 can obviously be applied there too.

gemmBatched has other uses too. I'll add a separate feature request.

ibmua commented 8 years ago

Regarding profiling: with these guys in the paper cutting convs into 64 groups without even making them wider, I guess no more profiling is required, right? It seems fairly obvious that you need to change the kernel to do that right.

ibmua commented 8 years ago

@oleg-trott If by other uses you mean for example grouped linear, I'm totally subscribing. =)

Regarding im2col, yeah, that's what I'm talking about. When I was searching for a framework with clear code where I could easily implement a grouped kernel myself, I bumped into plenty of these im2cols in all of the different frameworks. Still haven't found a framework with clear code, btw. At one point I decided to make my own, and after thinking and researching a bit about what to use and how to implement it, I ended up deciding to build it around compiling optimized kernels with PyCUDA from Python: because a CPU implementation for training would be an unnecessary waste of time, and because directly interfacing C++/CUDA code from any high-level language is pure hell, even more so if you also want to integrate it with Python's NumPy. So in the end it looked as if I had made all of the same decisions as the Nervana guys, and that's basically why I'm here. That's why I said above that Nervana should seriously consider writing a very clear guide on tinkering with the source. They've already made all of the decisions needed to become the primary research engine for NNs, and they're the only serious ones who did.

So, as I was saying about im2col: I really struggled to understand what it means at first. I did know CUDA, and it looked too awful to be true. I looked it up on the internet, but it didn't fit into my head that people could seriously consider implementing such a thing for conv computation in serious frameworks. But this post https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/ convinced me they actually do. Luckily, even the old cuDNN circa 2014 didn't use it http://arxiv.org/pdf/1410.0759.pdf , though they may have gone a fancier and less productive way than the plain, obvious convolutional one, and they acknowledge that, showing around 30% worse efficiency than Alex's plain conv; theirs seemed to perform better for smaller batches, though, which isn't too popular now anyway, although smaller batches do give better results.

But yeah, if you're talking about linear layers, BLAS is probably the way to go. IMHO, it would be something you could easily code yourself instead of requesting, if only there were good docs on contributing. All of these frameworks are open source, but when you actually open the source you instantly know you can't change anything. There's a serious lack of an easily editable framework. And that one would be the winner, because NNs are constantly evolving and you either have to keep up with that, throwing 20 hours into what could be done in some 5 lines of code if your code were not a candidate for the IOCCC (yes, I'm referring to the current issue), or you can be the one to help with the evolving: be the engine for NN research, get implementations straight from the authors without any effort on your part, and be the framework cited in publications.

No, I understand that neon's SASS kernels with groups will be faster than C++ kernels with groups. It's just that if the original authors of the paper had had access to C++ kernels when they wrote it, they could have meddled with the code a bit, changed some 5-10 lines, and gotten groups working; that would have helped them a lot and could have led them in different directions. Also, the numbers would have been much different. If they hadn't had something like a hundred grand worth of gear, which they luckily seem to have had, it could have been a question of life and death for their research.

ibmua commented 8 years ago

On the other hand, for a CPU, im2col might be relatively appropriate for non-grouped or lightly grouped convs, at least for large feature map sizes. Needs testing and comparison to say for sure.

ibmua commented 8 years ago

@oleg-trott Regarding an Intel company making optimal kernels: one funny reason to do that would be to cut demand for more and higher-margin GPUs by making fewer, cheaper ones "good enough". Currently there's a certain wall on what performance boost you can get with a bigger model, and there's also a wall on how large a model put into production can be.

ibmua commented 8 years ago

Turns out grouped convs are available in TensorFlow as depthwise_conv2d https://www.tensorflow.org/versions/r0.10/api_docs/python/nn.html#depthwise_conv2d . Fully-grouped only, though.

ibmua commented 7 years ago

https://arxiv.org/pdf/1611.05431.pdf So grouped convs are now officially the next step in ResNet.

revilokeb commented 7 years ago

@ibmua do you happen to have a comparison of which frameworks currently support grouped convs efficiently?

ibmua commented 7 years ago

@revilokeb Sadly, I'm guessing the answer lies somewhere between NaN, undefined, {} and []. The NVIDIA guys emailed me that "This feature has been on our radar and we'll be introducing it in a future release of cuDNN.", but I'm not 100% sure it's actually going to be part of cuDNN v7.

revilokeb commented 7 years ago

@ibmua :-) ok, interesting, thanks!! Based on benchmarks of MobileNet in TF here, it seems their implementation of grouped convs is probably at least superior to current PyTorch's...

ibmua commented 7 years ago

@revilokeb As for CPU implementations, I don't know of any good ones in C/C++ either, though it should probably take no more than a few days for someone who knows how to integrate such a module into an existing framework. I tried doing this once and did make a very efficient C++ implementation of fully-grouped convs, but the process of integrating it with any existing framework was too opaque for me. Also, going fully-grouped (the maximum number of groups) is probably not the best idea.

futurely commented 7 years ago

What’s New in cuDNN 7? Grouped Convolutions for models such as ResNeXt and Xception and CTC (Connectionist Temporal Classification) loss layer for temporal classification https://developer.nvidia.com/cudnn