
A specialized Winograd Conv2d op #971

Open bssrdf opened 1 month ago

bssrdf commented 1 month ago

This PR adds a new conv2d op using the Winograd algorithm.

Currently, ggml's conv2d operator uses im2col followed by GEMM. There have been efforts to speed this up with faster algorithms. Winograd is one such method, used by many neural network libraries, e.g. cuDNN. For small kernels, e.g. 3x3, Winograd outperforms GEMM-based methods. However, an efficient Winograd implementation on GPUs requires significant engineering effort. This PR's Winograd implementation specializes in several ways (a minimal sketch of the underlying transform is given further below):

Other features:

It is mainly used for applications such as stable-diffusion.cpp.
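For readers unfamiliar with the method, here is a rough illustration (not the PR's CUDA kernel) of the 1-D Winograd transform F(2,3) that the 2-D F(2x2, 3x3) variant nests: two convolution outputs are computed from four inputs and a 3-tap filter with 4 multiplications instead of 6, and the filter transform can be precomputed once per filter.

```cpp
#include <cstdio>

// 1-D Winograd F(2,3): y = A^T * ((G*g) .* (B^T*d)) with the standard transform matrices.
static void winograd_f2_3(const float d[4], const float g[3], float y[2]) {
    // filter transform U = G * g (can be precomputed once per filter)
    const float u0 = g[0];
    const float u1 = 0.5f * (g[0] + g[1] + g[2]);
    const float u2 = 0.5f * (g[0] - g[1] + g[2]);
    const float u3 = g[2];
    // input transform V = B^T * d
    const float v0 = d[0] - d[2];
    const float v1 = d[1] + d[2];
    const float v2 = d[2] - d[1];
    const float v3 = d[1] - d[3];
    // elementwise product and output transform y = A^T * (U .* V): 4 multiplies for 2 outputs
    const float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;
    y[0] = m0 + m1 + m2;
    y[1] = m1 - m2 - m3;
}

int main() {
    const float d[4] = {1, 2, 3, 4};
    const float g[3] = {1, 0, -1};
    float y[2];
    winograd_f2_3(d, g, y);
    // reference: direct (valid) convolution, y[i] = sum_k d[i+k] * g[k]
    printf("winograd: %g %g, direct: %g %g\n", y[0], y[1],
           d[0]*g[0] + d[1]*g[1] + d[2]*g[2], d[1]*g[0] + d[2]*g[1] + d[3]*g[2]);
    return 0;
}
```

In 2-D, nesting this transform turns each 3x3 filter into a 4x4 transformed tile, which is why the transformed weight buffer discussed later in this thread is C\*K\*4\*4.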

The code is based on the openCNN project, which uses the Apache-2.0 license.

Please review and let me know of any problems; I'll address them. Thanks.

JohannesGaessler commented 1 month ago

Have you done any tests regarding performance? This code does not use tensor cores at all so intuitively I would expect it to be slower than im2col + GEMM with tensor cores.

bssrdf commented 1 month ago

Have you done any tests regarding performance? This code does not use tensor cores at all so intuitively I would expect it to be slower than im2col + GEMM with tensor cores.

Thank you for your review, @JohannesGaessler. I learned a lot from your PRs and comments.

First, I have asked openCNN's author about the license issue.

As to performance, I have only tested it in SD.cpp, since it was developed for that. Surprisingly, it is not faster than im2col+GEMM with tensor cores (my GPU has them, so I assume they are being used), but it is definitely not slower either. It reduces the memory used by the VAE quite a lot while increasing the UNET parameter buffer. There is room to further improve its performance, as I see several places that are not working optimally.

I'll add test cases in test-backend-ops to more rigorously measure performance.

I addressed your other comments above.

slaren commented 1 month ago

Have you tried NPP? It is a library bundled with the CUDA toolkit that has all kinds of kernels for image processing. I don't think this can be merged unless the license situation is resolved.

JohannesGaessler commented 1 month ago

Generally speaking, my stance on this PR is as follows: I think it's good to have dedicated convolution operations instead of having to rely on IM2COL. At the same time, I want a codebase that is easy to maintain - a central factor for me is that there needs to be some benefit from adding code that offsets the increase in maintenance effort. Quite honestly, I think the starting point from openCNN is not very good; I would be rather hesitant to add it, since the use cases are limited and I think none of the devs on this project would have a good understanding of how the code works.

And as slaren said, the licensing issue must be resolved or this is a total non-starter anyway.

Have you tried NPP? It is a library bundled with the CUDA toolkit that has all kinds of kernels for image processing.

From what I can tell, there is convolution support.

JohannesGaessler commented 1 month ago

I'm already tired so maybe I'm just misreading the docs, but I get the impression that NPP convolutions only support 1-4 input channels.

bssrdf commented 1 month ago

Thanks to both of you for reviewing. I am not familiar with licensing matters; if the issue is not resolvable, I'll ditch this PR. For now I am putting it in draft mode, hoping to make it work in more general settings so that it truly serves as an alternative to the im2col approach.

JohannesGaessler commented 1 month ago

Also one important question that I forgot to ask: are you going to be available long-term to maintain this code?

bssrdf commented 1 month ago

Also one important question that I forgot to ask: are you going to be available long-term to maintain this code?

If this PR makes it into main, I intend to maintain it long-term and improve its performance.

bssrdf commented 1 month ago

I am getting some puzzling benchmark results. @JohannesGaessler or @slaren, could you help?

I added a conv2d and a winograd case in test-backend-ops, using its eval_perf function to gauge the performance.

Winograd is consistently 10x slower than conv2d (IM2COL + GEMM). However, the same Winograd kernel in stable-diffusion.cpp improves the running time by 7-8% for the UNET and almost 50% for the VAE, so I am confused.

What's even more interesting is that, for the same input size, IM2COL by itself runs slower than conv2d, which has a GEMM after the IM2COL.

Maybe the way I tested is not proper for conv2d since it is a composite operator unlike others?

slaren commented 1 month ago

Maybe the way I tested is not proper for conv2d since it is a composite operator unlike others?

Yes, this is the reason. Perf mode works by running only the last node of the graph multiple times; everything before that is assumed to be setup for the op and is run only once.

bssrdf commented 1 month ago

Maybe the way I tested is not proper for conv2d since it is a composite operator unlike others?

Yes, this is the reason. Perf mode works by running only the last node of the graph multiple times; everything before that is assumed to be setup for the op and is run only once.

Wow, now I see. It turns out the conv2d case is just testing the cuda_dup op, which is the last node. I need to find another way to compare. Thanks.
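For reference, a minimal sketch of what a standalone perf case could look like (the names follow ggml's public API, but this is an illustration rather than the actual test-backend-ops code). The point is that the op being measured has to be the last node of the graph, whereas ggml_conv_2d expands into roughly im2col -> mul_mat -> reshape/permute/cont, so only the trailing cont/dup kernel gets timed:

```c
#include "ggml.h"

// Build a graph whose last node is the op we want to time (here: im2col alone).
// In perf mode only this last node is run repeatedly, so it is the one measured.
static struct ggml_cgraph * build_im2col_perf_graph(struct ggml_context * ctx,
                                                    struct ggml_tensor  * kernel,  // [KW, KH, IC, OC]
                                                    struct ggml_tensor  * input) { // [W,  H,  IC, N]
    struct ggml_tensor * cols = ggml_im2col(ctx, kernel, input,
                                            /*s0*/ 1, /*s1*/ 1,
                                            /*p0*/ 1, /*p1*/ 1,
                                            /*d0*/ 1, /*d1*/ 1,
                                            /*is_2D*/ true, GGML_TYPE_F16);
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, cols);
    return gf;
}
```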

Green-Sky commented 1 month ago

@bssrdf just took your sd.cpp fork for a spin, and I like the numbers. For Flux's VAE, it roughly halves VRAM usage and gives a small speed bump. Before: 94+1664 MB, ~0.75s; after: 258+704 MB, ~0.53s.

With SD2, the UNET changes seem to make the model use almost double the VRAM? The VAE wins here again, though (almost the same values as Flux's VAE).

Images were 512x512.

bssrdf commented 1 month ago

@bssrdf just took your sd.cpp fork for a spin, and I like the numbers. For Flux's VAE, it roughly halves VRAM usage and gives a small speed bump. Before: 94+1664 MB, ~0.75s; after: 258+704 MB, ~0.53s.

With SD2, the UNET changes seem to make the model use almost double the VRAM? The VAE wins here again, though (almost the same values as Flux's VAE).

Images were 512x512.

@Green-Sky, thanks for trying it out. Right now, Winograd is not ready for SD.cpp because of the duplicated UNET and VAE parameter buffers. My Winograd implementation does the filter weight transform in a separate kernel (inherited from openCNN), so it needs a buffer for that. I added the filter transform to the initialization stage, so during UNET denoising this step can be skipped (the buffer is just reused). This is one of the benefits of using Winograd (a small time saving, though). However, the original, untransformed filter weights still occupy VRAM, because they are allocated earlier and there is no mechanism in GGML to release them. I have some ideas to fix this so that no VRAM duplication happens; but the parameter buffers for the UNET and VAE will still increase in size, because the transformed weights are C*K*4*4 while the untransformed ones are C*K*3*3. The compute buffer sizes are reduced quite a lot, as you found out, because there is no IM2COL. In my tests, the VAE runs twice as fast. If you, @FSSRepo or @leejet could help with this, that would be great.
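To put rough numbers on that (illustrative only; assuming FP16 weights and a hypothetical 3x3 layer with C = K = 512): the untransformed weights take 512·512·3·3·2 bytes ≈ 4.5 MiB, while the 4x4-transformed weights take 512·512·4·4·2 bytes = 8 MiB, i.e. a 16/9 ≈ 1.8x increase per conv layer, on top of any duplication.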

In the meantime, I am adding tensor core support to make it even faster. I think it could be at least 50% faster once the tensor cores are used.

FSSRepo commented 1 month ago

Hello, good luck trying to use the tensor cores to get Winograd working; I think it's possible. I saw an implementation in Tinygrad that used matrix multiplication. I also found a Winograd implementation that omitted redundant calculations, but for that same reason it couldn't use the tensor cores. In the end, I couldn't improve the performance, so I decided it was better to merge im2col and GEMM (which reduces memory consumption by 50%, but is 40% slower in the VAE phase, which is the part that uses the most memory).

bssrdf commented 1 month ago

Hello, good luck trying to use the tensor cores to get Winograd working; I think it's possible. I saw an implementation in Tinygrad that used matrix multiplication. I also found a Winograd implementation that omitted redundant calculations, but for that same reason it couldn't use the tensor cores. In the end, I couldn't improve the performance, so I decided it was better to merge im2col and GEMM (which reduces memory consumption by 50%, but is 40% slower in the VAE phase, which is the part that uses the most memory).

Thanks for the comments. Right now, using tensor cores, my Winograd kernel outperforms cuDNN's Winograd (which does not use tensor cores) by 15% for certain inputs (large input sizes with many channels and filters). Dealing with the bank conflicts is a real pain, and I had to program in PTX :smile:. For such large inputs, the king is cuDNN's IMPLICIT_PRECOMP_GEMM using tensor cores; it is 50% faster than my Winograd. I hope to get a further performance boost once the final "Uncoalesced Shared Accesses" issue is resolved.

JohannesGaessler commented 1 month ago

Be aware that if you are using the nvcuda::wmma interface the data layout and memory access pattern are not defined. So your code could end up having bad performance on CUDA architectures that you are not testing. For FP16 the memory access pattern for all GPUs ranging from Turing to Ada Lovelace should be the same though. My recommendation would be to write simple primitives that wrap the tensor core PTX instructions instead. I did an implementation of this for int8 in mma.cuh. This would then give you a defined memory layout and enable you to manipulate the data without having to go through shared memory.

Edit: judging by the edit to your previous comment you are already using PTX.
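For illustration, here is a minimal wrapper in the spirit of mma.cuh, but for FP16 with FP32 accumulation rather than int8 (the m16n8k16 shape requires sm_80 or newer; the function name and fragment layout are an example, not the PR's or ggml's actual code):

```cuda
// Illustrative sketch: wrap a single FP16 tensor core MMA PTX instruction so the
// register layout is defined by the PTX ISA rather than by nvcuda::wmma.
// D(16x8, f32) = A(16x16, f16, row) * B(16x8, f16, col) + C(16x8, f32); requires sm_80+.
// Each unsigned register holds two packed half values.
static __device__ __forceinline__ void mma_m16n8k16_f16f32(
        float (&d)[4], const unsigned (&a)[4], const unsigned (&b)[2], const float (&c)[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```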

JohannesGaessler commented 1 month ago

For avoiding shared memory bank conflicts with tensor cores, consider padding the data. For the int8 tensor cores I found that a padding of 16 bytes between rows/columns avoids shared memory bank conflicts when loading data from shared memory for use with tensor cores.
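A minimal sketch of the padding idea (the tile sizes, the kernel name, and the 16-byte pad are assumptions carried over from the int8 case above, not tuned values for this kernel):

```cuda
#include <cuda_fp16.h>

constexpr int TILE_M   = 64;
constexpr int TILE_K   = 64;
constexpr int SMEM_PAD = 16 / sizeof(half); // 16 bytes of padding = 8 half elements

__global__ void winograd_tile_example() {
    // Padding each row shifts consecutive rows into different banks, so the strided
    // accesses used to fill tensor core fragments no longer hit the same bank.
    __shared__ half tile[TILE_M][TILE_K + SMEM_PAD];
    (void) tile; // placeholder: fill the tile and feed it to the MMA fragments here
}
```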

bssrdf commented 1 month ago

For avoiding shared memory bank conflicts with tensor cores, consider padding the data. For the int8 tensor cores I found that a padding of 16 bytes between rows/columns avoids shared memory bank conflicts when loading data from shared memory for use with tensor cores.

Thanks for the tips, @JohannesGaessler. Yes, I used padding in several places and they really helped with resolving conflicts. I also played around with swizzling but in the end couldn't get it working for the smem layout.

JohannesGaessler commented 1 month ago

Also take a look at the ldmatrix instruction. While it doesn't reduce shared memory bank conflicts, it does help by issuing fewer instructions, which helped with one of the throttle reasons (I forgot which one). But I didn't find it to be universally better than just loading the data as 4 bit values from shared memory.
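A sketch of wrapping ldmatrix (sm_75+); each of the 32 threads in the warp passes the shared-memory address of one 8-element b16 row, and four 8x8 fragments are filled per instruction. The wrapper name is illustrative:

```cuda
#include <cstdint>

static __device__ __forceinline__ void ldmatrix_x4(uint32_t (&r)[4], const void * smem_row) {
    // Convert the generic pointer to a shared-memory address as required by the PTX ISA.
    const uint32_t addr = (uint32_t) __cvta_generic_to_shared(smem_row);
    asm volatile("ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];"
                 : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3])
                 : "r"(addr));
}
```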

bssrdf commented 1 month ago

Also take a look at the ldmatrix instruction. While it doesn't reduce shared memory bank conflicts, it does help by issuing fewer instructions, which helped with one of the throttle reasons (I forgot which one). But I didn't find it to be universally better than just loading the data as 4 bit values from shared memory.

I've seen people use ldmatrix in some GEMM kernels without it helping much, so I didn't bother. Good to know, though.

FSSRepo commented 1 month ago

@JohannesGaessler I know this is off topic, but I'm compiling the latest version of stable-diffusion.cpp and it now takes more than 25 minutes to compile the CUDA code. Four months ago (I've been busy trying to figure out what to do with my life), it took 5 minutes or less, and the binaries are now much bigger.

JohannesGaessler commented 1 month ago

That's probably due to the MMQ changes. For development builds you can edit src/CMakeLists.txt and set CMAKE_CUDA_ARCHITECTURES to only the one that you're actually using.
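For example (the architecture value is just an illustration; use your own GPU's compute capability): `set(CMAKE_CUDA_ARCHITECTURES "86")` in src/CMakeLists.txt, or pass `-DCMAKE_CUDA_ARCHITECTURES=86` when configuring with CMake, so that only one architecture is compiled instead of the full list.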