smeso opened this pull request 2 months ago
I only added the `IM2COL_BACK` operator in order to enable training of the convolutional network in the MNIST example. If you can make that work with your `COL2IM` implementation I have no issue whatsoever with removing my code again.
> Implementing `conv_transpose_1d` via `mulmat`+`col2im` uses more memory but it can be ~80% faster (based on a very simple benchmark I did) than the current approach when the input data is relatively big.
The only reason why this is currently faster is a lack of optimization in the convolution kernels, while a comparatively large amount of work has gone into optimizing general matrix multiplication. And I'm not convinced that the code in this PR is necessarily an upgrade over the code on master, because the extra memory needed for `IM2COL`/`COL2IM` is quite substantial, and for I/O-bound applications the speed will likely be worse as well. I currently do not use ggml convolutions for anything though, so I personally do not care much about this. If I can get the training costs cheap enough I will maybe try to train an image model in 2025 using ggml. At that point I would likely overhaul the convolution code regardless of whether or not this PR is merged.
> While `col2im` and `im2col_back` sound similar, the code and the way they work are a bit different. They can't be used as drop-in replacements for each other, but I think they could be merged into a single function; I don't know if that makes sense though.
I think it would be better to have a single operator; `COL2IM` would be the better name, I think.
In `tests/test-grad0` there is a test for `IM2COL`. If you can make that test pass using your code then from my end it would be fine to remove my code.
> Implementing `conv_transpose_1d` via `mulmat`+`col2im` uses more memory but it can be ~80% faster (based on a very simple benchmark I did) than the current approach when the input data is relatively big.
>
> The only reason why this is currently faster is a lack of optimization in the convolution kernels, while a comparatively large amount of work has gone into optimizing general matrix multiplication.
Yes, of course. I didn't mean to imply that I did any magic trick here: `mulmat`+`col2im` is a well-established method for transposed convolutions because so many optimizations exist around GEMM. And of course, there is a memory/speed trade-off here. Also, for smaller kernels, Winograd convolution should work even better (I don't think ggml currently has this optimization).
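For illustration, here is a minimal NumPy sketch of that formulation (simplified assumptions of mine: single batch, stride 1, no padding/dilation; names and shapes are not ggml's), checked against a naive direct transposed convolution:

```python
import numpy as np

def conv_transpose_1d_direct(x, w):
    # x: (IC, IW) input, w: (IC, OC, KW) kernels; stride 1, no padding
    IC, IW = x.shape
    _, OC, KW = w.shape
    y = np.zeros((OC, IW + KW - 1))
    for ic in range(IC):
        for iw in range(IW):
            for k in range(KW):
                y[:, iw + k] += w[ic, :, k] * x[ic, iw]
    return y

def conv_transpose_1d_gemm(x, w):
    IC, IW = x.shape
    _, OC, KW = w.shape
    # GEMM: (OC*KW, IC) @ (IC, IW) -> intermediate "columns" buffer
    cols = w.reshape(IC, OC * KW).T @ x
    # col2im: scatter-add every column entry onto its output position
    y = np.zeros((OC, IW + KW - 1))
    cols = cols.reshape(OC, KW, IW)
    for k in range(KW):
        y[:, k:k + IW] += cols[:, k, :]
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))      # IC=3, IW=8
w = rng.standard_normal((3, 2, 4))   # IC=3, OC=2, KW=4
assert np.allclose(conv_transpose_1d_direct(x, w), conv_transpose_1d_gemm(x, w))
```

All of the arithmetic lands in the single matrix product, which is exactly why the existing GEMM optimizations carry over.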
> And I'm not convinced that the code in this PR is necessarily an upgrade over the code on master, because the extra memory needed for `IM2COL`/`COL2IM` is quite substantial, and for I/O-bound applications the speed will likely be worse as well.
Yes, it does use more memory: in the worst-case scenario in my tests it was ~800MB vs ~1MB. In relative terms that's a lot, but it works well for my use case (and hopefully for others as well, especially if smaller/quantized types are used). Why do you think that the speed would be worse? I'm sure the code in this PR can be improved, but it's already much faster than the code on master, and it gets faster and faster as the kernel/input size grows. Here are a couple of plots of the speed of the two versions as the kernel+input size grows. I'm sure there are better ways to benchmark these functions, but this was quick (the size scale is log2 and time is linear).
The CUDA test is much shorter (i.e. I stopped it at a much lower size than the CPU test) because the old code was so slow that it was, in fact, much slower than using the CPU.
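For a rough sense of where that extra memory goes, the dominant allocation in the GEMM formulation is the intermediate columns buffer. A hedged back-of-the-envelope in Python, with made-up shapes (not the ones from my benchmark):

```python
# Hypothetical shapes, for illustration only.
IC, OC, KW, IW, N = 256, 256, 512, 4096, 1
F32 = 4  # bytes per element

# Intermediate produced by the GEMM before col2im: (OC*KW) x (IW*N)
cols_bytes = OC * KW * IW * N * F32
# The direct kernel only ever needs the output: OC x OW, with OW ~ IW + KW - 1
out_bytes = OC * (IW + KW - 1) * F32

print(f"GEMM intermediate: {cols_bytes / 2**20:.0f} MiB")  # ~2048 MiB
print(f"direct output:     {out_bytes / 2**20:.2f} MiB")   # ~4.5 MiB
```

So the overhead is roughly a factor of $KW$ (for $IW \gg KW$): the buffer stores one partial output per kernel tap before the scatter-add collapses them.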
> While `col2im` and `im2col_back` sound similar, the code and the way they work are a bit different. They can't be used as drop-in replacements for each other, but I think they could be merged into a single function; I don't know if that makes sense though.
>
> I think it would be better to have a single operator; `COL2IM` would be the better name, I think.
After reviewing your code I realized that these two functions don't have much in common except the similar name. It should be possible to obtain one function's result from the other by multiplying/dividing (element-wise) the result by some matrix... but I don't think it's worth trying: it would likely be much worse, performance-wise, in both cases.
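To give a flavour of where such an element-wise factor comes from (my own toy illustration, not the actual ggml geometry: 1D, stride 1, no padding), composing a gather-style im2col with a scatter-add col2im returns the input scaled by how many patches each element falls into:

```python
import numpy as np

def im2col_1d(x, KW):
    # Gather the KW sliding windows: result shape (KW, IW - KW + 1)
    IW = x.shape[0]
    return np.stack([x[k:IW - KW + 1 + k] for k in range(KW)])

def col2im_1d(cols, IW):
    # Scatter-add the windows back onto a length-IW signal
    KW, L = cols.shape
    x = np.zeros(IW)
    for k in range(KW):
        x[k:k + L] += cols[k]
    return x

x = np.arange(1.0, 9.0)                            # IW = 8
KW = 3
counts = col2im_1d(np.ones((KW, 8 - KW + 1)), 8)   # patch count per position
assert np.allclose(col2im_1d(im2col_1d(x, KW), 8), x * counts)
print(counts)                                      # [1. 2. 3. 3. 3. 3. 2. 1.]
```

Dividing by `counts` would undo the overlap, which is presumably the kind of element-wise correction meant above; whether that maps cleanly onto the real operators' strides and padding is another matter.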
> Why do you think that the speed would be worse?
Assuming two square matrices, GEMM needs $O(N^3)$ compute and $O(N^2)$ I/O. GEMM is compute bound if the dimensions of all matrices are large. If one of the dimensions is very thin, however, it is instead I/O bound. It should be comparatively easier to write a convolution kernel that utilizes I/O well vs. a kernel that utilizes compute well. So for the I/O-bound case the speed difference between convolution and GEMM should be comparatively smaller (if there is one at all), while `IM2COL` also increases the I/O even further due to redundant values.
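Spelled out for a general $M \times K$ by $K \times N$ product (a rough count, assuming each matrix is read and the result written at least once):

$$\text{FLOPs} = 2MKN, \qquad \text{I/O} \geq MK + KN + MN \ \text{elements}.$$

For $M = K = N$ the ratio grows like $\tfrac{2}{3}N$, so large square GEMM is compute bound; for a thin dimension such as $N = 1$ it drops to about $2$ FLOPs per element moved, and no amount of kernel tuning makes that compute bound.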
Oh wait. I thought that your

> I'm not convinced that the code in this PR is necessarily an upgrade over the code on master

was specifically referring to something wrong with my code. Now it's clearer: you were talking about GEMM in general. Yeah, you are right in the general case, but...
In the specific case of `conv_transpose_1d`, the multiplication is performed between $(IC, KW \cdot OC)$ and $(IC, IW \cdot N)$. Now, even if you have no control over $IC$, $KW$ and $OC$, you can control the value of $IW \cdot N$ and pick one that will maintain an arithmetic intensity of $AI = 1$.
A direct approach will have a big advantage if a large padding is used: in that case it can skip a lot of calculations and memory accesses, while GEMM still has to do all of them. But with a small padding it will basically have to do the same memory accesses and operations (asymptotically), if I'm not mistaken.
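As a quick, hedged sanity check of how the arithmetic intensity of that $(KW \cdot OC, IC) \times (IC, IW \cdot N)$ product moves with the shapes (counting elements rather than bytes; the numbers are purely illustrative):

```python
def gemm_ai(IC, KW, OC, IW, N):
    """FLOPs per element moved for the (KW*OC, IC) x (IC, IW*N) product."""
    flops = 2 * IC * (KW * OC) * (IW * N)
    moved = IC * KW * OC + IC * IW * N + KW * OC * IW * N  # A, B and the output columns
    return flops / moved

# Tiny kernel and small IC: the product is thin and I/O bound.
print(gemm_ai(IC=2, KW=3, OC=16, IW=1024, N=1))    # ~3.8
# Same kernel, larger IC: the very same product is far more compute dense.
print(gemm_ai(IC=256, KW=3, OC=16, IW=1024, N=1))  # ~78
```

Which is also consistent with the summary further down: the bigger $IC$ is relative to the other dimensions, the better the GEMM route fares.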
Anyway, this PR wasn't really about performance: I'm doing it to add support for batching, dilation and padding (which I need to use). The better performance is a nice side effect, but it wasn't what I was looking for. The increased memory usage isn't nice, but for my use case I'm happy to trade some memory for speed.
Of course, if in the future you can make it even faster, I'll be even happier!
P.S. Here is another plot with various kernel sizes and $IW$ values chosen appropriately to maintain $AI = 1$ (the shaded bands are $3\sigma$ intervals).
> Anyway, this PR wasn't really about performance: I'm doing it to add support for batching, dilation and padding (which I need to use).
In a similar way, I work mainly on performance (which in turn ends up being mostly matrix multiplications), so that is what a lot of my commentary tends to be about :)
BTW, if the increased memory usage is a concern, instead of re-implementing the existing `conv_transpose_1d` I could add a new `conv_transpose_1d_gemm` and let users decide for themselves.
I wanted to add support for batching, padding and dilation to `conv_transpose_1d` and I decided to do it by re-implementing the operator using `mulmat` + `col2im`. `col2im` at the moment only supports 1 dimension, but it can easily be extended to support 2d without breaking its API (this can be done in a future PR). I also created the CUDA and SYCL kernels, and I wanted to add the Vulkan shader as well (maybe in a different PR).

Implementing `conv_transpose_1d` via `mulmat` + `col2im` uses more memory (not always) but is much faster (not always). Because of the increased memory usage, it has been implemented as a new operator, so that people concerned about memory can still use the old version.
Here is a plot that compares performance and memory usage of the 2 approaches:
Long story short: the bigger $IC$ is (compared to the other dimensions) the better. For very small kernels with $IC \leq 2$ the old version is better.
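For reference, a minimal NumPy sketch of what the 1D `col2im` semantics with batching, stride, padding and dilation could look like, plus how `mulmat` feeds it (my own reading of the standard operation, not the PR's actual kernels or API):

```python
import numpy as np

def col2im_1d(cols, OW, KW, stride=1, pad=0, dilation=1):
    """Scatter-add columns of shape (N, OC, KW, L) onto an (N, OC, OW) output."""
    N, OC, _, L = cols.shape
    out = np.zeros((N, OC, OW + 2 * pad))
    for l in range(L):
        for k in range(KW):
            out[:, :, l * stride + k * dilation] += cols[:, :, k, l]
    return out[:, :, pad:pad + OW]  # crop the padding off both ends

def conv_transpose_1d_gemm(x, w, stride=1, pad=0, dilation=1):
    # x: (N, IC, IW), w: (IC, OC, KW)
    N, IC, IW = x.shape
    _, OC, KW = w.shape
    OW = (IW - 1) * stride + dilation * (KW - 1) + 1 - 2 * pad
    cols = np.einsum('icw,nil->ncwl', w, x)  # the batched mulmat: (N, OC, KW, IW)
    return col2im_1d(cols, OW, KW, stride, pad, dilation)
```

With stride 1, no padding and dilation 1 this reduces to the plain scatter-add shown earlier in the thread; a 2D extension would just add a second spatial index to the scatter.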
~~I noticed that, since I started working on this, a new `im2col_back` operator was added. While `col2im` and `im2col_back` sound similar, the code and the way they work are a bit different. They can't be used as drop-in replacements for each other, but I think they could be merged into a single function; I don't know if that makes sense though.~~

~~@JohannesGaessler, being the author of `im2col_back`, do you have any opinion on this PR? Unfortunately I didn't think that my work had a chance to clash with someone else's and I didn't coordinate first.~~