CNugteren / CLBlast

Tuned OpenCL BLAS
Apache License 2.0

Huge performance degradation when calling matrix multiply in a loop #370

Open blueberry opened 5 years ago

blueberry commented 5 years ago

The issue:

When calling matrix multiplication in a tight loop (as is common in neural network training), performance is hugely degraded: by an order of magnitude or more. The catch is that it does not happen when the matrices in the operation always have the same dimensions (as is common in benchmarks), but only when a series of multiplications with different sizes gets repeated.

In short, if I create one multiplication and enqueue it many times, it works as expected. But if I create several (say, 10) multiplications of matrices of varying sizes and call those in a loop, performance becomes many, many times slower than expected.

This usage pattern is common in neural network implementations, where each layer is represented by a few matrices, and the forward and backward operations are implemented by a few matrix multiplications (and a few other, less demanding BLAS operations). The algorithm calls forward on each layer in succession, then backward in the opposite order, and then repeats this many times (say, 100 or 1000 times).

Important: cuBLAS does not have this problem. The equivalent neural network implementation in CUDA with cuBLAS works as expected. An MKL-based CPU implementation works well, too.

I use CLBlast through Java bindings and Clojure, but I am pretty certain that the bindings are not where the issue is. It seems to me (without knowing CLBlast internals well enough to check my suspicion) that CLBlast creates some temporary work/scratch space during matrix multiplication. If the dimensions stay the same, this space gets reused and the performance is good even in tight loops; but if the dimensions vary, that scratchpad has to be destroyed/released and re-created, which seems to create pressure and clogs the queue in some way when called in a tight loop...

I did not create a C demo, but I have written extensively about the code that implements this, demonstrated the problem, and written up a discussion that I hope will be enough to identify the issue. You can read it here: https://dragan.rocks/articles/19/Deep-Learning-in-Clojure-From-Scratch-to-GPU-12-A-Simple-Neural-Network-Training-API#orgee9ca34

Basically, the way to reproduce this is to create several matrix-multiplication operations with different dimensions and call them in a tight loop (and make sure the matrices are sufficiently large, I suppose).

EDIT: This happens on both the AMD R9 290X and the Vega 64, with both the old proprietary Catalyst drivers and the new open-source ROCm stack, on Linux.

@CNugteren Am I on the right track with the work/scratchpad issue? If so, is there a way to expose the scratchpad mechanism so that the caller can provide a sufficiently large memory space in advance and let CLBlast reuse it throughout the lifecycle?

CNugteren commented 5 years ago

Just a quick response for now from my phone (I'll have a more detailed look later): there is an API in CLBlast to pass a temporary buffer (is this what you mean by scratchpad?) that you have already pre-allocated. Perhaps you could try that?

blueberry commented 5 years ago

I will. Can you point me to the right function in the API (when you have time; I'm not in a hurry)? I am using CLBlast via JOCLBlast, and I hope that part is supported, but even if it is not, I'll look into adding it; I just need to know the right way to do it. Having an example would help, but if one doesn't exist, it is not a must.

Thank you!

CNugteren commented 5 years ago

The regular C++ call to Gemm has an optional argument for providing a pre-allocated OpenCL temporary buffer (https://github.com/CNugteren/CLBlast/blob/master/include/clblast.h#L526). There's also a function to query, given specific dimensions, what the minimum size of that buffer should be (https://github.com/CNugteren/CLBlast/blob/master/include/clblast.h#L695), but you can also guess or be a bit pessimistic (at least to try it out). There's a little bit of documentation here as well.
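
Roughly, the two calls could be combined like this (an untested sketch: double-check the exact function names and signatures against clblast.h, and whether the reported size is in bytes; error handling is omitted):

```cpp
// Sketch only: query the minimum temporary-buffer size for a given GEMM shape,
// allocate it once, and pass it as the optional last argument of clblast::Gemm.
#include <clblast.h>
#include <CL/cl.h>

void gemm_with_temp_buffer(cl_context context, cl_command_queue queue,
                           size_t m, size_t n, size_t k,
                           cl_mem a, cl_mem b, cl_mem c) {
  // Ask CLBlast how large the temporary buffer needs to be for this shape
  // (assumed here to be reported in bytes).
  size_t temp_size = 0;
  clblast::GemmTempBufferSize<float>(clblast::Layout::kColMajor,
                                     clblast::Transpose::kNo, clblast::Transpose::kNo,
                                     m, n, k,
                                     0, m,   // a_offset, a_ld
                                     0, k,   // b_offset, b_ld
                                     0, m,   // c_offset, c_ld
                                     &queue, temp_size);

  // Allocate the buffer once; a size of 0 means no temporary buffer is needed.
  cl_mem temp = nullptr;
  if (temp_size > 0) {
    temp = clCreateBuffer(context, CL_MEM_READ_WRITE, temp_size, nullptr, nullptr);
  }

  // The usual GEMM call, with the pre-allocated temporary buffer appended.
  clblast::Gemm(clblast::Layout::kColMajor,
                clblast::Transpose::kNo, clblast::Transpose::kNo,
                m, n, k,
                1.0f, a, 0, m,
                      b, 0, k,
                0.0f, c, 0, m,
                &queue, nullptr,  // queue, event
                temp);            // optional pre-allocated temporary buffer
  clFinish(queue);

  if (temp != nullptr) { clReleaseMemObject(temp); }
}
```

In your case you would allocate that buffer once up front (sized for your largest GEMM, or just pessimistically large) and re-use it for all calls, rather than releasing it after every multiplication.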

As for a sample, I don't have anything, unfortunately. But given the basic GEMM C++ sample, I don't think it should be too difficult to extend it to your scenario (e.g. loop over different values of m, n, and k), measure the speed, and then measure the speed again with a temporary buffer, roughly as sketched below. If you can reproduce your problem based on such an example and share it here, that would be great: then we have something to work with.
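
For instance, something along these lines (again just a sketch: the shapes, buffer allocation, and data initialization are placeholders, and the timing is plain host-side wall-clock time):

```cpp
// Sketch: run a repeating series of GEMMs with *different* shapes in a tight
// loop, synchronizing once per cycle, and time it. Call it once with
// temp = nullptr and once with a pre-allocated temporary buffer (at least as
// large as the maximum size reported by the query function) to compare.
// The buffers a[i], b[i], c[i] are assumed to be allocated elsewhere and
// large enough for each (m, n, k) triple.
#include <clblast.h>
#include <CL/cl.h>
#include <array>
#include <chrono>
#include <vector>

double time_cycles(cl_command_queue queue,
                   const std::vector<std::array<size_t, 3>>& shapes,  // per-layer (m, n, k)
                   const std::vector<cl_mem>& a, const std::vector<cl_mem>& b,
                   const std::vector<cl_mem>& c,
                   cl_mem temp, int cycles) {
  const auto start = std::chrono::steady_clock::now();
  for (int cycle = 0; cycle < cycles; ++cycle) {       // e.g. 100 or 1000 iterations
    for (size_t i = 0; i < shapes.size(); ++i) {       // several different shapes per cycle
      const size_t m = shapes[i][0], n = shapes[i][1], k = shapes[i][2];
      clblast::Gemm(clblast::Layout::kColMajor,
                    clblast::Transpose::kNo, clblast::Transpose::kNo,
                    m, n, k,
                    1.0f, a[i], 0, m,
                          b[i], 0, k,
                    0.0f, c[i], 0, m,
                    &queue, nullptr, temp);
    }
    clFinish(queue);  // synchronize once per cycle, as in the report above
  }
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - start).count();  // seconds
}
```

Comparing the result for temp = nullptr with the result for a single pre-allocated (pessimistically sized) buffer should show whether the temporary-buffer allocation is actually the culprit.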

By the way, when you say 'huge performance degradation', what do you mean exactly? Allocating a temporary buffer can take some time, but it should be relatively small compared to a large matrix multiplication (what is large to you?) on a typical device.

blueberry commented 5 years ago

By the way, when you say 'huge performance degradation', what do you mean exactly? Allocating a temporary buffer can take some time, but it should be relatively small compared to a large matrix multiplication (what is large to you?) on a typical device.

At least an order of magnitude. In that particular example, something that takes 2 seconds with CUDA on a GTX 1080 Ti, and 3 seconds with CLBlast on an R9 290X if I synchronize the queue at the end of every cycle (so it is synchronized after several different multiplications), takes 30 seconds to a minute if I just let the loop fill up the queue. Obviously, at some point the queue gets clogged. Fortunately, it is not a pressing issue, so when I get back to this I'll try your proposed solution of providing temporary buffers, and then I'll take measurements and post how it went.

Thank you!

CNugteren commented 5 years ago

Thanks for the explanation. I can't imagine that the buffer allocation would be such an overhead, so I doubt that this will fix it. You could also compile CLBlast in verbose mode (through a CMake setting); that way it will print what it is doing and even include timings for various things (e.g. compiling the kernels and running them).

Speaking of compiling kernels: CLBlast uses a cache, so only the first run should be slow. However, the cache is kept per device and also per OpenCL context, so for each new context it will need to rebuild (part of) the cache. So perhaps your code (or JOCLBlast) somehow creates a different context each time, requiring CLBlast to re-compile the kernels every time? Good to double-check that.