Closed CNugteren closed 6 years ago
Maybe a good compromise would be to offer option 1 (thus a non-standard interface) and also offer whatever standard interface on top of this with an additional penalty (the existing mechanism).
Does the buffer need to be full-size, or is it possible to only have those buffers for the chunks of matrices that are currently being processed by the kernels? If it is possible to use only these chunks, maybe those should be pre-allocated and used as a normal part of CLBlast functioning, with that smallish amount of memory claimed at startup.
Thanks for your feedback! Note that I've just edited the text above where it said I would work on option 1 first, but I meant option 2. Sorry for the confusion.
Yes, having option 1 (the different interface) as an extra (optional) interface could make sense. I would also have to provide an interface to a function that computes the required size, because it's specific to the tuning parameters. This would be aimed at advanced users, I guess.
No, small chunks are not an option. Not being part of the main buffer complicates things quite a lot in the kernel itself, and it will also not help in case a whole matrix needs to be transposed.
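To illustrate why the temporary buffer has to cover the whole matrix: the in-direct kernel consumes a transposed and/or padded copy whose dimensions are rounded up to a multiple of the kernel's tile size. The sketch below is a simplified host-side illustration of that idea, not CLBlast's actual code; the function name and tile handling are assumptions for demonstration.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch (not CLBlast's actual implementation): produce a
// transposed copy of a row-major rows-by-cols matrix, with both output
// dimensions padded up to a multiple of 'tile' and the padding zeroed.
// The output must hold the entire transposed matrix, so a chunk-sized
// temporary buffer would not suffice.
std::vector<float> pad_and_transpose(const std::vector<float>& a,
                                     size_t rows, size_t cols, size_t tile) {
  const size_t padded_rows = ((cols + tile - 1) / tile) * tile;  // transposed shape
  const size_t padded_cols = ((rows + tile - 1) / tile) * tile;
  std::vector<float> result(padded_rows * padded_cols, 0.0f);    // zero padding
  for (size_t i = 0; i < rows; ++i) {
    for (size_t j = 0; j < cols; ++j) {
      result[j * padded_cols + i] = a[i * cols + j];             // transpose
    }
  }
  return result;
}
```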
Option number 2 (max of one instead of 3 buffers) is now implemented, see PR #197. I will work on option 3 next (tuning of direct vs in-direct) as that is good to have in any case.
Option number 3 is now also implemented, see PR #212. For now the immediate performance problems related to this issue are solved, thus I will close it. In the future I might implement option 1 as an alternative for advanced users.
"Option 1" will be very useful, especially for users of embedded GPUs like Mali, Adreno, etc. As mentioned in #201, buffer creation takes ~15-35 ms per GEMM call depending on the M, N, K sizes. Buffer creation alone adds an extra 150 ms during my custom DNN forward pass. This additional delay could be avoided if the "Option 1" feature were available.
Thanks for the feedback. Yes, I'll be working on option 1 as well. I have it already on the roadmap (uncommitted version).
For the record, option 1 is now implemented as an optional argument in the C++ interface (#238) and is available in master. Use as follows:
size_t temp_buffer_size;
GemmTempBufferSize<T>(layout, a_transpose, b_transpose, m, n, k,
a_offset, a_ld, b_offset, b_ld, c_offset, c_ld,
&queue, temp_buffer_size);
cl_mem temp_buffer = ... // allocate with size 'temp_buffer_size'
Gemm(layout, a_transpose, b_transpose,
m, n, k, alpha,
a_mat, a_offset, a_ld,
b_mat, b_offset, b_ld, beta,
c_mat, c_offset, c_ld,
&queue, &event, temp_buffer);
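The benefit of this interface is that the temporary buffer can be allocated once and reused across many GEMM calls. A minimal host-side sketch of that reuse pattern is below; it uses a plain `std::vector<char>` as a stand-in for the device buffer (a real version would wrap `clCreateBuffer`/`clReleaseMemObject`), and the class and method names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Sketch of a grow-only scratch buffer (illustrative only; std::vector
// stands in for a cl_mem). Repeated GEMM calls with varying sizes then
// trigger at most a handful of allocations instead of one per call.
class ScratchBuffer {
 public:
  // Returns storage of at least 'required' bytes, reallocating only
  // when the current capacity is too small.
  void* ensure(size_t required) {
    if (required > storage_.size()) { storage_.resize(required); }
    return storage_.data();
  }
  size_t capacity() const { return storage_.size(); }
 private:
  std::vector<char> storage_;
};
```

Before each GEMM call, query `GemmTempBufferSize` and call `ensure` with the result; the buffer only grows when a larger problem size arrives.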
@CNugteren Great! It would be really helpful if you could add it to the C interface as well :)
Hi @CNugteren, based on your C++ & CUDA interfaces for getting the temporary buffer size, I tried adding it to the C API. I'm facing a few issues while doing that; a code snippet is provided below. Could you please provide some pointers on how to go about this? Please correct me if I've missed anything.
Changes: added a declaration to include/clblast_c.h and called clblast::GemmTempBufferSize from src/clblast_c.cpp.
Error:
/home/siva/02_Experiments/CLBlast/src/clblast_c.cpp: In function ‘CLBlastStatusCode CLBlastGemmTempBufferSize(CLBlastLayout, CLBlastTranspose, CLBlastTranspose, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, _cl_command_queue**, _cl_event**, size_t&)’:
/home/siva/02_Experiments/CLBlast/src/clblast_c.cpp:3906:52: error: no matching function for call to ‘GemmTempBufferSize(clblast::Layout, clblast::Transpose, clblast::Transpose, const size_t&, const size_t&, const size_t&, const size_t&, const size_t&, const size_t&, const size_t&, const size_t&, const size_t&, _cl_command_queue*&, size_t&)’
*queue, temp_buffer_size)
^
/home/siva/02_Experiments/CLBlast/src/clblast_c.cpp:3906:52: note: candidate is:
In file included from /home/siva/02_Experiments/CLBlast/src/utilities/utilities.hpp:28:0,
from /home/siva/02_Experiments/CLBlast/src/clblast_c.cpp:17:
/home/siva/02_Experiments/CLBlast/include/clblast.h:654:12: note: template<class T> clblast::StatusCode clblast::GemmTempBufferSize(clblast::Layout, clblast::Transpose, clblast::Transpose, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, _cl_command_queue**, size_t&)
StatusCode GemmTempBufferSize(const Layout layout, const Transpose a_transpose, const Transpose b_transpose,
^
/home/siva/02_Experiments/CLBlast/include/clblast.h:654:12: note: template argument deduction/substitution failed:
/home/siva/02_Experiments/CLBlast/src/clblast_c.cpp:3906:52: note: cannot convert ‘* queue’ (type ‘cl_command_queue {aka _cl_command_queue*}’) to type ‘_cl_command_queue**’
*queue, temp_buffer_size)
Code snippet:
include/clblast_c.h:
// Custom additional status codes for CLBlast
CLBlastInsufficientMemoryTemp = -2050, // Temporary buffer provided to GEMM routine is too small
.......
.......
// GEMM temporary buffer size
CLBlastStatusCode PUBLIC_API CLBlastGemmTempBufferSize(const CLBlastLayout layout, const CLBlastTranspose a_transpose, const CLBlastTranspose b_transpose,
const size_t m, const size_t n, const size_t k,
const size_t a_offset, const size_t a_ld,
const size_t b_offset, const size_t b_ld,
const size_t c_offset, const size_t c_ld,
cl_command_queue* queue, cl_event* event,
size_t& temp_buffer_size);
src/clblast_c.cpp:
// GEMM
CLBlastStatusCode CLBlastSgemm(const CLBlastLayout layout, const CLBlastTranspose a_transpose, const CLBlastTranspose b_transpose,
const size_t m, const size_t n, const size_t k,
const float alpha,
const cl_mem a_buffer, const size_t a_offset, const size_t a_ld,
const cl_mem b_buffer, const size_t b_offset, const size_t b_ld,
const float beta,
cl_mem c_buffer, const size_t c_offset, const size_t c_ld,
cl_command_queue* queue, cl_event* event,
cl_mem temp_buffer) {
try {
return static_cast<CLBlastStatusCode>(
clblast::Gemm(static_cast<clblast::Layout>(layout),
static_cast<clblast::Transpose>(a_transpose),
static_cast<clblast::Transpose>(b_transpose),
m, n, k,
alpha,
a_buffer, a_offset, a_ld,
b_buffer, b_offset, b_ld,
beta,
c_buffer, c_offset, c_ld,
queue, event, temp_buffer)
);
} catch (...) { return static_cast<CLBlastStatusCode>(clblast::DispatchExceptionForC()); }
}
...........
...........
// GEMM get temporary buffer size
CLBlastStatusCode CLBlastGemmTempBufferSize(const CLBlastLayout layout, const CLBlastTranspose a_transpose, const CLBlastTranspose b_transpose,
const size_t m, const size_t n, const size_t k,
const size_t a_offset, const size_t a_ld,
const size_t b_offset, const size_t b_ld,
const size_t c_offset, const size_t c_ld,
cl_command_queue* queue, cl_event* event,
size_t& temp_buffer_size){
try {
return static_cast<CLBlastStatusCode>(
clblast::GemmTempBufferSize(static_cast<clblast::Layout>(layout),
static_cast<clblast::Transpose>(a_transpose),
static_cast<clblast::Transpose>(b_transpose),
m, n, k,
a_offset, a_ld,
b_offset, b_ld,
c_offset, c_ld,
queue, temp_buffer_size)
);
} catch (...) { return static_cast<CLBlastStatusCode>(clblast::DispatchExceptionForC()); }
}
You'll probably have to provide the template type for the GemmTempBufferSize function when you call it, e.g. clblast::GemmTempBufferSize<float>(static_cast<clblast::Layout>(layout), ...). So actually you'll have to write 5 CLBlastGemmTempBufferSize functions, and name this one CLBlastSGemmTempBufferSize to indicate that it is for single precision.
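The pattern being suggested can be sketched as follows. This is a toy illustration, not CLBlast's actual code: the function names and the size computation are stand-ins. Since a C caller cannot supply C++ template arguments, each precision gets its own `extern "C"` wrapper that instantiates the template explicitly.

```cpp
#include <cstddef>

// Toy stand-in for the templated C++ size query (the real function in
// CLBlast takes layout, transpose, and size arguments and a queue).
template <typename T>
size_t TempBufferSize(size_t m, size_t n) {
  return m * n * sizeof(T);  // simplified size computation
}

// Per-precision C wrappers: the template argument is fixed explicitly,
// so no deduction is needed at the C boundary. Names are hypothetical,
// shortened from the CLBlastSGemmTempBufferSize naming scheme.
extern "C" size_t CLBlastSTempBufferSize(size_t m, size_t n) {
  return TempBufferSize<float>(m, n);   // single precision
}
extern "C" size_t CLBlastDTempBufferSize(size_t m, size_t n) {
  return TempBufferSize<double>(m, n);  // double precision
}
```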
@CNugteren I've added the C-specific temp buffer size API and created PR #253. Please review.
The main 'in-direct' GEMM kernel in CLBlast might require up to 3 temporary buffers to store transposed or padded copies of the A, B, and C matrices. Although buffer creation seems fast, the creation cost becomes apparent only after the first use in a kernel (at least on a test CUDA device).
Temporary buffer creation can become quite a big issue for CLBlast for small problem sizes or for fast GPUs, as this takes ~0.4 ms per buffer on a test GeForce Titan X. Here are some ideas of how to potentially solve this:
1. Offer a non-standard interface in which the user pre-allocates the temporary buffer and passes it in.
2. Use at most a single temporary buffer instead of up to 3 separate ones.
3. Tune the transition point between the direct kernel (which needs no temporary buffers) and the in-direct kernel automatically per device.
I'm leaning towards implementing option 2 soon (making it one buffer instead of three), and following up later with option 3 to automatically find the transition point from the direct to the in-direct kernel. Tuning this automatically is on the list anyway, since it is currently just a manual per-vendor setting.
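The dispatch that option 3 would tune can be sketched as a simple size threshold. This is illustrative only; the actual crossover in CLBlast is a device-specific tuned value, and the function and parameter names here are hypothetical.

```cpp
#include <cstddef>

// Illustrative kernel dispatch: below the (tuned, device-specific)
// threshold, the direct kernel runs straight on the user's buffers and
// no temporary buffers are created at all; above it, the in-direct
// kernel with its padded/transposed copies tends to win on throughput.
bool UseDirectKernel(size_t m, size_t n, size_t k, size_t threshold) {
  return m * n * k < threshold;  // small problems -> direct kernel
}
```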
Any thoughts? Other ideas how to make temporary buffer creation fast?