Closed CNugteren closed 6 years ago
Maybe a good compromise would be to offer option 1 (thus a non-standard interface) and also offer whatever standard interface on top of this with an additional penalty (the existing mechanism).
Does the buffer need to be full-size, or is it possible to only have those buffers for the chunks of matrices that are currently being processed by the kernels? If it is possible to use only these chunks, maybe those should be pre-allocated and used as a normal part of CLBlast functioning, with that smallish amount of memory claimed at startup.
Thanks for your feedback! Note that I've just edited the text above where it said I would work on option 1 first, but I meant option 2. Sorry for the confusion.
Yes, having option 1 (the different interface) as an extra (optional) interface could make sense. I would also have to provide an interface to a function that computes the required size, because it's specific to the tuning parameters. This would be aimed at advanced users, I guess.
No, small chunks are not an option. Not being part of the main buffer complicates things quite a lot in the kernel itself, and it will also not help in case a whole matrix needs to be transposed.
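To illustrate why the temporary buffer has to cover the whole matrix: the in-direct kernel consumes a transposed and/or padded copy whose dimensions are rounded up to a multiple of the kernel's tile size. The sketch below is a simplified host-side illustration of that idea, not CLBlast's actual code; the function name and tile handling are assumptions for demonstration.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch (not CLBlast's actual implementation): produce a
// transposed copy of a row-major rows-by-cols matrix, with both output
// dimensions padded up to a multiple of 'tile' and the padding zeroed.
// The output must hold the entire transposed matrix, so a chunk-sized
// temporary buffer would not suffice.
std::vector<float> pad_and_transpose(const std::vector<float>& a,
                                     size_t rows, size_t cols, size_t tile) {
  const size_t padded_rows = ((cols + tile - 1) / tile) * tile;  // transposed shape
  const size_t padded_cols = ((rows + tile - 1) / tile) * tile;
  std::vector<float> result(padded_rows * padded_cols, 0.0f);    // zero padding
  for (size_t i = 0; i < rows; ++i) {
    for (size_t j = 0; j < cols; ++j) {
      result[j * padded_cols + i] = a[i * cols + j];             // transpose
    }
  }
  return result;
}
```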
Option number 2 (max of one instead of 3 buffers) is now implemented, see PR #197. I will work on option 3 next (tuning of direct vs in-direct) as that is good to have in any case.
Option number 3 is now also implemented, see PR #212. For now the immediate performance problems related to this issue are solved, thus I will close it. In the future I might implement option 1 as an alternative for advanced users.
"Option 1" will be very useful, especially for users of embedded GPUs like Mali, Adreno, etc. As mentioned in #201, buffer creation takes ~15-35 ms per GEMM call depending on the M, N, K sizes. Buffer creation alone adds an extra 150 ms during my custom DNN forward pass. This additional delay could be avoided if the "Option 1" feature were available.
Thanks for the feedback. Yes, I'll be working on option 1 as well. I have it already on the roadmap (uncommitted version).
For the record, option 1 is now implemented as an optional argument in the C++ interface (#238) and is available in master. Use as follows:
size_t temp_buffer_size;
GemmTempBufferSize<T>(layout, a_transpose, b_transpose, m, n, k,
a_offset, a_ld, b_offset, b_ld, c_offset, c_ld,
&queue, temp_buffer_size);
cl_mem temp_buffer = ... // allocate with size 'temp_buffer_size'
Gemm(layout, a_transpose, b_transpose,
m, n, k, alpha,
a_mat, a_offset, a_ld,
b_mat, b_offset, b_ld, beta,
c_mat, c_offset, c_ld,
&queue, &event, temp_buffer);
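The benefit of this interface is that the temporary buffer can be allocated once and reused across many GEMM calls. A minimal host-side sketch of that reuse pattern is below; it uses a plain `std::vector<char>` as a stand-in for the device buffer (a real version would wrap `clCreateBuffer`/`clReleaseMemObject`), and the class and method names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Sketch of a grow-only scratch buffer (illustrative only; std::vector
// stands in for a cl_mem). Repeated GEMM calls with varying sizes then
// trigger at most a handful of allocations instead of one per call.
class ScratchBuffer {
 public:
  // Returns storage of at least 'required' bytes, reallocating only
  // when the current capacity is too small.
  void* ensure(size_t required) {
    if (required > storage_.size()) { storage_.resize(required); }
    return storage_.data();
  }
  size_t capacity() const { return storage_.size(); }
 private:
  std::vector<char> storage_;
};
```

Before each GEMM call, query `GemmTempBufferSize` and call `ensure` with the result; the buffer only grows when a larger problem size arrives.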
@CNugteren Great! It would be really helpful if you could add it to the C interface as well :)
Hi @CNugteren, based on your C++ & CUDA interfaces for getting the temporary buffer size, I tried adding it to the C API. I'm facing a few issues while doing that; a code snippet is provided below. Could you please provide some pointers on how to go about this? Please correct me if I've missed anything.
Changes: added a declaration to include/clblast_c.h and called clblast::GemmTempBufferSize from src/clblast_c.cpp.
Error:
/home/siva/02_Experiments/CLBlast/src/clblast_c.cpp: In function ‘CLBlastStatusCode CLBlastGemmTempBufferSize(CLBlastLayout, CLBlastTranspose, CLBlastTranspose, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, _cl_command_queue**, _cl_event**, size_t&)’:
/home/siva/02_Experiments/CLBlast/src/clblast_c.cpp:3906:52: error: no matching function for call to ‘GemmTempBufferSize(clblast::Layout, clblast::Transpose, clblast::Transpose, const size_t&, const size_t&, const size_t&, const size_t&, const size_t&, const size_t&, const size_t&, const size_t&, const size_t&, _cl_command_queue*&, size_t&)’
*queue, temp_buffer_size)
^
/home/siva/02_Experiments/CLBlast/src/clblast_c.cpp:3906:52: note: candidate is:
In file included from /home/siva/02_Experiments/CLBlast/src/utilities/utilities.hpp:28:0,
from /home/siva/02_Experiments/CLBlast/src/clblast_c.cpp:17:
/home/siva/02_Experiments/CLBlast/include/clblast.h:654:12: note: template<class T> clblast::StatusCode clblast::GemmTempBufferSize(clblast::Layout, clblast::Transpose, clblast::Transpose, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, _cl_command_queue**, size_t&)
StatusCode GemmTempBufferSize(const Layout layout, const Transpose a_transpose, const Transpose b_transpose,
^
/home/siva/02_Experiments/CLBlast/include/clblast.h:654:12: note: template argument deduction/substitution failed:
/home/siva/02_Experiments/CLBlast/src/clblast_c.cpp:3906:52: note: cannot convert ‘* queue’ (type ‘cl_command_queue {aka _cl_command_queue*}’) to type ‘_cl_command_queue**’
*queue, temp_buffer_size)
Code snippet:
include/clblast_c.h:
// Custom additional status codes for CLBlast
CLBlastInsufficientMemoryTemp = -2050, // Temporary buffer provided to GEMM routine is too small
.......
.......
// GEMM temporary buffer size
CLBlastStatusCode PUBLIC_API CLBlastGemmTempBufferSize(const CLBlastLayout layout, const CLBlastTranspose a_transpose, const CLBlastTranspose b_transpose,
const size_t m, const size_t n, const size_t k,
const size_t a_offset, const size_t a_ld,
const size_t b_offset, const size_t b_ld,
const size_t c_offset, const size_t c_ld,
cl_command_queue* queue, cl_event* event,
size_t& temp_buffer_size);
src/clblast_c.cpp:
// GEMM
CLBlastStatusCode CLBlastSgemm(const CLBlastLayout layout, const CLBlastTranspose a_transpose, const CLBlastTranspose b_transpose,
const size_t m, const size_t n, const size_t k,
const float alpha,
const cl_mem a_buffer, const size_t a_offset, const size_t a_ld,
const cl_mem b_buffer, const size_t b_offset, const size_t b_ld,
const float beta,
cl_mem c_buffer, const size_t c_offset, const size_t c_ld,
cl_command_queue* queue, cl_event* event,
cl_mem temp_buffer) {
try {
return static_cast<CLBlastStatusCode>(
clblast::Gemm(static_cast<clblast::Layout>(layout),
static_cast<clblast::Transpose>(a_transpose),
static_cast<clblast::Transpose>(b_transpose),
m, n, k,
alpha,
a_buffer, a_offset, a_ld,
b_buffer, b_offset, b_ld,
beta,
c_buffer, c_offset, c_ld,
queue, event, temp_buffer)
);
} catch (...) { return static_cast<CLBlastStatusCode>(clblast::DispatchExceptionForC()); }
}
...........
...........
// GEMM get temporary buffer size
CLBlastStatusCode CLBlastGemmTempBufferSize(const CLBlastLayout layout, const CLBlastTranspose a_transpose, const CLBlastTranspose b_transpose,
const size_t m, const size_t n, const size_t k,
const size_t a_offset, const size_t a_ld,
const size_t b_offset, const size_t b_ld,
const size_t c_offset, const size_t c_ld,
cl_command_queue* queue, cl_event* event,
size_t& temp_buffer_size){
try {
return static_cast<CLBlastStatusCode>(
clblast::GemmTempBufferSize(static_cast<clblast::Layout>(layout),
static_cast<clblast::Transpose>(a_transpose),
static_cast<clblast::Transpose>(b_transpose),
m, n, k,
a_offset, a_ld,
b_offset, b_ld,
c_offset, c_ld,
queue, temp_buffer_size)
);
} catch (...) { return static_cast<CLBlastStatusCode>(clblast::DispatchExceptionForC()); }
}
You'll probably have to provide the template type for the GemmTempBufferSize function when you call it, e.g. clblast::GemmTempBufferSize<float>(static_cast<clblast::Layout>(layout), ...). So actually you'll have to write 5 CLBlastGemmTempBufferSize functions, and name this one CLBlastSGemmTempBufferSize to indicate that it is for single precision.
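The pattern being suggested can be sketched as follows. This is a toy illustration, not CLBlast's actual code: the function names and the size computation are stand-ins. Since a C caller cannot supply C++ template arguments, each precision gets its own `extern "C"` wrapper that instantiates the template explicitly.

```cpp
#include <cstddef>

// Toy stand-in for the templated C++ size query (the real function in
// CLBlast takes layout, transpose, and size arguments and a queue).
template <typename T>
size_t TempBufferSize(size_t m, size_t n) {
  return m * n * sizeof(T);  // simplified size computation
}

// Per-precision C wrappers: the template argument is fixed explicitly,
// so no deduction is needed at the C boundary. Names are hypothetical,
// shortened from the CLBlastSGemmTempBufferSize naming scheme.
extern "C" size_t CLBlastSTempBufferSize(size_t m, size_t n) {
  return TempBufferSize<float>(m, n);   // single precision
}
extern "C" size_t CLBlastDTempBufferSize(size_t m, size_t n) {
  return TempBufferSize<double>(m, n);  // double precision
}
```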
@CNugteren I've added the C-specific temp buffer size API and created PR #253. Please review.
The main 'in-direct' GEMM kernel in CLBlast might require up to 3 temporary buffers to store transposed or padded copies of the A, B, and C matrices. Although buffer creation seems fast, the creation cost becomes apparent only after the first use in a kernel (at least on a test CUDA device).
Temporary buffer creation can become quite a big issue for CLBlast for small problem sizes or for fast GPUs, as this takes ~0.4 ms per buffer on a test GeForce Titan X. Here are some ideas of how to potentially solve this:
1. Offer a non-standard interface in which the user pre-allocates the temporary buffer and passes it in.
2. Use at most a single temporary buffer instead of up to 3 separate ones.
3. Tune the transition point between the direct kernel (which needs no temporary buffers) and the in-direct kernel automatically per device.
I'm leaning towards implementing option 2 soon (making it one buffer instead of three), and following up later with option 3 to automatically find the transition point from the direct to the in-direct kernel. Tuning this automatically is on the list anyway, since it is currently just a manual per-vendor setting.
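The dispatch that option 3 would tune can be sketched as a simple size threshold. This is illustrative only; the actual crossover in CLBlast is a device-specific tuned value, and the function and parameter names here are hypothetical.

```cpp
#include <cstddef>

// Illustrative kernel dispatch: below the (tuned, device-specific)
// threshold, the direct kernel runs straight on the user's buffers and
// no temporary buffers are created at all; above it, the in-direct
// kernel with its padded/transposed copies tends to win on throughput.
bool UseDirectKernel(size_t m, size_t n, size_t k, size_t threshold) {
  return m * n * k < threshold;  // small problems -> direct kernel
}
```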
Any thoughts? Other ideas how to make temporary buffer creation fast?