Open fwyzard opened 2 weeks ago
@makortel FYI
Q1 - There are functions `alpaka::allocAsyncBuf` and `alpaka::allocAsyncBufIfSupported` which use queues. Are we assuming that the user deliberately selected `allocBuf` for a reason?
Q2 - Can't we create something like `alpaka::allocAsyncMappedBuf`, similar to `allocAsyncBuf`?
> Q1 - There are functions `alpaka::allocAsyncBuf` and `alpaka::allocAsyncBufIfSupported` which use queues. Are we assuming that the user deliberately selected `allocBuf` for a reason?
Yes.
As a library, alpaka shouldn't be restricting what users are supposed to do, though of course it can restrict what is or isn't supported. But it would be nice to be able to catch at compile time (better) or at run time what is and isn't supported.
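To illustrate the difference between catching an unsupported operation at compile time and at run time, here is a minimal C++17 sketch. Every name in it (`BackendWithAsyncAlloc`, `SupportsAsyncAlloc`, `allocAsyncStrict`, `allocAsyncChecked`, `allocAsyncOrFallback`) is hypothetical and is not part of the alpaka API; the functions return a string instead of allocating anything.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <type_traits>

// Hypothetical back-end tags; they only stand in for real device types.
struct BackendWithAsyncAlloc {};
struct BackendWithoutAsyncAlloc {};

// Hypothetical trait: does a back-end support queue-ordered allocation?
template <typename TBackend>
struct SupportsAsyncAlloc : std::false_type {};

template <>
struct SupportsAsyncAlloc<BackendWithAsyncAlloc> : std::true_type {};

// Compile-time check: instantiating this for an unsupported back-end
// fails to build, with the message below.
template <typename TBackend>
std::string allocAsyncStrict() {
    static_assert(SupportsAsyncAlloc<TBackend>::value,
                  "this back-end does not support queue-ordered allocation");
    return "async";
}

// Run-time check: always compiles, but throws for an unsupported back-end.
template <typename TBackend>
std::string allocAsyncChecked() {
    if (!SupportsAsyncAlloc<TBackend>::value) {
        throw std::runtime_error("queue-ordered allocation is not supported");
    }
    return "async";
}

// Silent fallback: picks the synchronous path when the asynchronous one
// is not available, in the spirit of an "IfSupported" function.
template <typename TBackend>
std::string allocAsyncOrFallback() {
    if constexpr (SupportsAsyncAlloc<TBackend>::value) {
        return "async";
    } else {
        return "sync";
    }
}
```

With these definitions, `allocAsyncStrict<BackendWithoutAsyncAlloc>()` is rejected by the compiler, `allocAsyncChecked<BackendWithoutAsyncAlloc>()` throws `std::runtime_error`, and `allocAsyncOrFallback<BackendWithoutAsyncAlloc>()` returns `"sync"`.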
> Q2 - Can't we create something like `alpaka::allocAsyncMappedBuf`, similar to `allocAsyncBuf`?
There is no native support for this in CUDA, ROCm, etc.
We have implemented it in the CMS code; it is what I am referring to as `allocCachedBuf`.
While reviewing the use of alpaka buffers in the CMS code, we have seen a recurrent pattern that relies on the (undocumented) behaviour of the underlying back-ends.
Consider this example:
In principle we can observe different behaviours depending on how the buffer was allocated in 1. and on what device back-end is being used:

- `allocBuf` (1.a): in principle there is no synchronisation, and the memory may be freed and reused before the copy completes (or even starts);
- `allocMappedBuf` (1.b): in principle there is no synchronisation; however for some back-ends the call in the destructor of the buffer (_e.g._ the call to `cudaFreeHost`) is likely to block and synchronise with all back-end (e.g. CUDA) activity, making the copy safe;
- `allocCachedBuf` (1.c): the buffer is guaranteed to be valid until all operations in `queue` have completed (assuming the buffer and the copy use the same queue).

Note: `allocCachedBuf(host, queue, extent)` is a CMS implementation similar to `allocAsyncBuf(queue, extent)`. I'm working to improve its performance and eventually upstream it to alpaka :-)

Given that even such a simple example is error prone, we have been wondering how we could improve the situation.
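To make the guarantee described for case (1.c) concrete, here is a toy model in plain C++ with no alpaka dependency. It is only a sketch under stated assumptions: `ToyQueue`, `ToyBuffer`, `allocToyBuf`, `enqueueCopy` and `copyFromTemporary` are hypothetical names, the "queue" simply stores tasks and runs them in `wait()`, and nothing here reflects how the CMS `allocCachedBuf` or any alpaka back-end is actually implemented. The point is the ownership rule: the queued copy keeps the source memory alive after the user's handle has gone out of scope.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for a device queue: tasks are stored and only run in wait().
class ToyQueue {
public:
    void enqueue(std::function<void()> task) { tasks_.push_back(std::move(task)); }

    // Run all pending tasks in order, then drop them (and whatever they own).
    void wait() {
        for (auto& task : tasks_) {
            task();
        }
        tasks_.clear();
    }

private:
    std::vector<std::function<void()>> tasks_;
};

// Toy buffer: a reference-counted handle to host memory.
using ToyBuffer = std::shared_ptr<std::vector<int>>;

ToyBuffer allocToyBuf(std::size_t extent) {
    return std::make_shared<std::vector<int>>(extent);
}

// Enqueue a copy. The task captures both handles by value, so the memory
// stays alive until the task has run and has been dropped by the queue.
void enqueueCopy(ToyQueue& queue, ToyBuffer dst, ToyBuffer src) {
    queue.enqueue([dst, src] { *dst = *src; });
}

// The source buffer goes out of scope before the copy runs.
std::vector<int> copyFromTemporary() {
    ToyQueue queue;
    ToyBuffer dst = allocToyBuf(3);
    std::weak_ptr<std::vector<int>> observer;
    {
        ToyBuffer src = allocToyBuf(3);
        *src = {1, 2, 3};
        observer = src;
        enqueueCopy(queue, dst, src);
    }  // the user's handle is gone, but the queued task still owns the memory
    assert(!observer.expired());
    queue.wait();
    assert(observer.expired());  // released only after the queue has drained
    return *dst;
}
```

In this model `copyFromTemporary()` returns `{1, 2, 3}` even though `src` was destroyed before the copy executed. A buffer without that ownership rule, as in case (1.a), would have been freed at the closing brace, which is exactly the hazard described above.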
A couple of ideas: