Imagine we read an RGB image with three uchars per pixel (uchar[3]). All pixels of the image are packed into a cl_mem buffer. We then want to use async_work_group_copy_2D2D() to optimize the memory transfer between __global and __local.
The point is:
If we use the uchar3 vector type, the OpenCL 1.2 spec states that "async_work_group_copy and async_work_group_strided_copy for 3-component vector types behave as async_work_group_copy and async_work_group_strided_copy respectively for 4-component vector types", probably because 3-component vectors are aligned like 4-component ones. As a result, we end up doing uchar4-like pointer arithmetic on a genuinely packed uchar[3] buffer, which turns out to be very error-prone.
This drawback can be avoided by falling back to the unit uchar interface and multiplying the associated num_elements_per_line and src/dst strides by three. But this adds verbosity (and complexity) to the kernel code.
Moreover, in other stencil codes the cell composition matches none of the vector types specified in the spec: for example, float9 for a two-dimensional D2Q9 Lattice Boltzmann method (LBM) solver [1], or float19 for a three-dimensional D3Q19 one [2]. There we would be left with only the unit float interface of async_work_group_copy_2D2D, and the corresponding address/stride calculation would be a headache.
I'm wondering if we can improve the new async DMA spec to make coding/optimizing such scientific stencil applications easier. For instance: instead of computing the start address of each sub-image, we would always pass in the original buffer pointer plus the position index (i, j) of the sub-block to be copied, so the developer reasons in terms of pixels, not bytes or gentypes. The async API, taking an extra num_gentype_per_pixel argument for example, would then manage to jump to the correct address and copy the right amount of underlying data.
Below is a generic 2D2D copy with the parameters I have in mind:
[1] https://www.researchgate.net/profile/Muhammad-Abdul-Basit/publication/287166894_Lattice_Boltzmann_method_and_its_applications_to_fluid_flow_problems/links/5c3699c892851c22a368bf94/Lattice-Boltzmann-method-and-its-applications-to-fluid-flow-problems.pdf
[2] https://www.sciencedirect.com/science/article/pii/S0898122111001064