Imagine we read an RGB image with three uchars per pixel (uchar[3]). All pixels of the image are packed into a cl_mem buffer. We then want to use async_work_group_copy_2D2D() to optimize the memory transfer between __global and __local.
The point is:
If we use the uchar3 vector type, the OpenCL 1.2 spec states that "async_work_group_copy and async_work_group_strided_copy for 3-component vector types behave as async_work_group_copy and async_work_group_strided_copy respectively for 4-component vector types", probably because 3-component vectors are aligned like 4-component ones. As a result, we end up doing uchar4-like pointer arithmetic on a genuinely packed uchar[3] buffer, which turns out to be very error-prone.
This drawback can be avoided by falling back to the unit uchar interface and multiplying the associated num_elements_per_line and src/dst strides by three. But this adds verbosity (and complexity) to the kernel code.
Moreover, in other stencil codes the cell composition matches none of the vector types specified in the spec: for example, float9 for a two-dimensional D2Q9 Lattice Boltzmann method (LBM) solver [1], or float19 for a three-dimensional D3Q19 one [2]. There we would be left with only the unit float interface of async_work_group_copy_2D2D, and the corresponding address/stride calculation would be a headache.
I'm wondering if we can improve the new async DMA spec to make coding/optimizing such scientific stencil applications easier. For instance: instead of computing the start address of each sub-image, we would always pass in the original buffer pointer plus the position index (i, j) of the sub-block to be copied, so the developer reasons in terms of pixels, not bytes or gentypes. The async API, taking an extra num_gentype_per_pixel argument for example, would then manage to jump to the correct address and copy the right amount of underlying data.
Below is a generic 2D2D copy with the parameters I have in mind:
[1] https://www.researchgate.net/profile/Muhammad-Abdul-Basit/publication/287166894_Lattice_Boltzmann_method_and_its_applications_to_fluid_flow_problems/links/5c3699c892851c22a368bf94/Lattice-Boltzmann-method-and-its-applications-to-fluid-flow-problems.pdf
[2] https://www.sciencedirect.com/science/article/pii/S0898122111001064