Open ckhfor opened 1 year ago
Sorry for the slow response; been on vacation.
We only use the builtin async_work_group_copy (and the 3D3D variant) when compiling for OpenCL. In the C version a hand-written async_work_group_copy is used, and for OpenCL builds that do not support async_work_group_copy we do something similar.
See the definition of TTL_COPY_3D.
Are you talking about not using async_work_group_copy in the OpenCL environment? If so, then I guess we need to provide some way of redirecting it.
Maybe something like this?
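For illustration only, here is a minimal sketch in plain C of what such a redirect could look like. It is not TTL's actual code: EXAMPLE_USE_BUILTIN_COPY, EXAMPLE_COPY_3D, example_copy_3d and example_builtin_copy_3d are hypothetical names, and the copy is a single-threaded host-side loop. In a real OpenCL kernel the manual path would instead split the lines across work-items and finish with a barrier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical build switch (not a TTL macro): 1 = forward to the builtin,
 * 0 = use the hand-written copy below. */
#ifndef EXAMPLE_USE_BUILTIN_COPY
#define EXAMPLE_USE_BUILTIN_COPY 0
#endif

/* Hand-written 3D strided copy. All sizes and strides are in bytes.
 * Copies num_planes planes of num_lines lines of bytes_per_line bytes each. */
static void example_copy_3d(unsigned char *dst, const unsigned char *src,
                            size_t bytes_per_line, size_t num_lines,
                            size_t num_planes,
                            size_t src_line_stride, size_t src_plane_stride,
                            size_t dst_line_stride, size_t dst_plane_stride)
{
    assert(dst != NULL && src != NULL);
    for (size_t p = 0; p < num_planes; ++p) {
        for (size_t l = 0; l < num_lines; ++l) {
            memcpy(dst + p * dst_plane_stride + l * dst_line_stride,
                   src + p * src_plane_stride + l * src_line_stride,
                   bytes_per_line);
        }
    }
}

#if EXAMPLE_USE_BUILTIN_COPY
/* Assumption: on an OpenCL target, example_builtin_copy_3d would be a thin
 * wrapper around the builtin async copy. It is not defined in this sketch. */
#define EXAMPLE_COPY_3D(...) example_builtin_copy_3d(__VA_ARGS__)
#else
#define EXAMPLE_COPY_3D(...) example_copy_3d(__VA_ARGS__)
#endif
```

Callers would only ever use EXAMPLE_COPY_3D, so a platform where the builtin performs poorly could select the manual path at build time without touching kernel code.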
On the second question, what sort of optimizations? We want to keep it as something that supports a broad church, but obviously, anything that helps we would be happy to try and add.
I learned this from the Adreno GPU optimization manual (https://developer.qualcomm.com/download/adrenosdk/adreno-opencl-programming-guide.pdf?referrer=node/6114): "Avoid using the function called async_work_group_copy. It is often tricky for the compiler to generate the optimal code to load local memory, and better for developers to write code that manually loads data into local memory."
Can you do some special optimization for mobile GPUs? Thanks~~