KhronosGroup / OpenCL-TTL

Tensor Tiling Library
Apache License 2.0

Does it support mobile GPUs well? #5

Open ckhfor opened 1 year ago

ckhfor commented 1 year ago

The Adreno OpenCL programming guide (https://developer.qualcomm.com/download/adrenosdk/adreno-opencl-programming-guide.pdf?referrer=node/6114) advises against using async_work_group_copy: it is often tricky for the compiler to generate optimal code to load local memory, so it is better for developers to write code that manually loads data into local memory.
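For reference, a manual load usually means that each work-item copies a strided slice of the tile and the work-group then synchronizes. Below is a minimal plain-C sketch of that access pattern. The name `manual_local_load` and the outer loop over `lid`, which stands in for work-items running in parallel, are illustrative assumptions and are not taken from the Adreno guide or from TTL. In a real OpenCL kernel the outer loop would disappear, `lid` would come from `get_local_id(0)`, and a `barrier(CLK_LOCAL_MEM_FENCE)` would follow the copy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: emulates a work-group of `group_size` work-items
 * cooperatively copying `n` floats from `src` (global memory in a kernel)
 * to `dst` (local memory in a kernel).
 * Work-item `lid` copies elements lid, lid + group_size, lid + 2*group_size, ... */
static void manual_local_load(float *dst, const float *src,
                              size_t n, size_t group_size)
{
    assert(group_size > 0);
    for (size_t lid = 0; lid < group_size; ++lid) {     /* one iteration per work-item */
        for (size_t i = lid; i < n; i += group_size) {  /* this work-item's strided slice */
            dst[i] = src[i];
        }
    }
    /* In a kernel, a barrier(CLK_LOCAL_MEM_FENCE) would go here
     * before any work-item reads dst. */
}
```

Adjacent work-items touch adjacent addresses on each step, which is the pattern that tends to coalesce well on GPUs; whether it beats the builtin on a given device would need measuring.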

Can you do some special optimization for mobile GPUs? Thanks~~

chrisgearing commented 1 year ago

Sorry for the slow response; been on vacation.

We only use the builtin async_work_group_copy(3D3D) when compiling for OpenCL. In the C version a hand-written async_work_group_copy is used, and for OpenCL builds that do not support async_work_group_copy we do something similar.

See the definition of TTL_COPY_3D.

Are you talking about not using async_work_group_copy in the OpenCL environment? If so, I guess we need to provide some way of redirecting the call.

Maybe

    #ifndef HostLocalTransfer
    #define HostLocalTransfer async_work_group_copy3D3D
    #endif

Something like this?
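To make the redirect idea concrete, here is a rough plain-C sketch of how such a macro could be overridden at build time. Only the macro name `HostLocalTransfer` comes from the suggestion above. The function `manual_copy_3d`, its parameter list, and the use of `int` elements are hypothetical, and the default here is the hand-written copy (so the sketch compiles as ordinary C) rather than the OpenCL builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hand-written 3D copy. Shape and strides are in elements:
 * row stride = elements between consecutive rows,
 * plane stride = elements between consecutive planes. */
static void manual_copy_3d(int *dst, const int *src,
                           size_t width, size_t height, size_t depth,
                           size_t src_row_stride, size_t src_plane_stride,
                           size_t dst_row_stride, size_t dst_plane_stride)
{
    assert(dst != NULL && src != NULL);
    for (size_t z = 0; z < depth; ++z) {
        for (size_t y = 0; y < height; ++y) {
            for (size_t x = 0; x < width; ++x) {
                dst[z * dst_plane_stride + y * dst_row_stride + x] =
                    src[z * src_plane_stride + y * src_row_stride + x];
            }
        }
    }
}

/* A user can redirect the transfer by defining HostLocalTransfer before
 * this point (for example with -DHostLocalTransfer=my_copy on the compiler
 * command line); otherwise the fallback below is used. */
#ifndef HostLocalTransfer
#define HostLocalTransfer manual_copy_3d
#endif

/* Call sites name only the macro, so they do not change when the
 * implementation is swapped. Copies a 2x2x2 tile out of a 4x4x2 source. */
static void copy_tile_2x2x2(int *dst, const int *src)
{
    HostLocalTransfer(dst, src, 2, 2, 2, 4, 16, 2, 4);
}
```

Any replacement has to accept the same arguments as the default, so the signature chosen for the default effectively becomes the interface that mobile-specific implementations must match.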

On the second question, what sort of optimizations? We want to keep it as something that supports a broad church, but obviously, anything that helps we would be happy to try and add.