Try using GL_NV_shader_thread_shuffle

https://www.khronos.org/registry/OpenGL/extensions/NV/NV_shader_thread_shuffle.txt

I'm not sure whether AMD or Intel implement this extension too. If they don't, it's probably not worth trying.

In theory, this would let us share samples directly between threads in the same warp without going through shared memory (shmem), which should be even faster. I believe the required change is essentially to rewrite the code that loads the samples (float lumaNN = ...) so that they are loaded in groups of 32: each thread loads one value, then uses the warp exchange primitives to shuffle values directly with the other 31 threads.
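A minimal sketch of what this could look like, not the actual mpv-prescalers code. It assumes a compute shader with a 32x1 workgroup that maps onto exactly one warp, with lanes ordered by x (the GLSL spec does not guarantee this). The names `luma_tex`, `out_img` and `fetch_luma` are made up for illustration, and the 3-tap horizontal filter only stands in for the real sample window. The `shuffleUpNV`/`shuffleDownNV` calls follow my reading of the extension spec linked above.

```glsl
#version 450
#extension GL_NV_shader_thread_shuffle : require

// Assumption: one 32x1 workgroup corresponds to one warp, lane == local x.
layout(local_size_x = 32, local_size_y = 1) in;

layout(binding = 0) uniform sampler2D luma_tex;               // hypothetical input
layout(binding = 1, r32f) writeonly uniform image2D out_img;  // hypothetical output

// Clamped texel fetch, used for the initial load and as a fallback at warp edges.
float fetch_luma(ivec2 p) {
    ivec2 max_pos = textureSize(luma_tex, 0) - ivec2(1);
    return texelFetch(luma_tex, clamp(p, ivec2(0), max_pos), 0).r;
}

void main() {
    ivec2 pos = ivec2(gl_GlobalInvocationID.xy);

    // Each thread loads exactly one sample.
    float luma = fetch_luma(pos);

    // Exchange samples with neighbouring lanes. All threads reach these calls
    // before any divergent branch. The out parameter reports whether the
    // source lane exists within the 32-wide segment.
    bool has_left, has_right;
    float left  = shuffleUpNV(luma, 1u, 32u, has_left);    // value from lane - 1
    float right = shuffleDownNV(luma, 1u, 32u, has_right); // value from lane + 1

    // Lanes 0 and 31 have no neighbour inside the warp, so they fetch instead.
    if (!has_left)  left  = fetch_luma(pos + ivec2(-1, 0));
    if (!has_right) right = fetch_luma(pos + ivec2( 1, 0));

    // Placeholder filter. Stores outside the image bounds are discarded.
    float result = 0.25 * left + 0.5 * luma + 0.25 * right;
    imageStore(out_img, pos, vec4(result));
}
```

With a wider sample window, more lanes near the warp boundary need the fallback fetch, so the gain over shared memory would have to be measured.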
