haasn closed this issue 6 years ago
Yes, it's also something I want to do, as I mentioned here. However, even with the gradient calculation shared, we still need to fetch all 9 samples to apply the gaussian kernel, with two shmem fetches (instead of one) per sample. The reduced gradient calculation, on the other hand, is actually very fast to compute. The performance still needs to be benchmarked, but I don't expect the change to improve it much.
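For reference, a minimal CPU sketch of what "apply the gaussian kernel on all 9 samples" looks like — this assumes a Sobel-style separable construction (a [1, 2, 1] gaussian smoothing kernel times a [-1, 0, 1] difference kernel); the actual RAVU kernel weights may differ:

```python
import numpy as np

# Hypothetical 3x3 gradient estimate: gaussian smoothing in one axis,
# central difference in the other (Sobel-style, normalized).
SMOOTH = np.array([1.0, 2.0, 1.0]) / 4.0
DIFF = np.array([-1.0, 0.0, 1.0]) / 2.0

def gradient_3x3(patch):
    """Estimate (gx, gy) from a 3x3 patch; touches all 9 samples."""
    assert patch.shape == (3, 3)
    # gx: smooth across rows, differentiate across columns.
    gx = SMOOTH @ patch @ DIFF
    # gy: differentiate across rows, smooth across columns.
    gy = DIFF @ patch @ SMOOTH
    return gx, gy

patch = np.tile(np.arange(3.0), (3, 1))  # intensity ramp left-to-right
gx, gy = gradient_3x3(patch)             # gx = 1.0, gy = 0.0
```

The point of the comment above is that even if each thread reads these 9 samples from shmem rather than from the texture, the 9 reads themselves don't go away — only the arithmetic would be shared.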
ravu-lite
Simplified the gradient calculation; the expected performance gain is now further reduced.
Depending on shmem requirements, it might be possible to share the gradients as well. Basically, instead of just sampling the input texture once per thread, sample the entire 4x4 quad (using textureGather) and store both the local sample and the gradients of its neighbours into the shmem arrays.
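The sharing scheme being proposed can be simulated on the CPU — this is a hypothetical sketch, not the actual shader: a first pass fills a shared array with one gradient per pixel (in the real shader, each thread would also cover the workgroup border via textureGather-style fetches), after which every thread reads its neighbours' gradients from shared memory instead of recomputing them from 9 samples each:

```python
import numpy as np

def local_gradient(img, y, x):
    """Central-difference gradient at (y, x); placeholder for the real kernel."""
    gx = (img[y, x + 1] - img[y, x - 1]) / 2.0
    gy = (img[y + 1, x] - img[y - 1, x]) / 2.0
    return np.array([gx, gy])

def shared_pass(img):
    """Compute every pixel's gradient exactly once, as if stored in shmem."""
    h, w = img.shape
    shmem = np.zeros((h, w, 2))
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            shmem[y, x] = local_gradient(img, y, x)
    return shmem

rng = np.random.default_rng(0)
img = rng.random((8, 8))
shmem = shared_pass(img)

# A consumer "thread" at (3, 3) can now read a neighbour's gradient
# straight from the shared array; it matches direct recomputation.
assert np.allclose(shmem[3, 4], local_gradient(img, 3, 4))
```

The trade-off discussed above is that this replaces per-thread gradient arithmetic (cheap) with extra shmem traffic and storage, which is why the benefit is expected to be small.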