bkloppenborg closed 11 years ago
Ignoring the algorithmic issues with this kernel (we should be using a non-uniform fast Fourier transform), the principal way to speed up the kernel is by increasing concurrency. In dae3eb82c34e37ea9f196d3d6f92496cb6e12892 we traded off using a local register for shared memory and increased the occupancy from 75% to 100%. This led to a 33% improvement in performance on my ATI HD 7630m card.
100% occupancy achieved for ATI card in 1d59bfdc530f89f8e8cb3174f09f911c3e84ded4. This is as good as it gets.
Right now the local execution size is hard coded to 128 units. On newer GPUs, this limit can be increased. The function
should determine group sizes automatically by querying the OpenCL context for its capabilities. The kernel itself will need to be modified to ensure it doesn't read/write from/to invalid memory locations.
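A minimal sketch of the bounds handling this would need, written as plain host-side C so it runs without an OpenCL device. The names `padded_global_size` and `run_guarded` are hypothetical and not part of this code base. In the real host code the local size would presumably come from `clGetKernelWorkGroupInfo` with `CL_KERNEL_WORK_GROUP_SIZE` rather than the hard-coded 128; the loop below only emulates what each work item would do.

```c
#include <stddef.h>

/* Hypothetical helper: round the number of work items up to a whole
 * number of work-groups, since the global size is normally launched
 * as a multiple of the local size. */
size_t padded_global_size(size_t n_items, size_t local_size)
{
    return ((n_items + local_size - 1) / local_size) * local_size;
}

/* Hypothetical host-side stand-in for a guarded kernel body.
 * Each loop iteration plays the role of one work item; the padding
 * items past n_items skip the write. Returns how many items wrote. */
size_t run_guarded(float *out, size_t n_items, size_t local_size)
{
    size_t global_size = padded_global_size(n_items, local_size);
    size_t written = 0;

    for (size_t gid = 0; gid < global_size; ++gid) {
        /* In the kernel this would be a check on get_global_id(0). */
        if (gid >= n_items)
            continue;
        out[gid] = (float)gid;  /* placeholder for the real DFT work */
        ++written;
    }
    return written;
}
```

One caveat with this approach: if the kernel uses `barrier()` around its local-memory loads, the padding work items cannot simply return early, because every work item in a group has to reach the barrier. In that case the guard would need to wrap only the global reads and writes.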
The DFT kernel occupies 77% of the GPU's time, so this should be regarded as a high priority item.