3. The grid-stride loop

for (int py = blockIdx.x; py < patch_shape; py += gridDim.x) {
for (int px = threadIdx.x; px < patch_shape; px += blockDim.x) {

is equivalent to

int py = blockIdx.x;
if (py >= patch_shape) return;
int px = threadIdx.x;
if (px >= patch_shape) return;

when gridDim.x >= patch_shape and blockDim.x >= patch_shape. The difference is that grid-stride loops will execute correctly for any block and grid shape; you can still tune the kernel launches by adjusting the block and grid shapes, and you do not need an if (...) return; guard statement if the problem size is smaller than the grid.
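For reference, a minimal self-contained sketch of a kernel written in that style; the kernel name scale_patch and the flat patch buffer are illustrative assumptions, not code from usfft.cu or convolution.cu:

__global__ void scale_patch(float *patch, int patch_shape, float factor)
{
    // Grid-stride loops: correct for any gridDim.x / blockDim.x. Threads and
    // blocks beyond the problem size simply skip the loop bodies, so no
    // early-return guard is needed.
    for (int py = blockIdx.x; py < patch_shape; py += gridDim.x) {
        for (int px = threadIdx.x; px < patch_shape; px += blockDim.x) {
            patch[py * patch_shape + px] *= factor;
        }
    }
}

// Example launch; the block and grid sizes are tunable and need not match patch_shape:
// scale_patch<<<64, 128>>>(d_patch, patch_shape, 0.5f);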
CUDA kernels in usfft.cu, convolution.cu are not optimal.
They are heavy and use a huge number of registers, which may significantly slow down the code. They should be split into smaller ones.
If there is an 'if, else' statement, as for the fwd/adj operator, it is better to split the kernel into two, like it was at the beginning.
Loops with many iterations inside the kernels should be avoided by adding new threads, this way allowing the scheduler to switch threads more efficiently, like it was at the beginning.
Non-sequential and uncoalesced memory access in the loops makes the code slower; the following loop structure is unacceptable from a CUDA optimization point of view, and it also looks unreadable:

// for each image
for (int ti = blockIdx.z; ti < nimage; ti += gridDim.z) {
// for each scan position
for (int ts = blockIdx.y; ts < nscan; ts += gridDim.y) {

Use 3D thread blocks and grids associated with each array dimension, like it was at the beginning. This will give you natural coalesced memory access.
Block size should be a power of 2; optimal combinations are (1024,1,1), (256,4,4), (32,32,1), (16,16,4), etc. What we have in the code: say m=3, then block = 216, which is not a power of 2 and therefore not optimal (see the launch sketch below).
Would it be better to return to my initial kernel implementations?
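As a point of comparison, a minimal sketch of the launch style described above (3D thread blocks mapped to array dimensions, power-of-2 block shape); the kernel name scale_patches and the patch-stack layout are illustrative assumptions, not code from the repository:

__global__ void scale_patches(float *patches, int npatch, int patch_shape, float factor)
{
    // One thread per pixel; threadIdx.x walks the contiguous axis, so adjacent
    // threads in a warp touch adjacent addresses (coalesced access).
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    int ip = blockIdx.z;  // patch index
    if (px >= patch_shape || py >= patch_shape || ip >= npatch) return;
    size_t i = ((size_t)ip * patch_shape + py) * patch_shape + px;
    patches[i] *= factor;
}

// A power-of-2 block such as (32,32,1) is a whole number of 32-thread warps;
// a 216-thread block (the m=3 case above) is 6.75 warps, so the last warp of
// every block is only partially filled.
// dim3 block(32, 32, 1);
// dim3 grid((patch_shape + 31) / 32, (patch_shape + 31) / 32, npatch);
// scale_patches<<<grid, block>>>(d_patches, npatch, patch_shape, factor);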