AdvancedPhotonSource / tike

Repository for ptychography software
http://tike.readthedocs.io
Other
29 stars 15 forks source link

Cuda kernels are written inefficiently #81

Closed nikitinvv closed 1 year ago

nikitinvv commented 4 years ago

Cuda kernels in usfft.cu, convolution.cu are not optimal.

They are heavy and use a huge number of registers which may significantly slowdown the code. They should be split into smaller ones.

  1. If there exist 'if,else' statement like for fwd/adj operator then it is better to split it into 2, like it was at the beginning.

  2. Loops with many iterations inside the kernels should be avoided by adding new threads, this way allowing the scheduler to switch threads more efficiently, like it was at the beginning

  3. Non-sequential and uncoallesced memory access in the loops make the code slower, the following loop structure is unacceptable from cuda optimization point of view. It also looks unreadable.

    // for each image
    for (int ti = blockIdx.z; ti < nimage; ti += gridDim.z) {
    // for each scan position
    for (int ts = blockIdx.y; ts < nscan; ts += gridDim.y) {
      ....
      // for x,y coords in patch
      for (int py = blockIdx.x; py < patch_shape; py += gridDim.x) {
        for (int px = threadIdx.x; px < patch_shape; px += blockDim.x) {

    Use 3d thread blocks and grids associated with each array dimension, like it was at the beginning. This will give you natural coalesced memory access.

  4. Block size should be a power of 2, optimal combinations (1024,1,1), (256,4,4), (32,32,1), (16,16,4).. What we have in the code:

    block = (min(self.scatter_kernel.max_threads_per_block, (2 * m)**3),)

    Say m=3, then block = 216 - not optimal.

Would It be better to return to my initial kernels implementations?

carterbox commented 4 years ago

Related to https://github.com/tomography/tike/pull/69#pullrequestreview-421461953

carterbox commented 4 years ago
for (int py = blockIdx.x; py < patch_shape; py += gridDim.x) {
        for (int px = threadIdx.x; px < patch_shape; px += blockDim.x) {

3.

carterbox commented 4 years ago
  1. We can make the block-size always a power of 2. We just need to write a function that finds the next power of two after or equal to a number.