3. The grid-stride loop

for (int py = blockIdx.x; py < patch_shape; py += gridDim.x) {
for (int px = threadIdx.x; px < patch_shape; px += blockDim.x) {

is equivalent to

int py = blockIdx.x;
if (py >= patch_shape) return;
int px = threadIdx.x;
if (px >= patch_shape) return;

when gridDim.x >= patch_shape and blockDim.x >= patch_shape. The difference is that grid-stride loops will execute correctly for any block and grid shape; you can still tune the kernel launches by adjusting the block and grid shapes, and you do not need an if (...) return; guard statement if the problem size is smaller than the grid.
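For reference, a minimal self-contained sketch of a kernel written in that style; the kernel name scale_patch and the flat patch buffer are illustrative assumptions, not code from usfft.cu or convolution.cu:

__global__ void scale_patch(float *patch, int patch_shape, float factor)
{
    // Grid-stride loops: correct for any gridDim.x / blockDim.x. Threads and
    // blocks beyond the problem size simply skip the loop bodies, so no
    // early-return guard is needed.
    for (int py = blockIdx.x; py < patch_shape; py += gridDim.x) {
        for (int px = threadIdx.x; px < patch_shape; px += blockDim.x) {
            patch[py * patch_shape + px] *= factor;
        }
    }
}

// Example launch; the block and grid sizes are tunable and need not match patch_shape:
// scale_patch<<<64, 128>>>(d_patch, patch_shape, 0.5f);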
CUDA kernels in usfft.cu, convolution.cu are not optimal.
They are heavy and use a huge number of registers, which may significantly slow down the code. They should be split into smaller ones.
If there is an 'if, else' statement, as for the fwd/adj operator, it is better to split the kernel into two, like it was at the beginning.
Loops with many iterations inside the kernels should be avoided by adding new threads, this way allowing the scheduler to switch threads more efficiently, like it was at the beginning.
Non-sequential and uncoalesced memory access in the loops makes the code slower; the following loop structure is unacceptable from a CUDA optimization point of view, and it also looks unreadable:

// for each image
for (int ti = blockIdx.z; ti < nimage; ti += gridDim.z) {
// for each scan position
for (int ts = blockIdx.y; ts < nscan; ts += gridDim.y) {

Use 3D thread blocks and grids associated with each array dimension, like it was at the beginning. This will give you natural coalesced memory access.
Block size should be a power of 2; optimal combinations are (1024,1,1), (256,4,4), (32,32,1), (16,16,4), etc. What we have in the code: say m=3, then block = 216, which is not a power of 2 and therefore not optimal (see the launch sketch below).
Would it be better to return to my initial kernel implementations?
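As a point of comparison, a minimal sketch of the launch style described above (3D thread blocks mapped to array dimensions, power-of-2 block shape); the kernel name scale_patches and the patch-stack layout are illustrative assumptions, not code from the repository:

__global__ void scale_patches(float *patches, int npatch, int patch_shape, float factor)
{
    // One thread per pixel; threadIdx.x walks the contiguous axis, so adjacent
    // threads in a warp touch adjacent addresses (coalesced access).
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    int ip = blockIdx.z;  // patch index
    if (px >= patch_shape || py >= patch_shape || ip >= npatch) return;
    size_t i = ((size_t)ip * patch_shape + py) * patch_shape + px;
    patches[i] *= factor;
}

// A power-of-2 block such as (32,32,1) is a whole number of 32-thread warps;
// a 216-thread block (the m=3 case above) is 6.75 warps, so the last warp of
// every block is only partially filled.
// dim3 block(32, 32, 1);
// dim3 grid((patch_shape + 31) / 32, (patch_shape + 31) / 32, npatch);
// scale_patches<<<grid, block>>>(d_patches, npatch, patch_shape, factor);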