Hi @mn416, this is the optimised implementation of the stencil code that uses one thread per output element. I called it "blocked" because each group of 64 warps computes a block of the output buffer. It also accesses global memory in an aligned way. I am happy to rename it or make changes.