kloppstock closed this issue 7 years ago
Did you put the asynchronous memcopies into individual streams?
Every stream calculates a block of images. Therefore every stream does (in order):
This should allow multiple kernels to work concurrently in different stages.
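For reference, here is a rough sketch of what I mean by one stream per block of images. It is only an illustration: the per-stream order used here (copy in, kernel, copy out) is my assumption, and `processBlock`, `kStreams` and `kBlockElems` are made-up names and sizes, not anything from the actual code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real per-image computation.
__global__ void processBlock(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main() {
    const int kStreams = 4;           // assumed number of streams
    const int kBlockElems = 1 << 20;  // assumed elements per block of images
    const size_t bytes = kBlockElems * sizeof(float);

    float *hIn, *hOut, *dIn, *dOut;
    // Pinned (page-locked) host memory, so the async copies can overlap with kernels.
    cudaMallocHost((void**)&hIn, kStreams * bytes);
    cudaMallocHost((void**)&hOut, kStreams * bytes);
    cudaMalloc((void**)&dIn, kStreams * bytes);
    cudaMalloc((void**)&dOut, kStreams * bytes);
    for (int i = 0; i < kStreams * kBlockElems; ++i) hIn[i] = 1.0f;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream handles its own block: copy in, compute, copy out.
    // Work inside one stream runs in order; different streams may overlap.
    for (int s = 0; s < kStreams; ++s) {
        const size_t off = (size_t)s * kBlockElems;
        cudaMemcpyAsync(dIn + off, hIn + off, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        processBlock<<<(kBlockElems + 255) / 256, 256, 0, streams[s]>>>(
            dIn + off, dOut + off, kBlockElems);
        cudaMemcpyAsync(hOut + off, dOut + off, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams to finish before touching the results on the host.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("first result: %f\n", hOut[0]);

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return 0;
}
```

If the host buffers are ordinary pageable memory instead of pinned memory, `cudaMemcpyAsync` generally cannot overlap with kernel execution, so that would be the first thing to check.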
Write functions to stream the data to the GPU.