Reading about performance tuning for CUDA [1,2], it seems that the simple partition function is not ideal.
Rather than scheduling one thread per q value, we ought to schedule one thread per core, with a small multiplier to hide latency. Within this scheme each thread steps through the array by the grid stride until it indexes past the end. The following gets close to PyTorch performance on a simple a+b vector addition:
def cuda_partition(n, strides=10):
    """
    Construct block and grid arguments for an *n* element vector.

    For efficiency the partition schedules several threads per core (the
    *overschedule* factor below) so that the scheduler can hide memory
    latency.  The kernel is expected to walk the vector with a grid-stride
    loop::

        const int stride = blockDim.x * gridDim.x;
        int k = blockDim.x * blockIdx.x + threadIdx.x;
        while (k < n) {
            ... process k ...
            k += stride;
        }

    *strides=k* says that each thread should take at least *k* strides
    through the array.  Striding over the vector is faster than allocating
    a separate block for each vector element.  The grid will be at minimum
    large enough to fill one streaming multiprocessor and at maximum large
    enough to fill all of them.  Use *strides=0* to suppress striding and
    allocate a separate thread for each vector element.
    """
    # Use strides instead of increasing the number of blocks
    # https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    # Use several warps per processor to hide latency
    # https://docs.nvidia.com/cuda/turing-tuning-guide/index.html
    assert strides >= 0
    # Processor layout for the RTX 2080 Ti
    # TODO: ask CUDA for these numbers
    available_processors = 68
    cores_per_processor = 64
    threads_per_warp = 32
    overschedule = 8  # to hide latency start more threads than cores
    warps_per_processor = overschedule*cores_per_processor//threads_per_warp
    #processor_registers = 2**16  # 4 byte registers
    #processor_memory = 2**16  # additional shared memory (?)
    if strides == 0:
        # No striding: one thread per element, one warp per block.
        return dict(grid=1+(n-1)//threads_per_warp, block=threads_per_warp)
    min_stride = warps_per_processor*threads_per_warp
    max_stride = min_stride*available_processors
    # Each thread takes at least *strides* strides, or more if the whole
    # device cannot cover the vector in *strides* passes.
    num_strides = max(strides, 1+(n-1)//max_stride)
    # Number of strides needed if only one processor is used.
    processor_blocks = 1 + (n-1)//min_stride
    # Divvy the work over the minimum number of processors for the given strides.
    num_processors = 1 + (processor_blocks-1)//num_strides
    # The grid is the number of processors times warps per processor, unless
    # n is tiny, in which case use just enough warps to cover the vector in
    # *strides* strides.
    blocks = (
        num_processors * warps_per_processor if num_processors > 2
        else 1+(n-1)//(strides*threads_per_warp))
    return dict(grid=blocks, block=threads_per_warp)
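As a sanity check (using the RTX 2080 Ti numbers hard-coded above), a million-element vector partitions into 1088 one-warp blocks, so each thread takes roughly 29 strides through the array:

    >>> cuda_partition(1_000_000)
    {'grid': 1088, 'block': 32}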
The vector addition itself uses the following *add* kernel:
from numba import cuda

@cuda.jit
def add(x, y, out):
    stride = cuda.gridsize(1)   # total number of threads in the grid
    k = cuda.grid(1)            # this thread's starting index
    while k < x.size:
        out[k] = x[k] + y[k]
        k += stride
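A minimal sketch of wiring the two together (the array contents and size here are only for illustration):

import numpy as np
from numba import cuda

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = cuda.device_array(n, dtype=np.float32)

p = cuda_partition(n)
add[p['grid'], p['block']](cuda.to_device(a), cuda.to_device(b), out)
assert np.allclose(out.copy_to_host(), a + b)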
This may be less relevant in sasmodels since there is more work done per processor. There are also other potential efficiency gains, such as spreading the orientation and polydispersity loops across processors, that may be more valuable.
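A rough sketch of what that might look like (this is not the sasmodels kernel, and the per-point model evaluation is a placeholder): a 2D grid-stride loop could assign one grid axis to the q values and the other to the polydispersity points, accumulating the weighted sum with an atomic add.

from numba import cuda

@cuda.jit
def pd_sum(q, pd_value, pd_weight, out):
    # Axis 0 of the grid strides over q points, axis 1 over polydispersity points.
    i, j0 = cuda.grid(2)
    stride_i, stride_j = cuda.gridsize(2)
    while i < q.size:
        j = j0
        while j < pd_value.size:
            # Placeholder for the per-point model evaluation Iq(q[i], pd_value[j]).
            iq = q[i] * pd_value[j]
            # Several threads may update out[i], so accumulate atomically.
            cuda.atomic.add(out, i, pd_weight[j] * iq)
            j += stride_j
        i += stride_i

# Launch with a 2D grid, e.g. pd_sum[(grid_x, grid_y), (32, 4)](q_d, v_d, w_d, out_d),
# with out_d zeroed beforehand.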
[1] https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
[2] https://docs.nvidia.com/cuda/turing-tuning-guide/index.html