Reading about performance tuning for CUDA [1,2], it seems that the simple partition function is not ideal.
Rather than scheduling one thread per q value, we ought to schedule one thread per core, with a small multiplier to hide latency. Within this scheme each thread steps through the array by the grid stride until it indexes past the end. The following gets close to PyTorch performance on a simple a+b vector addition:
def cuda_partition(n, strides=10):
    """
    Construct block and grid arguments for an *n* element vector.

    For efficiency the partition schedules several threads per core (the
    *overschedule* factor below) so that the scheduler can hide memory
    latency.  The kernel is expected to walk the vector with a grid-stride
    loop::

        const int stride = blockDim.x * gridDim.x;
        int k = blockDim.x * blockIdx.x + threadIdx.x;
        while (k < n) {
            ... process k ...
            k += stride;
        }

    *strides=k* says that each thread should take at least *k* strides
    through the array.  Striding over the vector is faster than allocating
    a separate block for each vector element.  The grid will be at minimum
    large enough to fill one streaming multiprocessor and at maximum large
    enough to fill all of them.  Use *strides=0* to suppress striding and
    allocate a separate thread for each vector element.
    """
    # Use strides instead of increasing the number of blocks
    # https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    # Use several warps per processor to hide latency
    # https://docs.nvidia.com/cuda/turing-tuning-guide/index.html
    assert strides >= 0
    # Processor layout for the RTX 2080 Ti
    # TODO: ask CUDA for these numbers
    available_processors = 68
    cores_per_processor = 64
    threads_per_warp = 32
    overschedule = 8  # to hide latency start more threads than cores
    warps_per_processor = overschedule*cores_per_processor//threads_per_warp
    #processor_registers = 2**16  # 4 byte registers
    #processor_memory = 2**16  # additional shared memory (?)
    if strides == 0:
        # No striding: one thread per element, one warp per block.
        return dict(grid=1+(n-1)//threads_per_warp, block=threads_per_warp)
    min_stride = warps_per_processor*threads_per_warp
    max_stride = min_stride*available_processors
    # Each thread takes at least *strides* strides, or more if the whole
    # device cannot cover the vector in *strides* passes.
    num_strides = max(strides, 1+(n-1)//max_stride)
    # Number of strides needed if only one processor is used.
    processor_blocks = 1 + (n-1)//min_stride
    # Divvy the work over the minimum number of processors for the given strides.
    num_processors = 1 + (processor_blocks-1)//num_strides
    # The grid is the number of processors times warps per processor, unless
    # n is tiny, in which case use just enough warps to cover the vector in
    # *strides* strides.
    blocks = (
        num_processors * warps_per_processor if num_processors > 2
        else 1+(n-1)//(strides*threads_per_warp))
    return dict(grid=blocks, block=threads_per_warp)
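As a sanity check (using the RTX 2080 Ti numbers hard-coded above), a million-element vector partitions into 1088 one-warp blocks, so each thread takes roughly 29 strides through the array:

    >>> cuda_partition(1_000_000)
    {'grid': 1088, 'block': 32}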
The vector addition itself uses the following *add* kernel:
from numba import cuda

@cuda.jit
def add(x, y, out):
    stride = cuda.gridsize(1)   # total number of threads in the grid
    k = cuda.grid(1)            # this thread's starting index
    while k < x.size:
        out[k] = x[k] + y[k]
        k += stride
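A minimal sketch of wiring the two together (the array contents and size here are only for illustration):

import numpy as np
from numba import cuda

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = cuda.device_array(n, dtype=np.float32)

p = cuda_partition(n)
add[p['grid'], p['block']](cuda.to_device(a), cuda.to_device(b), out)
assert np.allclose(out.copy_to_host(), a + b)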
This may be less relevant in sasmodels since there is more work done per processor. There are also other potential efficiency gains, such as spreading the orientation and polydispersity loops across processors, that may be more valuable.
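A rough sketch of what that might look like (this is not the sasmodels kernel, and the per-point model evaluation is a placeholder): a 2D grid-stride loop could assign one grid axis to the q values and the other to the polydispersity points, accumulating the weighted sum with an atomic add.

from numba import cuda

@cuda.jit
def pd_sum(q, pd_value, pd_weight, out):
    # Axis 0 of the grid strides over q points, axis 1 over polydispersity points.
    i, j0 = cuda.grid(2)
    stride_i, stride_j = cuda.gridsize(2)
    while i < q.size:
        j = j0
        while j < pd_value.size:
            # Placeholder for the per-point model evaluation Iq(q[i], pd_value[j]).
            iq = q[i] * pd_value[j]
            # Several threads may update out[i], so accumulate atomically.
            cuda.atomic.add(out, i, pd_weight[j] * iq)
            j += stride_j
        i += stride_i

# Launch with a 2D grid, e.g. pd_sum[(grid_x, grid_y), (32, 4)](q_d, v_d, w_d, out_d),
# with out_d zeroed beforehand.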
[1] https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
[2] https://docs.nvidia.com/cuda/turing-tuning-guide/index.html