NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[FEA]: Cache cuda.parallel builds #2590

Open gevtushenko opened 4 days ago

gevtushenko commented 4 days ago

Is this a duplicate?

Area

cuda.parallel (Python)

Is your feature request related to a problem? Please describe.

The cuda.parallel module contains a time-consuming JIT compilation (build) step. For reduction, this step invokes the following C API: https://github.com/NVIDIA/cccl/blob/70a2872bcd6d18e926717abd5739851f28754476/c/parallel/include/cccl/c/reduce.h#L33

On the Python end, this step returns the following structure: https://github.com/NVIDIA/cccl/blob/70a2872bcd6d18e926717abd5739851f28754476/python/cuda_parallel/cuda/parallel/experimental/__init__.py#L177-L184

which contains a cubin (akin to an object file) along with pointers to the kernels in that cubin. User code usually contains more than one invocation of an algorithm with matching parameter types.

Describe the solution you'd like

We should amortize the JIT cost by caching _CCCLDeviceReduceBuildResult based on the parameters that affect C++ codegen. For instance, for reduce_into = cudax.reduce_into(d_input, d_output, op, h_init), the codegen-relevant parameters are the input, output, and init data types and the operator itself; inputs that differ in any of these should map to different cache entries.

Let's say we build two reductions as follows:

import numpy
from numba import cuda

import cuda.parallel.experimental as cudax

def op(a, b):
    return a if a < b else b

dtype = numpy.int32
h_init = numpy.array([42], dtype)
h_input = numpy.array([8, 6, 7, 5, 3, 0, 9], dtype)
d_output = cuda.device_array(1, dtype)
d_input = cuda.to_device(h_input)

red1 = cudax.reduce_into(d_input, d_output, op, h_init)
# Identical codegen-relevant parameters: this second build should hit the cache.
red2 = cudax.reduce_into(d_input, d_output, op, h_init)

Since the parameters described above match between invocations of cudax.reduce_into, the second call should not lead to an invocation of extern "C" CCCL_C_API CUresult cccl_device_reduce_build.
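
Below is a minimal sketch of what such Python-side memoization could look like (the helper names and key layout are assumptions for illustration, not the actual cuda.parallel implementation): derive a key from the codegen-relevant parameters and reuse the build result whenever the key matches.

# Hypothetical sketch: cache build results keyed on codegen-relevant parameters.
# None of these names exist in cuda.parallel; build_fn stands in for the code
# path that ultimately calls cccl_device_reduce_build.

_build_cache = {}

def _make_key(d_in, d_out, op, h_init):
    # Assumed key components: operand dtypes plus the operator object itself.
    # Keying on the function object only hits when the very same Python
    # function is reused; a content-based key is discussed further below.
    return (d_in.dtype.str, d_out.dtype.str, h_init.dtype.str, op)

def cached_reduce_build(d_in, d_out, op, h_init, build_fn):
    # Return a cached build result, invoking build_fn only on a cache miss.
    key = _make_key(d_in, d_out, op, h_init)
    if key not in _build_cache:
        _build_cache[key] = build_fn(d_in, d_out, op, h_init)
    return _build_cache[key]

With something like this in place, the second cudax.reduce_into call above would resolve to the cached _CCCLDeviceReduceBuildResult instead of re-invoking the C build API.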

Describe alternatives you've considered

No response

Additional context

No response

gevtushenko commented 4 days ago

@leofang might have some input on how we should approach caching (on-disk storage etc.)

leofang commented 4 days ago

Yeah, in cuda.core we'll offer various caches at the Python level, see https://github.com/NVIDIA/cuda-python/issues/176 (can't link the internal design doc here, but we know where to find it 🙂). Not sure if the CCCL C library can easily hook up with a Python-based cache, though.

gevtushenko commented 4 days ago

Not sure if the CCCL C library can easily hook up with a Python-based cache, though.

@leofang we could consider caching on the C++ end. For now, though, I was thinking about caching on the Python side of cuda.parallel instead, if that makes things any easier.

leofang commented 4 days ago

Yeah, then cuda.core should meet your need. You just need to figure out a way to generate a stable hash key for the items you listed.
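
One possible sketch of such a stable key (illustrative only, not a cuda.core or cuda.parallel API): hash the operand dtype strings together with the operator's bytecode, so that textually identical operators map to the same entry.

import hashlib

def stable_build_key(d_in, d_out, op, h_init):
    # Illustrative only: combine operand dtypes with the operator's bytecode.
    h = hashlib.sha256()
    for part in (d_in.dtype.str, d_out.dtype.str, h_init.dtype.str):
        h.update(part.encode())
    # op.__code__.co_code does not capture closed-over values or globals, and
    # bytecode changes across Python versions, so an on-disk cache would also
    # need to fold interpreter and library versions into the key.
    h.update(op.__code__.co_code)
    return h.hexdigest()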