Custom CUDA kernel synchronization

FloopCZ commented 1 year ago

Hi, I have a question regarding custom CUDA kernels and synchronization. I tried to proceed as described in Interoperability with CUDA which states:

Notice that since ArrayFire and your kernels are sharing the same CUDA stream, there is no need to perform any synchronization operations as operations within a stream are executed in order.

My code is more or less as follows:

af::array my_custom_cuda_op(const af::array& input)
{
    // Evaluate input matrix.
    input.eval();
    double* pinput = state.device<double>();

    // Allocate output matrix.
    af::array output{input.dims(), input.type()};
    double* poutput = output.device<double>();

    // Determine ArrayFire's CUDA stream.
    int af_id = af::getDevice();
    int cuda_id = afcu::getNativeId(af_id);
    cudaStream_t af_cuda_stream = afcu::getStream(cuda_id);

    // Call CUDA kernel.
    dim3 block(32, 32);
    dim3 grid(..., ...);
    my_kernel<<<grid, block, shared_bytes, af_cuda_stream>>>(...);

    // Return matrices to ArrayFire.
    input.unlock();
    output.unlock();

    return output;
}

I am using the same stream as ArrayFire, however, the program produces invalid results unless I manually run cudaStreamSynchronize(af_cuda_stream) after launching the kernel. Am I doing something wrong? Thank you.

syurkevi commented 1 year ago

Where is state coming from? Was that meant to be input? just making sure.

  // Evaluate input matrix.
    input.eval();
    double* pinput = state.device<double>();

If it is correct then state should be unlocked instead of input.

FloopCZ commented 1 year ago

Sorry, it is a typo when copying the example, the state is the input (it is called state in the original program). Yes, the function unlocks the input and it does not use any external resources, it is a standalone function utilizing only its parameters and returning the output.

FloopCZ commented 1 year ago

Hi @syurkevi , I believe I found the source of the issue. I noticed that the issue only arises when there are multiple GPUs in the system. In single-gpu systems, it works as expected, which suggested that there may be a problem with the device id or stream id. Peeking at getStream(id) function here gave me the impression that the function expects arrayfire id of the gpu, not the native CUDA id. Indeed, using the arrayfire id instead of the native id fixes the issue. If you confirm my suspicion, we can fix the documentation of CUDA interoperability.

syurkevi commented 1 year ago

Yes, confirming that the id is expected to be the internal ArrayFire id. Our example is wrong. The documentation for getStream does detail the expected id type. The interop.md looks like it was incorrectly changed at some point and should be reverted. Ideally this code should be added to our tests and forwarded to the docs through a snippet.

arrayfire / arrayfire

Custom CUDA kernel synchronization #3404