This test demonstrates launching multiple kernels and synchronizing between them in an extremely primitive way. The generic kernel accepts an input/output buffer and a synchronization location. It modifies the buffer and then synchronizes on the barrier at that location when done.
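For reviewers who want the pattern at a glance, here is a rough host-side sketch in Python. It is not the test code and does not use the CUDA-lite API: threads stand in for kernels, and `generic_kernel`, `launch_and_wait`, and the counter-style barrier are hypothetical names and assumptions made only for illustration.

```python
import threading
import time


def generic_kernel(kernel_id, buf, start, stop, sync, lock):
    """Hypothetical stand-in for the generic kernel (not the CUDA-lite API)."""
    # Step 1: modify this kernel's slice of the shared input/output buffer.
    for i in range(start, stop):
        buf[i] += kernel_id + 1
    # Step 2: signal completion at the synchronization location.
    # Assumption: the barrier is modeled as a simple arrival counter.
    with lock:
        sync[0] += 1


def launch_and_wait(num_kernels=4, elems_per_kernel=8):
    """Launch several 'kernels', then poll the sync location until all arrive."""
    buf = [0] * (num_kernels * elems_per_kernel)
    sync = [0]  # one-element list plays the role of the synchronization location
    lock = threading.Lock()

    threads = []
    for k in range(num_kernels):
        start = k * elems_per_kernel
        stop = start + elems_per_kernel
        t = threading.Thread(
            target=generic_kernel, args=(k, buf, start, stop, sync, lock)
        )
        threads.append(t)
        t.start()

    # Primitive synchronization: spin on the counter until every kernel arrived.
    while True:
        with lock:
            if sync[0] == num_kernels:
                break
        time.sleep(0.001)

    for t in threads:
        t.join()
    return buf, sync[0]


if __name__ == "__main__":
    result, arrived = launch_and_wait()
    print(arrived, result)
```

Each kernel writes only its own slice, so the buffer needs no locking; only the shared counter does. The real test may partition the buffer or implement the barrier differently.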
It turns out this just works with the existing CUDA-lite infrastructure, which is great. The example still earns its place: it shows users how to do it and doubles as a regression test. I'd appreciate any style feedback or requests for additional comments.