Adding multikernel example

This test demonstrates launching multiple kernels and synchronizing between them in a very primitive way. The generic kernel takes an input/output buffer and a synchronization location; it modifies the buffer, then synchronizes at the barrier when done.
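
For reviewers who want the pattern at a glance, here is a minimal host-side simulation in Python. It is not the CUDA-lite or bsg_replicant API: threads stand in for kernels, and `generic_kernel`, `run_multikernel`, and the one-element `sync` list are hypothetical names chosen for illustration. Each "kernel" modifies its own slice of a shared buffer, then bumps the synchronization location and waits until every kernel has arrived.

```python
import threading


def generic_kernel(kernel_id, buf, start, end, sync, cond, num_kernels):
    """Hypothetical stand-in for one kernel: modify a slice, then synchronize."""
    # Modify this kernel's slice of the shared input/output buffer.
    for i in range(start, end):
        buf[i] += kernel_id + 1

    # Primitive barrier: record arrival at the synchronization location,
    # then wait until all kernels have recorded theirs.
    with cond:
        sync[0] += 1
        cond.notify_all()
        cond.wait_for(lambda: sync[0] >= num_kernels)


def run_multikernel(num_kernels=2, elems_per_kernel=4):
    """Launch all kernels concurrently and return the buffer and arrival count."""
    buf = [0] * (num_kernels * elems_per_kernel)
    sync = [0]  # shared synchronization location (assumed to be a counter)
    cond = threading.Condition()

    threads = []
    for k in range(num_kernels):
        start = k * elems_per_kernel
        end = start + elems_per_kernel
        t = threading.Thread(
            target=generic_kernel,
            args=(k, buf, start, end, sync, cond, num_kernels),
        )
        threads.append(t)

    for t in threads:
        t.start()
    for t in threads:
        t.join()

    return buf, sync[0]


if __name__ == "__main__":
    result, arrived = run_multikernel()
    print(result, arrived)
```

Because each kernel writes only its own slice, the buffer contents do not depend on scheduling order; only the counter is shared, and it is updated under the condition's lock.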

It turns out this just works with the existing CUDA-lite infrastructure, which is great. The example still shows users how to do it and doubles as a regression test. I would appreciate any style feedback or requests for more comments.

bespoke-silicon-group / bsg_replicant

Adding multikernel example #806