Currently we synchronize from the host after every kernel, which is unlikely to be competitive with standard task-based GPU programming. We should consider tracking the stream each previous task ran on, and then reusing that stream for any future task that depends on it, so that the stream's own ordering enforces the dependency without a host-side sync.
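To make the idea concrete, here is a rough sketch of the bookkeeping only, with no real GPU calls. Everything in it is hypothetical: `StreamTracker`, `schedule`, and the use of integer stream ids are stand-ins, and the policy (inherit the first dependency's stream, report the rest as needing a device-side wait such as an event) is just one possible choice, not a settled design.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class StreamTracker:
    """Hypothetical bookkeeping: remembers which stream each task ran on."""
    task_stream: dict = field(default_factory=dict)  # task name -> stream id
    _next_stream: count = field(default_factory=count)

    def schedule(self, task, deps=()):
        """Pick a stream for `task`; return (stream, deps needing a device-side wait)."""
        known = [d for d in deps if d in self.task_stream]
        if known:
            # Reuse the first dependency's stream: in-stream ordering already
            # guarantees `task` runs after it, with no host synchronization.
            stream = self.task_stream[known[0]]
        else:
            # Independent task: give it a fresh stream so it can overlap others.
            stream = next(self._next_stream)
        # Dependencies that ran on other streams still need a device-side
        # wait (for example an event), but not a host-side one.
        waits = [d for d in known if self.task_stream[d] != stream]
        self.task_stream[task] = stream
        return stream, waits


tracker = StreamTracker()
a = tracker.schedule("a")              # independent, gets a new stream
b = tracker.schedule("b")              # independent, gets another stream
c = tracker.schedule("c", ["a"])       # inherits a's stream
d = tracker.schedule("d", ["c", "b"])  # inherits c's stream, must wait on b
```

One known weakness of this policy: if two tasks both depend on the same parent, they both land on the parent's stream and serialize, so a real implementation would probably let only one dependent inherit the stream.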