IntelPython / numba-dpex

Data Parallel Extension for Numba
https://intelpython.github.io/numba-dpex/
Apache License 2.0

Allow returning a SyclEvent from kernels instead of always adding a wait. #916

Closed diptorupd closed 10 months ago

diptorupd commented 1 year ago

Numba-dpex currently allows only synchronous execution of kernels: a wait call is issued before every kernel call returns. The limitation is a legacy of supporting NumPy arrays as kernel arguments.

Since passing NumPy arrays as kernel arguments is deprecated in numba-dpex 0.20, we should be able to return dpctl.SyclEvent objects from a kernel launch instead of always calling wait.

Related #147 #769

Originally posted by @diptorupd in https://github.com/IntelPython/numba-dpex/issues/816#issuecomment-1433983402

diptorupd commented 1 year ago

@oleksandr-pavlyk @ZzEeKkAa @fcharras @beckermr

The initial design for the implementation of the feature can be as follows:

  1. Make the dpctl.SyclEvent class recognizable and supported by Numba type inferencing, i.e., add typeof, unbox, and box support for the dpctl.SyclEvent type.
  2. Define a syntax for passing and returning events from kernel function calls. An initial design can be as follows:
import numba_dpex as dpex
import dpnp
import dpctl

@dpex.kernel(nowait=True)
def foo(a, b, c):
    i = dpex.get_global_id(0)
    c[i] = a[i] + b[i]

q = dpctl.SyclQueue("gpu")
a = dpnp.ones(10, sycl_queue=q)
b = dpnp.ones(10, sycl_queue=q)
c = dpnp.zeros(10, sycl_queue=q)

dep_events_list = []
e1 = foo[dpex.Range(10), dep_events_list](a, b, c)
e2 = foo[dpex.Range(10), [e1]](c, b, a)
e2.wait()
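The dispatch behavior the proposed nowait flag would toggle can be illustrated with a stdlib analogy (a conceptual sketch only, not numba-dpex code: launch, vadd, and the use of concurrent.futures.Future in place of dpctl.SyclEvent are all hypothetical stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor, wait

_pool = ThreadPoolExecutor(max_workers=2)

def launch(kernel, args, nowait=False, dep_events=None):
    """Toy dispatcher. With nowait=False it blocks until completion
    (the current numba-dpex behavior); with nowait=True it returns a
    Future, standing in for the proposed dpctl.SyclEvent return value.
    """
    if dep_events:
        wait(dep_events)      # honor the dependent-events list
    fut = _pool.submit(kernel, *args)
    if nowait:
        return fut            # caller synchronizes explicitly
    fut.result()              # implicit wait, as today
    return None

def vadd(a, b, c):
    for i in range(len(c)):
        c[i] = a[i] + b[i]

a, b = [1.0] * 10, [1.0] * 10
c = [0.0] * 10
e1 = launch(vadd, (a, b, c), nowait=True)
e2 = launch(vadd, (c, b, a), nowait=True, dep_events=[e1])
e2.result()                   # analogue of e2.wait()
print(a[0])                   # 3.0: the second launch saw c = a + b
```

The dependency chain mirrors the proposal: the second launch lists the first launch's event as a dependency, so it observes the completed write to c.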

In the snippet above, I am proposing adding a new flag to the kernel decorator called nowait. The nowait option defines the dispatch behavior of the kernel. If nowait is set to True, the kernel is dispatched asynchronously, i.e., the launch returns a dpctl.SyclEvent object. If the option is set to False, kernel dispatch keeps the current behavior, where a wait is applied inside the kernel launch. Thus, in the example snippet, the second call to foo adds a dependency on the event (e1) returned by the first call to foo.

I also think the nowait default should be False and it should be explicitly set by users who want async execution.

Moreover, the lifetime management of the kernel arguments is deferred to users. It is their responsibility to keep the kernel arguments "alive" if they want to use the nowait=True option.

The proposal applies only to kernel-decorated functions; dpjit will not support async execution for parfor nodes.

beckermr commented 1 year ago

I assume the [e1] syntax in the kernel execution call for the second kernel tells the second kernel to wait for the first?

diptorupd commented 1 year ago

I assume the [e1] syntax in the kernel execution call for the second kernel tells the second kernel to wait for the first?

Yes, after these changes the kernel launch will take a dpex.Range or a dpex.NdRange object specifying the index space for the execution and an optional list of dependent events.

In my example, the second call to foo ensures that the second kernel launch adds a dependency on the event returned by the first call to foo.

oleksandr-pavlyk commented 1 year ago

I think delegating Python object lifetime management to the user is too restrictive, and sets a naïve user up for crashes. If a function needs to create a temporary to be used by a numba-dpex kernel, the author must ensure that the temporary stays available until the kernel has completed its execution. Failure to do that causes a crash. The only means available to ensure that the temporary buffer persists is to call e1.wait() before returning from the function.
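The hazard can be made concrete with a stdlib stand-in for an asynchronously launched kernel (all names below are hypothetical; FakeEvent merely mimics dpctl.SyclEvent.wait(), and the worker thread plays the role of the device):

```python
import threading
import time

class FakeEvent:
    """Toy stand-in for dpctl.SyclEvent (hypothetical)."""
    def __init__(self, thread):
        self._thread = thread
    def wait(self):
        self._thread.join()

def launch_async(buf):
    """Pretend to submit a kernel that writes into buf later."""
    def work():
        time.sleep(0.01)   # simulate device latency
        buf[0] = 42
    t = threading.Thread(target=work)
    t.start()
    return FakeEvent(t)

def compute():
    tmp = [0]              # temporary buffer the "kernel" writes into
    e = launch_async(tmp)
    e.wait()               # without this, tmp could be reclaimed while
                           # the kernel is still writing to it
    return tmp[0]

print(compute())           # 42
```

Waiting before the temporary goes out of scope is the pattern described above: it is the caller's only tool for keeping the buffer alive through kernel execution.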

diptorupd commented 1 year ago

I think delegating Python object lifetime management to the user is too restrictive, and sets a naïve user up for crashes.

That is why I am proposing keeping the default option for nowait as False. So, it is really the responsibility of the user to know what they are doing.

oleksandr-pavlyk commented 1 year ago

It should be made very clear that the user must ensure that the USM allocations a kernel is to process are not deallocated before all tasks in the submitted DAG of kernels complete execution.

We need to provide tools for users to accomplish this. Here are a few ideas:

  1. Wait for all tasks to complete, i.e., call sycl_event.wait()
  2. The user may return an object that contains the sycl_event as well as references to the Python objects owning the USM allocations, such as dpctl.tensor.usm_ndarray
  3. dpctl.utils could expose a Python API that increments the ref-counts of a provided list of Python objects and schedules a host_task on a provided sycl::queue, dependent on a provided list of dpctl.SyclEvent objects, to decrement those ref-counts.

The latter is what the dpctl.tensor implementation does. It would be up to the user which method to use.
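Ideas 2 and 3 can be sketched with stdlib stand-ins (everything below is hypothetical: _DoneEvent plays the role of a completed dpctl.SyclEvent, and the thread in keep_alive_until plays the role of the proposed host_task):

```python
import threading
from dataclasses import dataclass, field

@dataclass
class EventWithRefs:
    """Idea 2: return the event bundled with the objects owning the
    USM allocations, so callers cannot drop them prematurely."""
    event: object
    refs: tuple = field(default_factory=tuple)

    def wait(self):
        self.event.wait()

def keep_alive_until(event, objs):
    """Idea 3 (sketch): hold extra references until the event
    completes, mimicking a host_task that decrements ref-counts."""
    holder = list(objs)          # extra refs, like Py_INCREF

    def _release():
        event.wait()
        holder.clear()           # drop them, like Py_DECREF

    t = threading.Thread(target=_release)
    t.start()
    return t

class _DoneEvent:
    """Trivially completed event, standing in for dpctl.SyclEvent."""
    def wait(self):
        pass

a, b = [1, 2], [3, 4]
bundle = EventWithRefs(_DoneEvent(), refs=(a, b))   # idea 2
bundle.wait()
keep_alive_until(_DoneEvent(), [a, b]).join()       # idea 3
```

Either way the invariant is the same: some object graph reachable from Python keeps the USM-owning objects alive until the event signals completion.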

It is a good idea to illustrate each of these uses with examples in some tutorial, on either numba-dpex or dpctl side.

ZzEeKkAa commented 1 year ago

Checklist on the steps needed to implement this feature:

diptorupd commented 10 months ago

@ZzEeKkAa can this ticket be closed now after the merge of #1249?

ZzEeKkAa commented 10 months ago

Resolved with #1249 merge