Closed diptorupd closed 10 months ago
@oleksandr-pavlyk @ZzEeKkAa @fcharras @beckermr
The initial design for the implementation of the feature can be as follows:
dpctl.SyclEvent
class recognizable and supported by Numba type inferencing, i.e., add support for typeof
, unbox
, box
for the dpctl.SyclEvent
type.import numba_dpex as dpex
import dpnp
import dpctl
@dpex.kernel(nowait=True)
def foo(a,b,c):
i = dpex.get_global_id(0)
c[i] = a[i] + b[i]
q = dpctl.SyclQueue("gpu")
a = dpnp.ones(10, sycl_queue=q)
b = dpnp.ones(10, sycl_queue=q)
c = dpnp.zero(10, sycl_queue=q)
dep_events_list =[]
e1 = foo[dpex.Range(10), dep_events_list](a,b,c)
e2 = foo(dpex.Range(10), [e1]](c,b,a)
e2.wait()
In the snippet above, I am proposing adding a new flag to the kernel
decorator called nowait
. The nowait
option will define the behavior of the kernel dispatch. If nowait
is set to True
the kernel will be dispatched asynchronously, i.e., returns an dpctl.SyclEvent
object. If the option is set to False
, then kernel dispatch keeps the current behavior where we apply a wait inside the kernel launch. Thus, in the example snippet the second call to foo
ensures that the second kernel launch adds a dependency on the event (e1
) returned by the first call to foo
.
I also think the nowait
default should be False
and it should be explicitly set by users who want async execution.
Moreover, the lifetime management of the kernel arguments is deferred to users. It is their responsibility to keep the kernel arguments "alive" if they want to use the nowait=True
option.
The proposal applies only to kernel
decorated functions and dpjit
will not support async execution for parfor nodes.
I assume the [e1]
syntax in the kernel execution call for the second kernel tells the second kernel to wait for the first?
I assume the
[e1]
syntax in the kernel execution call for the second kernel tells the second kernel to wait for the first?
Yes, after these changes the kernel launch will take a dpex.Range
or a dpex.NdRange
object specifying the index space for the execution and an optional list of dependent events.
In my example, the second call to foo
ensures that the second kernel launch adds a dependency on the event returned by the first call to foo
.
I think delegating Python object lifetime management to the user is too restrictive, and sets naïve user up for crashes. If a function needs to create a temporary to be used by a numba-dpex kernel, author must ensure that the temporary is available until the kernel has completed its execution. Failure to do that causes a crash. The only means available to ensure the temporary buffer persists is to call e1.wait()
before returning from the function.
I think delegating Python object lifetime management to the user is too restrictive, and sets naïve user up for crashes.
That is why I am proposing keeping the default option for nowait
as False
. So, it is really the responsibility of the user to know what they are doing.
It should be made very clear that user must ensure that USM allocations kernels are to process must not be deallocated sooner than all tasks associated with the submitted DAG of kernels complete execution.
We need to provide tools for users to accomplish this. Here are few ideas:
sycl_event.wait()
sycl_event
as well as references to Python objects owning USM allocations, such as dpctl.tensor.usm_ndarray
dpctl.utils
could expose a Python API to increment ref-counts of provided list of Python objects and schedule a host_task
to provided sycl::queue
that depends on provided to list of dpctl.SyclEvent
objects to decrement those ref-counts.The latter is what dpctl.tensor
implementations do. It would be up to the user which method to use.
It is a good idea to illustrate each of these uses with examples in some tutorial, on either numba-dpex or dpctl side.
Checklist on the steps needed to implement this feature:
@ZzEeKkAa can this ticket be closed now after the merge of #1249?
Resolved with #1249 merge
Numba-dpex only allows synchronous execution of kernels by adding a
wait
call before returning from a kernel call. The limitation is a legacy of supporting NumPy arrays as kernel arguments.Since, NumPy arrays as kernel arguments is deprecated in numba-dpex 0.20, we should be able to return
dpctl.SyclEvent
objects from a kernel instead of always callingwait
.Related #147 #769
Originally posted by @diptorupd in https://github.com/IntelPython/numba-dpex/issues/816#issuecomment-1433983402