TApplencourt opened this issue 1 year ago
(Pinging @pszi1ard I hope it's you :) )
Yes, thanks for looping me in!
What I'd add is: an efficient way to submit barriers and obtain related events (like `ext_oneapi_submit_barrier`).
For an example of how we use in-order queues and events in GROMACS, this might be useful: https://gitlab.com/gromacs/gromacs/-/blob/main/src/gromacs/gpu_utils/device_event_sycl.h
One more thing I forgot to mention in the call: we implement event consumption counters as a means of correctness checking, some form of that could prove to be a useful SYCL feature.
What do you mean by "event consumption counters"?
I have a big plan to propose some callback mechanism for SYCL to enable verification/profiling of what the runtime is doing. Maybe this has some overlap with your event-consumption check.
For reference, our extension for this purpose: https://github.com/OpenSYCL/OpenSYCL/blob/develop/doc/extensions.md#hipsycl_ext_coarse_grained_events
Unlike the Intel extension, our events remain semantically valid. They just are less efficient if you actually try to synchronize with them, while potentially being far more efficient to construct.
> What do you mean by "event consumption counters"?
What we do in GROMACS is that we (in our wrapper) specify for each event "how many times it is to be waited upon". This helped us find things like missing synchronizations. By "waited upon" I mean any use as a synchronization point: `event::wait()`, `x.depends_on(event)`, etc.
Simple example: we do a D2H USM copy but forget to wait for it before using the data on the host. It usually worked fine because the operation is fast enough; yet, in some rare cases, the data was accessed before the copy was completed. A nasty, hard-to-debug issue. Adding this check allowed us to see, on the next step when the event was overwritten by a new one, that it was never consumed.
So, it can be helpful to have an extra property saying, "The event from this submission will be used as a synchronization point N times". In the case of N=0, we get the "coarse-grained event" / "discard event" behavior.

A simpler variant could be only enforcing that the event must be waited upon at least once (without specifying the exact number of times). If it is not, an exception is thrown at event destruction.
That is unlikely to be a big issue in a nice clean codebase, but as the spaghetti monster grows and the tasks are submitted from multiple coupled modules, such checks could help a lot.
> For reference, our extension for this purpose: https://github.com/OpenSYCL/OpenSYCL/blob/develop/doc/extensions.md#hipsycl_ext_coarse_grained_events
That works very well for us and seems like the most flexible solution.
Although, in the context of consumption counting, throwing an exception when trying to synchronize with such events might be beneficial.
Another option to consider: set a property on the queue to discard/coarsen events, yet, optionally, be able to specify at submission that for this operation we want a "normal" event. But that's deep in syntactic-sugar territory. For GROMACS, that would lead to slightly more concise code.
`update_host`, when it is a no-op, likely returns an event, but probably not a new one. The above would restrict us to knowing the property at compile time. Is that sufficient for the use cases (e.g., GROMACS) if the property was per-submission and/or per-queue?
For us, compile-time properties are OK.
We do have a few cases where it's decided at runtime, but we already use branches there, and it's not too bad.
> Property at queue and point of submission?
We need a per-submission setting. We don't use per-queue settings currently.
> Simple example: we do a D2H USM copy but forget to wait for it before using the data on the host. It usually worked fine because the operation is fast enough; yet, in some rare cases, the data was accessed before the copy was completed. A nasty, hard-to-debug issue. Adding this check allowed us to see, on the next step when the event was overwritten by a new one, that it was never consumed.
Another problematic case is that of "stale" events, e.g., a conditionally produced event gets reused without being re-triggered, for instance because of a mismatch in the conditionals around the producer and consumer tasks. Such cases, too, are very hard to catch, and that's where our max consumption counts in GROMACS help: if we know a task produced data for two consumers, the corresponding event's third consumption can be treated as an error (in debug mode).
Please find below the summary of the discussion and my take on this ~issue~ opportunity for improvement.

TL;DR: Based on the hipSYCL idea, I propose to add a new queue property named `hints::event_unused_mostly`. When using it, events will be implemented as barriers.

1. SYCL events can be implemented as barriers. This is not efficient in the case of events used to specify dependencies, but it follows the correct semantics. Barriers can be implemented "on the fly" for many backends, hence no overhead when not used. Add a new "hint" parameter to the queue property:

```cpp
sycl::queue Q{sycl::property::queue::hints::event_unused_mostly};
e = Q.submit(); // No runtime overhead if not used
```

Pro:
Cons:
- Coarse events will be used on in-order queues, where explicit `depends_on` dependencies are uncommon. (Note that hipSYCL allows `submit` to take a `property_list`, hence enabling this optimization for specific submissions.)

2. Use C++ magic to change the return type of `Q.submit`: either an event or `void`. Add a new tag parameter to `submit`, used to infer the return type:

```cpp
Q.submit(cgh, discardEvent);
```

Pro:
Cons:
- How does this interact with `auto`?

3. https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_oneapi_discard_queue_events.asciidoc: one can add a new state to the event to specify that it is "unusable". Hence, such events are easy to implement :) Add a new parameter to the queue property:

```cpp
sycl::queue Q{property::queue::fake_event};
e1 = Q.submit();
// e1.wait(); // This will throw.
```

Pro:
Con:
- With `sycl::queue Q{in_order, fake_event}`, what should the following do, compared to an out-of-order queue?

```cpp
Q.submit();
Q.submit();
Q.submit().wait();
```

PS: Thanks @nliber for the help :)
@TApplencourt: Are there plans to allow per-submission properties?
We have things like

```cpp
Q.submit(/*A*/);
e = Q.submit(/*B*/);
Q.submit(/*C*/);
e.wait(); // or insert `e` as a dependency into another queue
```
in GROMACS. If I understand the proposal correctly, that would cause over-synchronization when used per-queue.
@illuhad can confirm, but my understanding is that yes, `e.wait()` will block until C is finished. Good catch!
We can extend the proposal to add a property list to the submission (as in the hipSYCL extension), or add a new template parameter. This seems to be a bigger change spec-wise (as we will also need to change all the shortcuts). I have no strong opinion about this. I will let other people chime in :)
> @illuhad can confirm, but my understanding is that yes, e.wait() will block until C is finished. Good catch!
Yes, this is why you would want per-submission properties in some cases.
> This seems to be a bigger change spec-wise (as we will need to change also all the short-cut)
In our extension, if you want to use per-submission properties, you have to use `submit()`. This substantially simplifies the API and implementation. The reasoning is that the shortcuts are for the simple cases when you don't want to type much ("simple things should be simple"), but when you start optimizing and adding a lot of properties to tweak behavior, you are no longer in the simple case and can afford typing a few additional characters :-)
Notes on `sycl::property::queue::hints::event_unused_mostly`:
- No change in semantics.
- What happens with profiling? hipSYCL should work when profiling is enabled; an implementation could ignore the hint. Enabling profiling can change implementation choices (e.g., concurrency) and measure something different, but that is no different from normal.
- What happens with interop? Need to be careful.
- Implementation option: decay to a regular event if used as a dependency in the DAG. In practice, return the event of the barrier, not of the kernel. Scheduling will change in this case, but that is implementation-specific.
- Counter-point: the impact on the spec may not be large if the return type depends on a tag. A queue can act differently without the caller being aware when inside a library. Extend the existing `submit` with a new argument (e.g., a compile-time property) which changes the return type to `void`. This becomes a property of each submission, not a property of the queue. Easy to write wrapper functions.
- The compiler might help find bugs with this approach. Implementation experience with this approach is needed!
- Choice: a hint, or a semantic change. Need a property on submit. Semantics are easier to reason about if the property changes the return type: fail at compile time if one tries to get an event.
- Continue discussion.
@al42and, @pszi1ard: are you submitting your own kernels, or also using this optimization with library calls? For example, oneMKL:

```cpp
gemv(q, ...);
gemv(q, ...);
gemv(q, ...);
q.wait();
```
Also, are you using this coarse-grain optimization together with profiling events?
After today's feedback (@Pennycook, @illuhad; I hope I convey your points of view correctly):
Both options add a `property_list` to the submission. We will not touch the queue; people can always write their own wrapper.

1. Add a compile-time property that will change the return type of `submit` to `void`:

```cpp
(void) q.submit(property_list={no_event});
```
Pro:
- A clear split between `submit-event` and `submit-void`.
- No change to the `submit-event` behavior; interop and profiling behavior doesn't change.

Cons:
- Adoption of this new feature by current libraries may be challenging. What about libraries that return an `event`, like MKL?

```cpp
namespace oneapi::mkl::blas::column_major {
sycl::event gemv(sycl::queue &queue,
                 onemkl::transpose trans,
                 [...]
                 std::int64_t incy,
                 const std::vector<sycl::event> &dependencies = {});
}
```

- No event, so one cannot profile. It may not be an issue.
- Maybe hard to implement on top of the already-crazy shortcut templates.
- "Complexification" of the mental model of SYCL (now a submit may or may not return an event).
2. Same as 1, but with a new API name (`enqueue`, for example).

Pro:
- Avoids the "hard to implement" part.

3. A runtime or compile-time property:

```cpp
e = q.submit(property_list={hint::event_not_used});
```

Pro:
- It's just a hint (like `prefetch` or `work_group_size_hint`).

Cons:

Next step: wait for implementers to implement and give feedback on their experience. Should we write a "formal" specification of the two approaches?
> @al42and, @pszi1ard: are you submitting your own kernels, or also using this optimization with library calls? (for example oneMKL)
We submit both our own kernels and oneMKL operations (or other native libraries via interop) to the same queue. For example, we have:
```cpp
// q and q2 are both in-order
q.submit(/*SpreadKernel*/);
mkl::dft::compute_forward(); // or native library via h.get_native_queue<...>(), but it already does not insert any events
q.submit(/*SolveKernel*/);
mkl::dft::compute_backward(); // or native library...
q.submit(/*GatherKernel*/);
sycl::event e = q.ext_oneapi_submit_barrier();
q2.ext_oneapi_submit_barrier(e); // Or q.wait() if we need to send data via MPI to another rank
```
For hipSYCL, we use `hipSYCL_enqueue_custom_operation` both for barriers and interop.
> Also, are you using this coarse-grain optimization and profiling event?
We do have profiling mode, but it's turned on separately; having coarse-graining and profiling mutually exclusive is okay as long as runtime handles it gracefully.
In-app timings, currently, are used only for statistics; even if/when we use it for dynamic load balancing, we would be more interested in region-style measurements (e.g., the total time of all the operations in the example above).
> In our extension, if you want to use per-submission properties, you have to use submit(). This substantially simplifies the API and implementation. The reasoning is that the shortcuts are for the simple cases when you don't want to type much ("simple things should be simple"), but when you start optimizing and adding a lot of properties to tweak behavior, you are no longer in the simple case and can afford typing a few additional characters :-)
That sounds reasonable.
> 1/ Templated return type: "Complexification" of the mental model of SYCL (now a submit may or may not return an event)
IMHO, it's less complex than "the returned event may or may not be a weird lazily-evaluated barrier".
> We do have profiling mode, but it's turned on separately; having coarse-graining and profiling mutually exclusive is okay as long as runtime handles it gracefully.
> In-app timings, currently, are used only for statistics; even if/when we use it for dynamic load balancing, we would be more interested in region-style measurements (e.g., the total time of all the operations in the example above).
Actually, I think mixing profiling needs with efficient scheduling is detrimental. Profiling is an activity with a very different scope and requirements than those of a production run (concurrent kernels vs. kernels run in isolation for reproducible measurements, etc.).
I think profiling should be set up as a separate step and not be coupled to the events used for expressing dependencies.
Thanks for your replies!
To be clear, are you ok with `mkl::dft::compute_forward()` creating/returning an event?
Is the assumption that latency for external library calls is not an issue? Or, in a perfect world, would you like `dft_compute_forward()` to be latency-optimized?
> IMHO, it's less complex than "the returned event may or may not be a weird lazily-evaluated barrier".
Fair point, let me edit the summary with that :)
> To be clear, are you ok with mkl::dft::compute_forward(); creating/returning an event? Is the assumption that latency for external library calls is not an issue? Or in a perfect world, or would you like a dft_compute_forward() latency optimized?
We would like it to be optimized. But we have fewer external calls than custom kernels, so it's less critical.
Side note: Should we invite Celerity and HPX folks here?
> Side note: Should we invite Celerity and HPX folks here?
Oh yes, please! The more, the merrier. I invited my Kokkos and Occa friends already :)
Coming late to this, I agree that having a per-submission property is better than a per-queue property.
In Kokkos, we often use out-of-order queues with `submit_barrier`, and we only really care about the events we get from `submit_barrier`. Thus, approach 1 sounds very reasonable (assuming we can still get the events from `submit_barrier`).
I agree with @al42and that the semantics in the first approach are clearer than in the second, where the event might imply a barrier (or not).
Coming at this from the perspective of gtensor (https://github.com/wdmapp/gtensor), used by GENE (https://genecode.org). Gtensor emulates a library-state-style API like CUDA/HIP, so we never return events in its public API. We do use in-order queues (again, to be more CUDA-like) and internally call lots of oneMKL routines for gt-blas, gt-fft, etc., and could potentially benefit from avoiding event overhead (although I doubt we are submission-latency sensitive very often, if at all, for most workloads).
Where this affects gtensor/GENE is the viability of supporting it in oneMKL. I think, even more than in the general case, most oneMKL ops are big and not sensitive to submit latency. However, if that ever became an issue, (1) makes it challenging to support in oneMKL. I lean towards (2) for this reason: it leaves a much clearer path to library support, which seems like it may be important for some applications some of the time. I think many libraries do not have a natural single-function / multi-op advanced interface like "submit", so if you need to support different return semantics, you have a lot of work to make a second version of every operation. (1) feels "cleaner" in that it can catch misuse at compile time, but is it worth the loss of flexibility and the increase in library-support complexity?
Edit: I am pretty sure gtensor etc. could use this as a per-queue property and does not need per-submit. We call `e.wait` in a few places, but I believe all could be replaced with `q.wait`.
If we were to follow the principle that simple things should be simple, then I think the default behavior of the "submit" APIs would be to not return an event. Simple uses of SYCL do not need an event because you either get dependencies implicitly (via accessors) or you don't need them at all (in-order queues). It does not make sense to pay the overhead of an event in these simple cases when it is not needed.
I don't really like either of the proposals above because you get the current behavior of creating an event by default unless you specify some new property. It would be better to have the simple behavior (no event) by default, requiring the user to type something extra only in the case when they want the event.
Obviously, we cannot change the semantics of `queue::submit` at this time because that would be a huge breaking API change. However, we can deprecate `queue::submit` and migrate code to a new API with a new name.

We can bikeshed the name later, but could we deprecate `queue::submit` in SYCL-Next and introduce some new API like `queue::enqueue`? The default behavior of this API would be to not return an event.
Clearly, we still need some way to get an event. Since this is a more advanced use case, we can expect the user to type more characters. For example, we could add a new API `queue::enqueue_with_event` for this purpose. (We could also overload `queue::enqueue` with a compile-time property similar to the previous proposal. However, it seems weird to me for the return type to change based on a property.)
This still leaves the 34 `queue` shortcut functions defined in section 4.6.5.2 (e.g. `queue::single_task`, etc.). One option is to deprecate all of these also and introduce new shortcut functions with names like `queue::enqueue_single_task`. Another option is to just leave all the shortcut functions the way they are now (returning an event), which is similar to some of the proposals above.
I know there is concern expressed above about expanding the API surface of the spec too much. However, this proposal does not add new APIs in the long term. It does add some new APIs in the short term, but it also deprecates some existing APIs. In the long term, those deprecated APIs will be removed, so there is no net increase over the long term.
I like the idea of a new API; it's the choice that introduces the least change to the spec (other solutions need to introduce both a new property list for `submit` and the associated property).

For the shortcut functions, I think it's OK to not have them. If people want to use advanced features, they can pay the price for it. Here we hit the point of contention: I do think that returning an event and an out-of-order queue are sane and simple defaults. But fortunately, we don't need to settle that argument here.
We should give users the possibility of not paying the price for events they don't use. What the default should be is, IMO, a separate follow-up topic.
I will edit my summary with your proposition, Greg.
Out of the two options described, I would prefer option (1), as I worry option (2) introduces too many subtle changes to the runtime behavior that we would need to account for; scope of synchronization, profiling, interop, etc.
Looking at option (1), I like the compile-time property idea, but I agree with @gmlueck's comment that if we are to follow the principle of keeping simple things simple, ultimately the default behavior should be that `submit` does not return an `event`, and the user should opt into this. If this is the goal, then I think the clearest route to achieve this would be to introduce a new interface, such as a new `submit`-like API which doesn't return an `event` by default.

If we are changing the interface, I think we'd have a few options for how to handle `event`s optionally. We could, as @gmlueck suggested, have an alternative function that returns an `event`, or a compile-time property. Alternatively, we could also have overloads that take the `event` as a parameter.

Additionally, if we are considering changing the `queue` interface to address this problem, we may want to also consider introducing a new object rather than (or in addition to) a new member function. If we look at the use case for `event`s: if you're using an in-order `queue` you generally don't need to use `event`s, unless you're synchronizing between multiple in-order `queue`s, though in that case something like `submit_barrier` would be sufficient. So we could introduce a new object to represent in-order `queue`s; something like `execution_stream`, that's lower-level, in-order, and would not return `event`s. This would also create a clearer distinction between the in-order and out-of-order use cases in the type system and SYCL runtime semantics.
I presented a summary of the current ideas at the SYCL WG. Slide: SYCL_Optimize_non_used_events.pdf
One possible plan of action was:
- Postpone a new `submit` (or a new type of `queue`) that doesn't return events to future versions of SYCL.
- Add a `property_list` to submit, and then add a performance hint: `event-not-used`.
One question that appears is whether "events as barriers" are implementable (in the highly unlikely case when events are used even though users promised that they would not be).
I will discuss below my naive "implementation" (@illuhad is smarter and has implemented something better in hipSYCL).
If profiling is enabled on the queue, the hint will be ignored.
> A SYCL event can encapsulate one or multiple OpenCL events, representing a number of dependencies in the same or different contexts that must be satisfied for the SYCL event to be complete.

"Luckily/weirdly": nothing is said about binding the event to the command's last SYCL command. So a `get_native` -> `clEnqueueBarrierWithWaitList` that returns the generated event is conformant.
If we want to keep intact people's illusion that SYCL is just a super-thin wrapper on top of native backends, we can add `has_native_event` to the OpenCL interop spec:

> A SYCL event created from a CUstream will encapsulate that event. A single CUevent is potentially retrieved from a SYCL event if there is a valid native CUevent associated with it, otherwise nullptr is returned instead. The CUDA backend-specific free function cuda::has_native_event can be used to query whether the SYCL event has a valid native CUevent associated with it.

Barrier then cuRegisterEvent, or just have `has_native_event` return false.
What is the behavior of `handler::depends_on` for one of these "not-used" events?
Submit a barrier, wait on it, and then submit the `depends_on` kernel. This mimics that the previous kernel (the kernel that generated the event we now depend on) has already finished (for DPC++ with the L0 backend, it's "kind of" similar to ZE_SERIALIZE=2). This works because one needs to submit the DAG of kernels in a valid order.
FWIW, I saw a 33% speedup on end-to-end runtime for my analytic continuation application when disabling events with the `sycl::ext::oneapi::property::queue::discard_events` property.
Thomas asked me to provide a quick update here, regarding an alternative proposal we've been sketching in https://github.com/intel/llvm/pull/11491. The basic idea is that it uses free functions (with new names) which do not return events, and any developer that wants an event has to explicitly opt-in through the creation of a handler.
```cpp
// Default: Do not wait for any events, do not create any events
sycl::parallel_for(q, ...);

// Wait for events
sycl::submit(q, [&](sycl::handler& h) {
  h.depends_on(...);
  sycl::parallel_for(h, ...);
});

// Wait for events and create an event
sycl::event e = sycl::submit_with_event(q, [&](sycl::handler& h) {
  h.depends_on(...);
  sycl::parallel_for(h, ...);
});
```
On the topic of per-submission properties, here is another, unrelated, use-case: https://github.com/unisa-hpc/SYnergy/blob/60f66bb769d4a67f83b7c2c53d2fd353e2442f9e/include/queue.hpp#L100-L112.
It is, currently, a special class with queue-like semantics. But one can easily imagine it becoming a part of a SYCL implementation. Then, being able to pass a `property_list` to `queue::submit` would lead to a much cleaner and more composable API than tons of different overloads / variants (submit with event and with default performance hints, submit with event and with custom frequencies, submit without an event and with default frequencies, ...). `queue::submit`, with two nested lambdas, is already a cognitive burden to read. Spicing it up with macros to call the correct version based on which features the compiler supports and which features are needed here and now will not make things better. In this respect, `property_list` looks even more appealing to me.
> Then, being able to pass a `property_list` to `queue::submit` would lead to a much cleaner and more composable API than tons of different overloads / variants (submit with event and with default performance hints, submit with event and with custom frequencies, submit without an event and with default frequencies, ...).
The proposal from https://github.com/intel/llvm/pull/11491 already supports per-submission properties.
Building on my example above, these frequency properties could be added like so:
```cpp
auto properties = sycl::ext::properties(sycl::ext::uncore_frequency(x), sycl::ext::core_frequency(y));

// Default: Do not wait for any events, do not create any events
sycl::parallel_for(q, sycl::launch_config(r, properties), ...);

// Wait for events
sycl::submit(q, [&](sycl::handler& h) {
  h.depends_on(...);
  sycl::parallel_for(h, sycl::launch_config(r, properties), ...);
});

// Wait for events and create an event
sycl::event e = sycl::submit_with_event(q, [&](sycl::handler& h) {
  h.depends_on(...);
  sycl::parallel_for(h, sycl::launch_config(r, properties), ...);
});
```
Do you think we need something else?
> Do you think we need something else?
No, I was just mentioning it here as a tidbit relevant to the overall discussion (which has two parts to it: how to optimize events, and whether to add per-submission property_list) :)
As I understand, `sycl_ext_oneapi_enqueue_functions` is, for now, just an extension, so how it will be implemented in the standard is still not fully settled.
> Thomas asked me to provide a quick update here, regarding an alternative proposal we've been sketching in intel/llvm#11491. The basic idea is that it uses free functions (with new names) which do not return events, and any developer that wants an event has to explicitly opt-in through the creation of a handler.
Is this implemented anywhere now?
> Is this implemented anywhere now?
A "true" implementation of these interfaces (i.e., one that removes the overheads associated with creating events) is still a work-in-progress. But if you just want to experiment with the new syntax, the University of Bristol wrote a small shim layer: https://github.com/UoB-HPC/BabelStream/blob/268315376390819147c7a8447d50a89d3220de0e/src/sycl_ext_enqueue_functions.h
During today's SYCL Advisory Panel Meeting, it was clear that we need a way of not always generating events when submitting commands.
Some ideas suggested during the call (before I forget):
- A new `submit` (and associated shortcuts?) which doesn't return an event.