KhronosGroup / SYCL-Docs

SYCL Open Source Specification

Submission optimization when events are not used #404

Open TApplencourt opened 1 year ago

TApplencourt commented 1 year ago

During today's SYCL Advisory Panel Meeting, it was clear that we need a way of not always generating events when submitting commands.

Some ideas suggested during the call (before I forgot):

(Pinging @pszi1ard I hope it's you :) )

pszi1ard commented 1 year ago

(Pinging @pszi1ard I hope it's you :) )

Yes, thanks for looping me in!

What I'd add is: an efficient way to submit barriers and obtain related events (like ext_oneapi_submit_barrier).

For an example of how we use in-order queues and events in GROMACS, this might be useful: https://gitlab.com/gromacs/gromacs/-/blob/main/src/gromacs/gpu_utils/device_event_sycl.h

pszi1ard commented 1 year ago

One more thing I forgot to mention in the call: we implement event consumption counters as a means of correctness checking; some form of that could prove to be a useful SYCL feature.

TApplencourt commented 1 year ago

What do you mean by event consumption counters?

I have a big plan to propose some callback mechanism for SYCL to be able to do verification/profiling of what the runtime is doing. Maybe this has some overlap with your event consumption check.

illuhad commented 1 year ago

For reference, our extension for this purpose: https://github.com/OpenSYCL/OpenSYCL/blob/develop/doc/extensions.md#hipsycl_ext_coarse_grained_events

Unlike the Intel extension, our events remain semantically valid. They just are less efficient if you actually try to synchronize with them, while potentially being far more efficient to construct.
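A rough usage sketch (the submit overload taking a per-command-group property list is the hipSYCL extension; the property itself is elided here, see the linked doc for the exact names):

#include <sycl/sycl.hpp>

sycl::queue q;
int *p = sycl::malloc_device<int>(1024, q);

// Per-command-group property list (hipSYCL extension); the coarse-grained-events
// property from the linked doc would go inside the braces.
sycl::event e = q.submit(
    { /* coarse-grained events property */ },
    [&](sycl::handler &cgh) {
      cgh.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) { p[i] = int(i[0]); });
    });

e.wait(); // still valid, but may synchronize with more work than just this kernel
sycl::free(p, q);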

al42and commented 1 year ago

What do you mean by event consumption counters?

What we do in GROMACS is that we (in our wrapper) specify for each event "how many times it is to be waited upon". This helped us find things like missing synchronizations. By "waited upon" I mean any use as a synchronization point: event::wait(), x.depends_on(event), etc.

Simple example: we do a D2H USM copy but forget to wait for it before using the data on the host. It usually worked fine because the operation is fast enough; yet, in some rare cases, the data was accessed before the copy was completed. A nasty, hard-to-debug issue. Adding this check allowed us to see, on the next step when the event was overwritten by a new one, that it was never consumed.

So, it can be helpful to have an extra property saying, "The event from this submission will be used as a synchronization point N times". In the case of N=0, we get the "coarse grain event" / "discard event" behavior.

A simpler example could be only enforcing that the event must be waited upon (without specifying the exact number of times). If not, an exception is thrown at event destruction.
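A minimal sketch of what such a consumption-counting wrapper could look like (hypothetical class, not the actual GROMACS wrapper):

#include <sycl/sycl.hpp>
#include <cassert>

// Hypothetical: the submitter declares how many times the event is expected to
// be used as a synchronization point; every wait()/depends_on() consumes one
// "ticket", and under/over-consumption is caught (here via assert).
class counted_event {
public:
  counted_event() = default;
  counted_event(sycl::event e, int expected_consumptions)
      : event_(e), remaining_(expected_consumptions) {}

  void wait() { consume().wait(); }
  void depends_on(sycl::handler &cgh) { cgh.depends_on(consume()); }

  ~counted_event() {
    // The simpler variant described above: complain if the event was consumed
    // fewer times than promised (N=0 events are allowed to be dropped).
    assert(remaining_ == 0 && "event consumed fewer times than declared");
  }

private:
  sycl::event consume() {
    assert(remaining_ > 0 && "event consumed more times than declared (stale event?)");
    --remaining_;
    return event_;
  }

  sycl::event event_;
  int remaining_ = 0;
};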

That is unlikely to be a big issue in a nice clean codebase, but as the spaghetti monster grows and the tasks are submitted from multiple coupled modules, such checks could help a lot.

For reference, our extension for this purpose: https://github.com/OpenSYCL/OpenSYCL/blob/develop/doc/extensions.md#hipsycl_ext_coarse_grained_events

That works very well for us and seems like the most flexible solution.

Although, in the context of consumption counting, throwing an exception on trying to synchronize with such events might be beneficial.

Another option to consider: set a property on the queue to discard/coarsen events, yet, optionally, be able to specify, at submission, that for this operation, we want a "normal" event. But that's deep in syntactic sugar territory. That, for GROMACS, would lead to slightly more concise code.

tomdeakin commented 1 year ago
al42and commented 1 year ago

above would restrict to knowing property at compile time - is that sufficient for use cases (e.g., GROMACS) if property was on per-submission and/or queue?

For us, compile-time properties are OK.

We do have a few cases where it's decided in runtime, but we already use branches there, and it's not too bad.

property at queue and point of submission?

We need a per-submission setting. We don't use per-queue settings currently.

pszi1ard commented 1 year ago

Simple example: we do a D2H USM copy but forget to wait for it before using the data on the host. It usually worked fine because the operation is fast enough; yet, in some rare cases, the data was accessed before the copy was completed. A nasty, hard-to-debug issue. Adding this check allowed us to see, on the next step when the event was overwritten by a new one, that it was never consumed.

Another problematic case is that of "stale" events, e.g. a conditionally produced event gets reused without being re-triggered, for instance because of a mismatch in the conditionals around the producer and consumer tasks. Such cases, too, are very hard to catch, and that's where our max consumption counts in GROMACS help: if we know a task produced data for two consumers, the corresponding event's third consumption can be treated as an error (in debug mode).

TApplencourt commented 1 year ago

Please find below the summary of the discussion and my take on this ~issue~ opportunity for improvement.

TLDR: Based on the HipSYCL idea, I propose to add a new queue property named hints::event_unused_mostly. When using it, events will be implemented as barriers.

Objective: Reducing the runtime overhead of unused events

1/ "Event as Barrier" [ coarse-grain event in HipSYCL parlance]

https://github.com/OpenSYCL/OpenSYCL/blob/develop/doc/extensions.md#hipsycl_ext_coarse_grained_events

SYCL events can be implemented as barriers. Not efficient when the events are actually used to specify dependencies, but it follows the correct semantics. Barriers can be implemented "on the fly" for many backends, hence no overhead when not used.

Implementation

Add a new "hint" parameter to the queue Property.

sycl::queue Q{sycl::property::queue::hints::event_unused_mostly};
auto e = Q.submit(/* cgf */); // No runtime overhead if the event is not used

Pro:

Cons:

Discarded Proposal:

2/ Multiple return-type

Use C++ magic to change the return type of Q.submit to either an event or void.

Implementation

Add a new tag parameter to submit, used to infer the return type.

> Q.submit(cgf, discardEvent);
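A rough sketch of the kind of C++ machinery this implies (tag types and free functions are made up for illustration; a real proposal would overload queue::submit itself):

#include <sycl/sycl.hpp>
#include <utility>

// Hypothetical tag types; not part of any spec or extension.
struct keep_event_t {};
struct discard_event_t {};
inline constexpr keep_event_t keepEvent{};
inline constexpr discard_event_t discardEvent{};

// The tag selects the return type at compile time.
template <typename CGF>
sycl::event submit(sycl::queue &q, CGF &&cgf, keep_event_t) {
  return q.submit(std::forward<CGF>(cgf)); // normal path: an event is returned
}

template <typename CGF>
void submit(sycl::queue &q, CGF &&cgf, discard_event_t) {
  q.submit(std::forward<CGF>(cgf)); // an implementation could skip event creation entirely
}

With this shape, submit(Q, cgf, discardEvent) returns void, so accidentally trying to use the (non-existent) event fails at compile time.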

Pro:

Cons:

3/ "Invalid" event

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_oneapi_discard_queue_events.asciidoc

One can add a new state to events to specify that they are "unusable". Hence, they are easy to implement :)

Implementation

Add a new parameter to the Queue property.

  sycl::queue Q{property::queue::fake_event};
  auto e1 = Q.submit(/* cgf */);
  // e1.wait();  // This will throw.

Pro:

Con:

PS: Thanks @nliber for the help :)

al42and commented 1 year ago

@TApplencourt: Are there plans to allow per-submission properties?

We have things like

Q.submit(/*A*/);
e = Q.submit(/*B*/);
Q.submit(/*C*/);
e.wait(); // or insert `e` as a dependency into another queue

in GROMACS. If I understand the proposal correctly, that would cause over-synchronization when used per-queue.

TApplencourt commented 1 year ago

@illuhad can confirm, but my understanding is that yes, e.wait() will block until C is finished. Good catch!

We can extend the proposal to add a property list to the submission (as in the hipSYCL extension), or add a new template parameter. This seems to be a bigger change spec-wise (as we would also need to change all the shortcuts). I have no strong opinion about this. I will let other people chime in :)

illuhad commented 1 year ago

@illuhad can confirm, but my understanding is that yes, e.wait() will block until C is finished. Good catch!

Yes, this is why you would want per-submission properties in some cases.

This seems to be a bigger change spec-wise (as we will need to change also all the short-cut)

In our extension, if you want to use per-submission properties, you have to use submit(). This substantially simplifies the API and implementation. The reasoning is that the shortcuts are for the simple cases when you don't want to type much ("simple things should be simple"), but when you start optimizing and adding a lot of properties to tweak behavior, you are no longer in the simple case and can afford typing a few additional characters :-)

tomdeakin commented 1 year ago

sycl::property::queue::hints::event_unused_mostly:

No change in semantics.

What happens with profiling? hipSYCL should work when profiling is enabled. An implementation could ignore the hint. Enabling profiling can change implementation choices (e.g., concurrency) and measure something different, which is no different to normal.

What happens with interop? Need to be careful.

Implementation option: decay to a regular event if used as a dependency in the DAG. In practice, the returned event would correspond to the barrier, not the kernel. Scheduling will change in this case, but that is implementation specific.

Counter-point: impact on spec may not be large if return type depends on tag. Queue can act differently without being aware when inside a library. Extend existing submit with a new argument (e.g., compile-time property) which changes the return type to void. Becomes property of each submission not a property of the queue. Easy to write wrapper functions. Compiler might help find bugs with this approach. Implementation experience with this approach needed!

Choice: Hint, or a semantic change.

Need a property on submit. Semantics easier to reason about if property changed return type: fail at compile time if tried to get an event.

Continue discussion.

TApplencourt commented 1 year ago

@al42and , @pszi1ard are you submitting your own kernel, or also using this optimization with library calls? (for example oneMKL

gemv(q, ...);
gemv(q, ...);
gemv(q, ...);
q.wait();

)

Also, are you using this coarse-grain optimization together with profiling events?

TApplencourt commented 1 year ago

After today's feedback (@Pennycook, @illuhad, I hope I convey your points of view correctly), both options add a property_list to the submission. We will not touch the queue. People can always write their own wrapper.

1/ Templated Return type

Add a compile time property that will change the return type of submit to return void.

 (void) q.submit(property_list={no_event})

Pro:

Cons:

1.1

Same as 1, but with a new API name (enqueue for example).

Pro:

2/ "Event-as-barrier"

Runtime or compile time property

 e = q.submit(property_list={hint::event_not_used})

Pro:

Cons:

Next Step:

Wait for implementers to implement and give feedback on their experience. Should we write a "formal" specification of the two approaches?

al42and commented 1 year ago

@al42and , @pszi1ard are you submitting your own kernel, or also using this optimization with library calls? (for example oneMKL

We submit both our own kernels and oneMKL operations (or other native libraries via interop) to the same queue. For example, we have:

// q and q2 are both in-order
q.submit(/*SpreadKernel*/);
mkl::dft::compute_forward(); // or native library via h.get_native_queue<...>(), but it already does not insert any events
q.submit(/*SolveKernel*/);
mkl::dft::compute_backward(); // or native library...
q.submit(/*GatherKernel*/);
sycl::event e = q.ext_oneapi_submit_barrier();
q2.ext_oneapi_submit_barrier({e}); // Or q.wait() if we need to send data via MPI to another rank

For hipSYCL, we use hipSYCL_enqueue_custom_operation both for barriers and interop.

Also, are you using this coarse-grain optimization together with profiling events?

We do have profiling mode, but it's turned on separately; having coarse-graining and profiling mutually exclusive is okay as long as runtime handles it gracefully.

In-app timings, currently, are used only for statistics; even if/when we use it for dynamic load balancing, we would be more interested in region-style measurements (e.g., the total time of all the operations in the example above).

In our extension, if you want to use per-submission properties, you have to use submit(). This substantially simplifies the API and implementation. The reasoning is that the shortcuts are for the simple cases when you don't want to type much ("simple things should be simple"), but when you start optimizing and adding a lot of properties to tweak behavior, you are no longer in the simple case and can afford typing a few additional characters :-)

That sounds reasonable.

1/ Templated Return type: "Complexification" of the mental model of SYCL (now a submit may or may not return an event)

IMHO, it's less complex than "the returned event may or may not be a weird lazily-evaluated barrier".

pszi1ard commented 1 year ago

We do have profiling mode, but it's turned on separately; having coarse-graining and profiling mutually exclusive is okay as long as runtime handles it gracefully.

In-app timings, currently, are used only for statistics; even if/when we use it for dynamic load balancing, we would be more interested in region-style measurements (e.g., the total time of all the operations in the example above).

Actually, I think mixing profiling needs with efficient scheduling is detrimental. Profiling is an activity with a very different scope and requirements than those of a production run (concurrent kernels vs. kernels run in isolation for reproducible measurements, etc.).

I think profiling should be set up as a separate step and not be coupled to the events used for expressing dependencies.

TApplencourt commented 1 year ago

Thanks for your replies!

To be clear, are you ok with mkl::dft::compute_forward(); creating/returning an event? Is the assumption that latency for external library calls is not an issue? Or, in a perfect world, would you like dft_compute_forward() to be latency-optimized?

IMHO, it's less complex than "the returned event may or may not be a weird lazily-evaluated barrier".

Fair point, let me edit the summary with that :)

al42and commented 1 year ago

To be clear, are you ok with mkl::dft::compute_forward(); creating/returning an event? Is the assumption that latency for external library calls is not an issue? Or, in a perfect world, would you like dft_compute_forward() to be latency-optimized?

We would like it to be optimized. But we have fewer external calls than custom kernels, so it's less critical.

Side note: Should we invite Celerity and HPX folks here?

TApplencourt commented 1 year ago

Side note: Should we invite Celerity and HPX folks here?

Oh yes, please! The more, the merrier. I invited my Kokkos and Occa friends already :)

masterleinad commented 1 year ago

Coming late to this, I agree that having a per-submission property is better than a per-queue property. In Kokkos, we often use out-of-order queues with submit_barrier, and we only really care about the events we get from submit_barrier. Thus, approach 1 sounds very reasonable (assuming we can still get the events from submit_barrier).

I agree with @al42and that the semantics in the first approach are clearer than in the second, where the event might imply a barrier (or not).

bd4 commented 1 year ago

Coming at this from the perspective of gtensor (https://github.com/wdmapp/gtensor), used by GENE (https://genecode.org). Gtensor emulates a library-state-style API like CUDA/HIP, so we never return events in its public API. We do use in-order queues (again, to be more CUDA-like) and internally call lots of oneMKL routines for gt-blas, gt-fft, etc., and could potentially benefit from avoiding event overhead (although I doubt we are submission-latency sensitive very often, if at all, for most workloads).

Where this affects gtensor/GENE is the viability of supporting it in oneMKL. I think, even more than in the general case, most oneMKL ops are big and not sensitive to submit latency. However, if that ever became an issue, (1) makes it challenging to support in oneMKL. I lean towards (2) for this reason - it leaves a much clearer path to library support, which seems like it may be important for some applications some of the time. I think many libraries do not have a natural single-function / multi-op advanced interface like "submit", so if you need to support different return semantics, you have a lot of work to make a second version of every operation. (1) feels "cleaner" in that it can catch misuse at compile time, but is it worth the loss of flexibility and the increase in library-support complexity?

Edit: I am pretty sure gtensor etc could use this as a per-queue property and does not need per-submit. We call e.wait in a few places but I believe all could be replaced with q.wait.

gmlueck commented 1 year ago

If we were to follow the principle that simple things should be simple, then I think the default behavior of the "submit" APIs would be to not return an event. Simple uses of SYCL do not need an event because you either get dependencies implicitly (via accessors) or you don't need them at all (in-order queues). It does not make sense to pay for the overhead of an event in these simple cases when it is not needed.

I don't really like either of the proposals above because you get the current behavior of creating an event by default unless you specify some new property. It would be better to have the simple behavior (no event) by default, requiring the user to type something extra only in the case when they want the event.

Obviously, we cannot change the semantic of queue::submit at this time because that would be a huge breaking API change. However, we can deprecate queue::submit and migrate code to a new API with a new name.

We can bikeshed the name later, but could we deprecate queue::submit in SYCL-Next and introduce some new API like queue::enqueue? The default behavior of this API would be to not return an event.

Clearly, we still need some way to get an event. Since this is a more advanced use case, we can expect the user to type more characters. For example, we could add a new API queue::enqueue_with_event for this purpose. (We could also overload queue::enqueue with a compile-time property similar to the previous proposal. However, it seems weird to me for the return type to change based on a property.)
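For illustration, usage might look like this (queue::enqueue and queue::enqueue_with_event are just the placeholder names from this comment, not existing API):

sycl::queue q;

// Common case: no event is created, so there is no event overhead to pay.
q.enqueue([&](sycl::handler &cgh) {
  cgh.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) { /* ... */ });
});

// Advanced case: the caller explicitly asks for an event by typing a bit more.
sycl::event e = q.enqueue_with_event([&](sycl::handler &cgh) {
  cgh.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) { /* ... */ });
});
e.wait();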

This still leaves the 34 queue shortcut functions defined in section 4.6.5.2 (e.g. queue::single_task, etc.) One option is to deprecate all of these also and introduce new shortcut functions with names like queue::enqueue_single_task. Another option is to just leave all the shortcut functions the way they are now (returning an event), which is similar to some of the proposals above.

I know there is concern expressed above about expanding the API surface of the spec too much. However, this proposal does not add new APIs in the long term. It does add some new APIs in the short term, but it also deprecates some existing APIs. In the long term, those deprecated APIs will be removed, so there is no net increase over the long term.

TApplencourt commented 1 year ago

I like the idea of a new API; it's the option that introduces the least change to the spec (the other solutions need to introduce both a new property_list parameter to submit and the associated property).

For the shortcut functions, I think it's ok to not have them. If people want to use advanced features, they can pay the price for it. Here we hit the point of contention: I do think that returning an event and an out-of-order queue are sane and simple defaults. But fortunately, we don't need to settle that argument here.

We should give users the possibility of not paying the price for events they don't use. What the default should be is, IMO, a separate follow-up topic.

I will edit my summary with your proposal, Greg.

AerialMantis commented 1 year ago

Out of the two options described, I would prefer option (1), as I worry option (2) introduces too many subtle changes to the runtime behavior that we would need to account for; scope of synchronization, profiling, interop, etc.

Looking at option (1) I like the compile-time property idea, but I agree with @gmlueck's comment that if we are to follow the principle of keeping simple things simple, ultimately, the default behavior should be that submit does not return an event and the user should opt into this. If this is the goal, then I think the clearest route to achieve this would be to introduce a new interface such as a new submit like API which doesn't return an event by default.

If we are changing the interface, I think we'd have a few options for how to handle events optionally. We could as @gmlueck suggested have an alternative function that returns an event or a compile-time property. Alternatively, we could also have overloads that take the event as a parameter.

Additionally, if we are considering changing the queue interface to address this problem, we may want to also consider introducing a new object rather than (or in addition to) a new member function. If we look at the use case for events, if you're using an in-order queue you generally don't need to use events, unless you're synchronizing between multiple in-order queues, though in this case something like submit_barrier would be sufficient. So we could introduce a new object to represent in-order queues; something like execution_stream, that's lower-level, in-order, and would not return events. This would also create a clearer distinction between the in-order and out-of-order use cases in the type system, and SYCL runtime semantics.
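As a rough sketch of what such an object might look like (purely hypothetical interface, shown here as a thin wrapper over an in-order queue):

#include <sycl/sycl.hpp>
#include <utility>

// Hypothetical execution_stream: lower-level, in-order, and event-free by
// default; events appear only at explicit synchronization points.
class execution_stream {
public:
  explicit execution_stream(const sycl::device &dev)
      : q_{dev, sycl::property::queue::in_order{}} {}

  template <typename CGF>
  void submit(CGF &&cgf) { // note the void return: no per-submission event
    q_.submit(std::forward<CGF>(cgf));
  }

  // Cross-stream ordering would go through an explicit barrier-like call;
  // ext_oneapi_submit_barrier (a DPC++ extension) is used only as a stand-in.
  sycl::event submit_barrier() { return q_.ext_oneapi_submit_barrier(); }

  void wait() { q_.wait(); }

private:
  sycl::queue q_;
};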

TApplencourt commented 1 year ago

I presented a summary of the current ideas at the SYCL WG. Slide: SYCL_Optimize_non_used_events.pdf

One possible plan of action was:

One question that came up is whether the event-as-barrier approach is implementable (in the highly unlikely case where events are used even though users promised that they would not be).

I will discuss below my naive "implementation" (@illuhad is smarter and has implemented something better in hipsycl).

Profiling

If profiling is enabled on the queue, the hint will be ignored.

"Regular" API call

Interop:

OpenCL

A SYCL event can encapsulate one or multiple OpenCL events, representing a number of dependencies in the same or different contexts that must be satisfied for the SYCL event to be complete.

"Luckyly/Weirdly": Nothing is said about the binding of events to the command's last SYCL command. So a get_native -> clEnqueueBarrierWithWaitList and return the event generated is conformant.

If we want to keep intact people's illusion that SYCL is just a super-thin wrapper on top of native backends, we can:

Cuda

A SYCL event created from a CUstream will encapsulate that event. A single CUevent is potentially retrieved from a SYCL event if there is a valid native CUevent associated with it, otherwise nullptr is returned instead. The CUDA backend-specific free function cuda::has_native_event can be used to query whether the SYCL event has a valid native CUevent associated with it.

Barrier then cuRegisterEvent, or just have has_native_event return false.

gmlueck commented 1 year ago

What is the behavior of handler::depends_on for one of these "non-used" events?

TApplencourt commented 1 year ago

Submit a barrier, wait on it, and then submit the depends_on kernel. This mimics the previous kernel (the kernel that generated the event we now depend on) having already finished (for DPC++ with the L0 backend it's "kind of" similar to ZE_SERIALIZE=2). This works because one needs to submit the DAG of kernels in a valid order.
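Spelled out, that fallback looks roughly like this (illustrative pseudo-implementation, not a proposal; ext_oneapi_submit_barrier is the DPC++ extension mentioned earlier):

#include <sycl/sycl.hpp>
#include <utility>

// Illustrative only: how depends_on() on a "barrier style" event could be
// honored in the worst case, by draining the producing queue before the
// dependent command group is submitted.
template <typename CGF>
void submit_with_barrier_dependency(sycl::queue &consumer,
                                    sycl::queue &producer, CGF &&cgf) {
  // The producing kernel is already enqueued (DAGs are submitted in a valid
  // order), so a barrier submitted now covers it.
  sycl::event barrier = producer.ext_oneapi_submit_barrier();
  barrier.wait(); // now "the previous kernel is already finished"
  consumer.submit(std::forward<CGF>(cgf));
}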

nscottnichols commented 1 year ago

FWIW, I saw a 33% speedup on end-to-end runtime for my analytic continuation application when disabling events with the sycl::ext::oneapi::property::queue::discard_events property.

Pennycook commented 11 months ago

Thomas asked me to provide a quick update here, regarding an alternative proposal we've been sketching in https://github.com/intel/llvm/pull/11491. The basic idea is that it uses free functions (with new names) which do not return events, and any developer that wants an event has to explicitly opt-in through the creation of a handler.

// Default: Do not wait for any events, do not create any events
sycl::parallel_for(q, ...);

// Wait for events
sycl::submit(q, [&](sycl::handler& h) {
  h.depends_on(...);
  sycl::parallel_for(h, ...);
});

// Wait for events and create an event
sycl::event e = sycl::submit_with_event(q, [&](sycl::handler& h) {
  h.depends_on(...);
  sycl::parallel_for(h, ...);
});

al42and commented 9 months ago

On the topic of per-submission properties, here is another, unrelated, use-case: https://github.com/unisa-hpc/SYnergy/blob/60f66bb769d4a67f83b7c2c53d2fd353e2442f9e/include/queue.hpp#L100-L112.

It is, currently, a special class with queue-like semantics. But one can easily imagine it becoming a part of a SYCL implementation. Then, being able to pass a property_list to queue::submit would lead to a much cleaner and more composable API than tons of different overloads / variants (submit with event and with default performance hints, submit with event and with custom frequencies, submit without an event and with default frequencies, ...). queue::submit, with two nested lambdas, is already a cognitive burden to read. Spicing it up with macros to call the correct version based on which features the compiler supports and which features are needed here and now will not make things better. In this respect, property_list looks even more appealing to me.

Pennycook commented 9 months ago

Then, being able to pass a property_list to queue::submit would lead to a much cleaner and more composable API than tons of different overloads / variants (submit with event and with default performance hints, submit with event and with custom frequencies, submit without an event and with default frequencies, ...).

The proposal from https://github.com/intel/llvm/pull/11491 already supports per-submission properties.

Building on my example above, these frequency properties could be added like so:

auto properties = sycl::ext::properties(sycl::ext::uncore_frequency(x), sycl::ext::core_frequency(y));

// Default: Do not wait for any events, do not create any events
sycl::parallel_for(q, sycl::launch_config(r, properties), ...);

// Wait for events
sycl::submit(q, [&](sycl::handler& h) {
  h.depends_on(...);
  sycl::parallel_for(h, sycl::launch_config(r, properties), ...);
});

// Wait for events and create an event
sycl::event e = sycl::submit_with_event(q, [&](sycl::handler& h) {
  h.depends_on(...);
  sycl::parallel_for(h, sycl::launch_config(r, properties), ...);
});

Do you think we need something else?

al42and commented 9 months ago

Do you think we need something else?

No, I was just mentioning it here as a tidbit relevant to the overall discussion (which has two parts to it: how to optimize events, and whether to add per-submission property_list) :)

As I understand, sycl_ext_oneapi_enqueue_functions is, for now, just an extension, so how it will be implemented in the standard is still not fully settled.

masterleinad commented 5 months ago

Thomas asked me to provide a quick update here, regarding an alternative proposal we've been sketching in intel/llvm#11491. The basic idea is that it uses free functions (with new names) which do not return events, and any developer that wants an event has to explicitly opt-in through the creation of a handler.

Is this implemented anywhere now?

Pennycook commented 5 months ago

Is this implemented anywhere now?

A "true" implementation of these interfaces (i.e., one that removes the overheads associated with creating events) is still a work-in-progress. But if you just want to experiment with the new syntax, the University of Bristol wrote a small shim layer: https://github.com/UoB-HPC/BabelStream/blob/268315376390819147c7a8447d50a89d3220de0e/src/sycl_ext_enqueue_functions.h