clEnqueueMarkerWithWaitList output event handling is ambiguous

KhronosGroup / OpenCL-Docs

OpenCL API, OpenCL C, Extensions, SPIR-V Environment Specs, Ref page, and C++ for OpenCL doc sources.

Other

359 stars 112 forks source link

clEnqueueMarkerWithWaitList output event handling is ambiguous #1067

Open sanguinariojoe opened 8 months ago

sanguinariojoe commented 8 months ago

The issue actually started here, please take a look to the simple reproducer code.

The code does the following:

Create an user event and set it as CL_COMPLETE, called wait_event
Create a marker synced with that wait_event, called trigger
Create a third event to be set as complete by a callback, called finish
Set a callback on trigger, with the function that will set finish as CL_COMPLETE
Wait for finish

According to the documentation, it is unclear if the callback will be called, or it is required to call clFlush(). Indeed, on clEnqueueMarkerWithWaitList() ref page it says:

"The marker command either waits for a list of events to complete, or if the list is empty it waits for all commands previously enqueued in command_queue to complete before it completes."

Which says that the "dependencies" shall complete before the marker completes, but do not state anything about when that will happen, i.e. will that implicitly happen or a clFlush() call is required?. It also says that

"This command returns an event which can be waited on, i.e. this event can be waited on to insure that all events either in the event_wait_list or all previously enqueued commands, queued before this command to command_queue, have completed."

Which clearly indicates that the event can be used on a clWaitForEvents() call, but it is not stating nothing about non blocking operations (like callbacks or enqueues)

Finally, on clFlush() ref page it says:

"To use event objects that refer to commands enqueued in a command-queue as event objects to wait on by commands enqueued in a different command-queue, the application must call a clFlush or any blocking commands that perform an implicit flush of the command-queue where the commands that refer to these event objects are enqueued."

Which clearly states that clFlush() is required if it involves several queues, but unfortunately is not addressing this particular situation.

P.S. I suppose the same can be applied to clEnqueueBarrierWithWaitList()

bashbaug commented 8 months ago

Hi! Thanks for filing this issue and including an excellent reproducer.

As I understand, here is what the reproducer is doing, and the behavior you are seeing:

There is a call to clEnqueueMarkerWithWaitList() with an event dependency on a completed event (event_wait).
There is a callback on the completion event for clEnqueueMarkerWithWaitList() (trigger).
The callback on the completion event (trigger) sets a different completion event (finish).
There are no explicit calls to flush or finish the command-queue passed to clEnqueueMarkerWithWaitList.
There is a call to wait on the completion event set by the callback (finish), but there are no calls to wait on any other events.
The call to clWaitForEvents to wait on the completion event set by the callback (finish) is never returning on some implementations. In other words, it's a deadlock.

If I have this correct, I believe the behavior you are seeing is valid according to the spec, specifically because of (4) and (5). Some relevant portions of the spec are:

In Execution Model:

Queued: The command is enqueued to a command-queue. A command may reside in the queue until it is flushed either explicitly (a call to clFlush) or implicitly by some other command.

In clSetEventCallback:

Because commands in a command-queue are not required to begin execution until the command-queue is flushed, callbacks that enqueue commands on a command-queue should either call clFlush on the queue before returning, or arrange for the command-queue to be flushed later.

Also, the paragraph in clFlush that you referenced above, and the paragraph immediately before it. Note, the call to clWaitForEvents in (6) has no linkage to the command-queue where clEnqueueMarkerWithWaitList was enqueued because finish is a user event set in the callback, so it does not necessarily implicitly flush the command-queue:

Any blocking commands queued in a command-queue and clReleaseCommandQueue perform an implicit flush of the command-queue. These blocking commands are clEnqueueReadBuffer, clEnqueueReadBufferRect, clEnqueueReadImage, with blocking_read set to CL_TRUE; clEnqueueWriteBuffer, clEnqueueWriteBufferRect, clEnqueueWriteImage with blocking_write set to CL_TRUE; clEnqueueMapBuffer, clEnqueueMapImage with blocking_map set to CL_TRUE; clEnqueueSVMMemcpy with blocking_copy set to CL_TRUE; clEnqueueSVMMap with blocking_map set to CL_TRUE or clWaitForEvents.

To use event objects that refer to commands enqueued in a command-queue as event objects to wait on by commands enqueued in a different command-queue, the application must call a clFlush or any blocking commands that perform an implicit flush of the command-queue where the commands that refer to these event objects are enqueued.

That said, I'm all for adding clarifications since this can be confusing. One obvious upgrade would be to augment the paragraph in clFlush because clearly an explicit clFlush is required in other scenarios beyond just the case where an event object is a dependency in a different command-queue. Do you have any other suggestions for improvement?

PS: I think this applies to any non-blocking enqueue command, not just clEnqueueBarrierWithWaitList()!

karolherbst commented 8 months ago

one thing I want to add here: ROCm, Intel NEO and pocl all seem to not dead lock in that example and the main reason I asked the user to report this here was that even if rusticl behaves correct to the spec, it's kinda pointless if applications generally expect different behavior if other implementations do implicitly flush the marker (or whatever they are doing).

So if applications do already rely on the code to work, the spec could also just require implementations to do so. My only concern with that is, that in an out-of-order queue this is quite harmless, but in an in-order queue that would mean flushing the entire queue. So I'm not quite sure what would be the best course of action here.

karolherbst commented 8 months ago

To use event objects that refer to commands enqueued in a command-queue as event objects to wait on by commands enqueued in a different command-queue, the application must call a clFlush or any blocking commands that perform an implicit flush of the command-queue where the commands that refer to these event objects are enqueued.

I think we also have applications in the wild not flushing those events, e.g. I had to fix some cross queue deadlocks specifically for davinci resolve, which I'm not quite sure the spec actually requires the runtime to implicitly flush here.

So overall I'm a concerned that reality just doesn't match the spec here and that OpenCL runtimes already work around "broken" applications by implicitly flushing in some cases where it's not strictly required by the spec.

bashbaug commented 8 months ago

So overall I'm a concerned that reality just doesn't match the spec here and that OpenCL runtimes already work around "broken" applications by implicitly flushing in some cases where it's not strictly required by the spec.

I was curious how this works with NEO and it appears one of two things is happening. Either:

We are executing in an "eager" mode, in which case the command-queue is more-or-less flushed immediately whenever it gets a command, or...
We are executing in a "lazy" batched mode, in which case we will flush the command-queue when a callback is attached to an event that has not been flushed yet (link).

So, we're either in the "eager" mode and no workaround is required, or we implicitly flush to have the same behavior as-if we were in the "eager" mode, which could be seen as working around "broken" applications, albeit with a possible performance cost due to the implicit flush.

Should we say that clSetEventCallback may (must?) perform an implicit flush, such that the event will eventually complete, assuming there are no unsatisfied dependencies?

sanguinariojoe commented 8 months ago

My two cents:

Because commands in a command-queue are not required to begin execution until the command-queue is flushed, callbacks that enqueue commands on a command-queue should either call clFlush on the queue before returning, or arrange for the command-queue to be flushed later.

This is not applying because my callback is not enqueuing commands, unless clSetUserEventStatus() is treated as so (which I would say is not the case according to the spec).

Queued: The command is enqueued to a command-queue. A command may reside in the queue until it is flushed either explicitly (a call to clFlush) or implicitly by some other command.

This is controversial, because I am calling clWaitForEvents() after clSetEventCallback(). And as @bashbaug has pointed out clWaitForEvents() is a blocking call which is implicitly flushing.

Is it a possibility that the problem are the user events? They are created on contexts rather than command queues... And if that is the case, is it a bug (so all the command queues attached to the context shall be flushed)? Or is it a specification problem?

karolherbst commented 8 months ago

My two cents:

Because commands in a command-queue are not required to begin execution until the command-queue is flushed, callbacks that enqueue commands on a command-queue should either call clFlush on the queue before returning, or arrange for the command-queue to be flushed later.

This is not applying because my callback is not enqueuing commands, unless clSetUserEventStatus() is treated as so (which I would say is not the case according to the spec).

Queued: The command is enqueued to a command-queue. A command may reside in the queue until it is flushed either explicitly (a call to clFlush) or implicitly by some other command.

This is controversial, because I am calling clWaitForEvents() after clSetEventCallback(). And as @bashbaug has pointed out clWaitForEvents() is a blocking call which is implicitly flushing.

Is it a possibility that the problem are the user events? They are created on contexts rather than command queues... And if that is the case, is it a bug (so all the command queues attached to the context shall be flushed)? Or is it a specification problem?

User events can only be signaled by the application, so the concept of queue don't really apply to them, just that events depending on user events might cause stalls or dead-locks on that queue if there are application bugs. The issue are the conditions leading to the application not signaling those user events. And we just need to figure out if markers and barriers shall be implicitly flushed. Maybe we also just need to explicitly define events not having any work attacked (like markers and barriers) and state they complete independent from the flushing mechanism as long as their own condition is fulfilled and their deps (for an in-order queue that would mean any previous event as well) did complete.

So overall I'm a concerned that reality just doesn't match the spec here and that OpenCL runtimes already work around "broken" applications by implicitly flushing in some cases where it's not strictly required by the spec.

I was curious how this works with NEO and it appears one of two things is happening. Either:
1. We are executing in an "eager" mode, in which case the command-queue is more-or-less flushed immediately whenever it gets a command, or...

2. We are executing in a "lazy" batched mode, in which case we will flush the command-queue when a callback is attached to an event that has not been flushed yet ([link](https://github.com/intel/compute-runtime/blob/7729eb8127eea5e37cdd93a93e6877ecc86d88bd/opencl/source/api/api.cpp#L2262)).
So, we're either in the "eager" mode and no workaround is required, or we implicitly flush to have the same behavior as-if we were in the "eager" mode, which could be seen as working around "broken" applications, albeit with a possible performance cost due to the implicit flush.

yeah, my point is that I don't think this behavior is strictly required by the spec and I could read this as "some application required this, so we just added this code". Maybe there are good reasons to do so regardless as it might increase throughput when the runtime determined the queue to be idle.

Should we say that clSetEventCallback may (must?) perform an implicit flush, such that the event will eventually complete, assuming there are no unsatisfied dependencies?

I think this could lead to a lot of implicit flushing if applications attach callbacks for other purposes than signalling user events. Like.. what if an application wants to do some logging based on this? But then again, if most runtimes already do something like that, might as well just specify it, so implementations do not diverge too much here.

It's not really feasible to specify something runtimes may do, if 90% of them do it and applications rely on that behavior leaving a few runtimes being "technically correct", but having to deal with broken applications, so it should be a must do.

bashbaug commented 8 months ago

I was thinking about this over the weekend but I didn't have any epiphanies. It's a tricky problem - as long we have implementation-defined flushing behavior there are going to be some implementations that flush and are fine, and others that don't and deadlock. I think this means we either need to live with the "potential deadlock" behavior, or we need to require flushing in more scenarios, thereby making the flushing behavior no longer implementation-defined.

I suppose we could also consider adding queries so an application could programmatically determine the implementation's behavior, but I'm wary of adding more complexity since it's already hard enough to get this "right".

So, what do we want to do here?

Maybe a good place to start is: If we did want to require flushing in more scenarios, what scenarios should we consider? The NEO behavior where clSetEventCallback requires an implicit flush is one possibility, but I'm sure it's not the only one...

sanguinariojoe commented 8 months ago

I really think flushes are OK where they are. More implicit flushing seems more like a performance related topic than a requirement.

The solution I suggest is clearly specifying what shall be flushed on ALL the blocking operations (for instance, I am missing clFinish() on the clFlush() list). Along this line, I really think the implicit flush shall be done on the command queue passed as parameter (if any), as well as the command queues associated to the events on the wait list. I would say that is actually what is happening now on almost every implementation.

Now the two open questions...

What about user events? I think the safest way is flushing all the command queues on the associated context. That is for sure fixing this issue. Do you you foresee problems on any other case?

What about multiple contexts? e.g. I can ask a callback to be called on context A, which is setting as complete an event created on a different context, B. In my opinion, if the user just calls to clWaitForEvents on the context B, without flushing context A, it is legit to end up on a deadlock.

P.S. If flushing the command queues on the user event context is not acceptable, then I think the way to go is clearly documenting this special behavior of the user events, and the eventual pitfalls.

karolherbst commented 8 months ago

I wonder if we want to add a CL_QUEUE_NO_IMPLICIT_FLUSHING cl_queue_properties so applications can verify that they don't rely on implicit flushing?

It might also allow to write some CTS tests to verify that implementations behave the same if not implicitly flushing anything?

MathiasMagnus commented 3 months ago

I wonder if we want to add a CL_QUEUE_NO_IMPLICIT_FLUSHING cl_queue_properties so applications can verify that they don't rely on implicit flushing?

I would like to second that such a property would be nice. Just after Yesterday's WG discussions I came across a related, though different issue. It's not strictly a correctness issue, just more noisy than perhaps necessary. Consider the following:

Enqueue a buffer for mapping (map_ev)
Create a user event (copy_ev)
Set a callback on map_ev which asynchronously launches a func with a copy into the mapped pointer and a flip of copy_ev once done.
Enqueue unmapping said buffer (unmap_ev) which waits on copy_ev.

Currently, this idiom isn't guaranteed to function properly, because nothing prohibits the map to complete before the callback setup on map_ev is completed. It seems silly to have the initial map wait on a user event solely to avoid the icd performing an implicit flush and potentially completing map_ev before the program gets a chance to create a user event and setup the callback (which may be an arbitrarily expensive operation, but easily comparable to a buffer map).

Kerilk commented 3 months ago

Disclaimer, I think @karolherbst idea is very good.

@MathiasMagnus My understanding was that in your case the callbacks should be called in all scenarios irrespective of the race condition:

Map finishes after unmapping is enqueued
- map finishes, signals map_ev
- callback executes, signals copy_ev
- unmap executes, signals unmap_ev
Map finishes before before callback is attached
- map finishes, signals map_ev
- callback is attached, and executes immediately since the COMPLETE state has been reached ( in my experience the implementation uses the current thread to do so, could be on another thread though)
- then, the unmap operation is enqueued and executes

so from my perspective the only thing that is not guaranteed in your case is parallelism. It should execute fine irrespective of the scheduling.

If you don't observe this behavior, then I would say the implementation is wrong.

Reading the specification (https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/clSetEventCallback.html), maybe the description should be improved.

If my understanding is wrong, there still is a way to make it work according to how the spec is currently worded:

Enqueue a buffer for mapping (map_ev)
Create a user event (copy_ev)
Set a callback on map_ev which asynchronously launches a func with a copy into the mapped pointer and a flip of copy_ev once done.
release map_ev
Enqueue unmapping said buffer (unmap_ev) which waits on copy_ev.

All callbacks registered for an event object must be called before the event object is destroyed.

Would guarantee the behavior, but I still think that callabcks should be called if the event is past it's status. The above is just to guarantee implementation do not prematurely destroy an object without calling all callbacks. Sorry for derailing the thread.

MathiasMagnus commented 3 months ago

@Kerilk Indeed, the spec wording clearly states, that the callback is called when states transition. If the intention was that callbacks also trigger if the event is already past the status it's registered for, that should be added. (If I were an implementer, based on this description, I'd only call callbacks on transitions.)

I haven't done testing if there are faulty ICDs, especially it's a race probably not trivially triggered.

Kerilk commented 3 months ago

I haven't done testing if there are faulty ICDs, especially it's a race probably not trivially triggered.

I have seen this triggered on Intel GPUs, and the behavior was that the callback would be called immediately, before the function attaching the callback returned.

bashbaug commented 3 months ago

I still think that callabcks should be called if the event is past it's status.

I agree. Without this behavior it seems like you'd either be in a race to set the callback before the event transitions to the callback state (risky), or you'd need to hold back execution via e.g. a user event until the callback has been set (overkill in many situations).

Is this something we can fix?

Maybe a good next step is to fork this discussion into a separate issue, since this seems to be a different (albeit related) question.

karolherbst commented 3 months ago

yeah, I was wondering about this situation as well, because I was looking at code related to it and in the past I implement it as "call the callback immediately when the event has already reached the status", but it's not actually required by the spec. There is just this innocent looking statement: "All callbacks registered for an event object must be called before the event object is destroyed."

So I didn't think it's worth bringing up, because runtimes have to call it at some point anyway. The question then just is, should the spec require it to be done at clSetEventCallback time or whenever the runtime things it's right.

Maybe "All callbacks registered for an event object must be called before the event object is destroyed." should be changed to "All callbacks registered for an event object, including for already reached command_exec_callback_type, must be called before the event object is destroyed." or something along those lines.