[Roadmap Feedback] Add Stages for Layout Transitions, Queue Family Ownership Transfer Acquire/Release and Semaphore Signal

devshgraphicsprogramming commented 8 months ago

Problem statement:

Current design seems to treat Synchronization Commands as-if they do no memory writes: https://github.com/KhronosGroup/Vulkan-Docs/issues/302#issuecomment-236371674

The fact that NONE doesn't really mean NONE https://github.com/KhronosGroup/Vulkan-Docs/issues/2049

and you can and should have weird NONE -> ALL_COMMANDS dependencies between a layout transition and a semaphore signal are incredibly confusing and cause copious amounts of head-scratching.

QFOT just takes it to a new level #1036 & #2319

Use Case Example(s):

Synchronizing two Pipeline Barriers that do a layout transition (I know its redundant and useless) without any commands inbetween. https://github.com/KhronosGroup/Vulkan-Docs/issues/302#issuecomment-236460442

Synchronizing a Queue Family Ownership Release/Acquire with a Signal/Acquire of Sync2 Semaphore with Stage Mask

(Optional) Suggested Solution(s) (via opening an MR on vulkan-docs repo and creating a Proposal Document) :

Actually adding a stage for Layout Transition, Queue Family Ownership Transfer Acquire/Release and Semaphore Signal.

This has been thought about before: https://github.com/KhronosGroup/Vulkan-Docs/issues/302#issuecomment-236451219

They need not have any implicit ordering w.r.t. any other stage., but could be specced to have.

Tobski commented 7 months ago

Synchronizing a Queue Family Ownership Release/Acquire with a Signal/Acquire of Sync2 Semaphore with Stage Mask

Can you elaborate on what this use case is?

devshgraphicsprogramming commented 7 months ago

Synchronizing a Queue Family Ownership Release/Acquire with a Signal/Acquire of Sync2 Semaphore with Stage Mask

Can you elaborate on what this use case is?

                      /-------------------> Transition ---------------->---------\
                     /                                                            \
some work --> RELEASE -this is the problem-> SIGNAL --> WAIT -this is the problem-> ACQUIRE --> some more work

(ofc the signal and wait are submits on different queue families)

Basically right now its really murky how to set the dst stage mask on a Release and a mask on a Timeline Semaphore signal (and Acquire source mask and timeline Semaphore wait mask).

The reason is that the spec says the QFOT could perform memory writes, and they are guaranteed to happen between the Release and Acquire (as does the layout transition), however I need to make sure that the Acquire doesn't start before the Release is done.

Ofc we can sync via ALL_COMMANDS, but do we really want to? Especially since there's no meaningful work that could occur on the resource between the Ownership Transfer operation and the Semaphore operation, so what do I put if I don't want a full pipeline stall?

Especially since after the release, we can't use the resource within the same submit for anything.

Tobski commented 7 months ago

Thanks, that elaboration helped a lot.

This would've been a lot simpler if we had let the unused stage masks be meaningful (oops) because then you could just set the stages the same on both sides of the barrier. But as things stand, those parameters are ignored.

This isn't something that I think we'd consider urgent to fix, just on the basis that in practice everyone is more or less doing a full pipeline stall for semaphores anyway, but there are exceptions to that, so we should try to fix it.

devshgraphicsprogramming commented 7 months ago

This would've been a lot simpler if we had let the unused stage masks be meaningful (oops) because then you could just set the stages the same on both sides of the barrier. But as things stand, those parameters are ignored.

AFAIK, its only the ACCESS flags that are ignored, the stage masks are still meaningful.

Twas a typo or something cleared up in #1036 where the spec says stage, and it was changed to access, so now it means that stage is not ignored anymore.

Tobski commented 7 months ago

No the stage masks are definitely ignored too... it says it quite explicitly in the spec still? At least for sync2?

Tobski commented 7 months ago

From https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VkBufferMemoryBarrier2KHR.html:

If buffer was created with VK_SHARING_MODE_EXCLUSIVE, and srcQueueFamilyIndex is not equal to dstQueueFamilyIndex, this memory barrier defines a queue family ownership transfer operation. When executed on a queue in the family identified by srcQueueFamilyIndex, this barrier defines a queue family release operation for the specified buffer range, and the second synchronization and access scopes do not synchronize operations on that queue. When executed on a queue in the family identified by dstQueueFamilyIndex, this barrier defines a queue family acquire operation for the specified buffer range, and the first synchronization and access scopes do not synchronize operations on that queue.

devshgraphicsprogramming commented 7 months ago

the second synchronization and access scopes do not synchronize operations on that queue.

Nor the destination queue?

krOoze commented 7 months ago

@Tobski Synchronization without scopes wouldn't make sense, and make it impossible to meet the "An application must ensure that these operations occur in the correct order by defining an execution dependency between them". Without synchronization scopes, there's no execution dependency. It wouldn't awfully make sense even since sync2 also introduces semaphore signal stages (and the semaphore therefore is not wholesale), so it actually makes it sensitive to the stage.

@devshgraphicsprogramming Pipeline barriers operate only on one\its queue, so that's given. It analogously says the same nonsense for the 1st scope of the acquire barrier (for the destination queue).

Tobski commented 7 months ago

Nor the destination queue?

I think this is covered by the general idea that you can't execute something on a queue that isn't the one you're submitting to.

Tobski commented 7 months ago

@Tobski Synchronization without scopes wouldn't make sense, and make it impossible to meet the "An application must ensure that these operations occur in the correct order by defining an execution dependency between them". Without synchronization scopes, there's no execution dependency. It wouldn't awfully make sense even since sync2 also introduces semaphore signal stages (and the semaphore therefore is not wholesale), so it actually makes it sensitive to the stage.

I'm not sure what you're responding to here? Can you elaborate on or quote what it was I said that this is in response to?

krOoze commented 7 months ago

@Tobski To you and the spec saying synchronization scopes are ignored in sync2.

That is nonsensical\bug, because the QFOT chapter says Release Op happens-before the second synchronization scope. (Implying there indeed is one)

and the second synchronization and access scopes do not synchronize operations on that queue.

vs

The release operation happens-after the availability operation, and happens-before operations specified in the second synchronization scope of the calling command.

Tobski commented 7 months ago

That's a throwback to sync1, the "calling command" doesn't have a synchronization scope in sync2. Definitely something to be fixed, but CTS is written to the explicit words of the first quote there not the implications of the second bit, so if something gets changed in the current spec, it'll be the QFOT section.

krOoze commented 7 months ago

What? Every synchronization command has synchronization scope(s). The whole Synchronization chapter and Execution and Memory Dependencies section is build around that concept...

Tobski commented 7 months ago

vkCmdPipelineBarrier has scopes laid out in its parameters, vkCmdPipelineBarrier2 doesn't have scopes directly - they're attached to the memory barrier structures. The way the spec is written is clearly contradictory so I'm not going to read into what it does mean here, merely noting that the language written in that manner originates from the way vkCmdPipelineBarrier worked.

krOoze commented 7 months ago

That's unnecessary semantics, imo. All sync has a command somewhere, otherwise it would never get executed. That is not really the problematic part...

The ignoring of stages (resp. "synchronization scopes") seem like regression bug. It does seem it is still entrenched in some authors mindset and was reintroduced with sync2. What it does mean imo, it is simply wrong as it was wrong in #1036. Since then it looked selfconsistent enough until the problem was reintroduced. The execution dependency of the sync command (or barrier parameters or whatever) is necessary to exist in order to sequence it with semaphore ops considering they have pWaitDstStageMask and pSignalSemaphoreInfos->stage.

Tobski commented 7 months ago

Yea agreed, I was just trying to call out the history and wires got crossed. No matter.

Anyway, this was a deliberate choice for sync2 because in sync1 many developers were already ignoring it, because in large production engines it's hard to keep track of these things - so we made that a feature of sync2.

It is indeed a regression in terms of semaphore matching though, so we should definitely look to address this somehow. I expect we could slot a "proper" fix into a maintenance extension; too many implementations exist in the wild to back this out now without a new extension.

Tobski commented 7 months ago

(Specifically, engines had a hard time matching the sync scopes, which was originally required, we could've removed the matching and kept the semantic meaning within the queue, which is probably how I'd opt to fix this in a maintenance extension)

krOoze commented 7 months ago

I am not fan of extensions while keeping the core in unstable state. Extenstions should be build on solid foundations, not compensating for them.

Feels like obvious spec bug, and therefore imo fair game to fix in place. Spec already says the app is solely responsible for providing execution dependency here (in the QFOT chapter as well as in VU). The Valid Usage does say the stages must be valid. These apps are already in invalid state, and some of them would remain in invalid state after the fix. Even if they were misled to believe that sync2 barrier performs no synchronization, it would only mean the app failed to meet spec demand to order QFOT ops, and therefore already enters undefined behavior by violating API contract.

I don't object to automatic\implicit synchronization if the spec does actually says so it is happening (which it doesn't say). I just don't want it to become same deal as with the damn swapchain semaphore hole that remains to be unfixed in unextended Vulkan for 8 years...

krOoze commented 7 months ago

PS: It is bit of a metaissue that none of those implementers of shipped products found it weird. That indicates some kind of dysfunction. Not sure of what nature though. Perhaps things are settled too much amongst insiders personally, rather than through the literal reading of the specification?

devshgraphicsprogramming commented 7 months ago

Nor the destination queue?

I think this is covered by the general idea that you can't execute something on a queue that isn't the one you're submitting to.

the whole thing was so weird to me, I started entertaining this thought because the former option didn't make much sense to me either (you shouldn't need to synchronize operations on the same queue either, just the semaphore signal, because you can't do meaningful operations on a resource you don't have ownership of).

Tobski commented 7 months ago

Feels like obvious spec bug, and therefore imo fair game to fix in place.

We don't generally fix things in place like this - there are several implementations in the wild that will never be updated. Anyone trying to run code on those systems expecting this to work the new way will likely just fall over in unexpected ways. We thus need something to delimit when you can actually rely on it, hence a maintenance extension.

Having said that, as we need to create CTS tests anyway for this new requirement, we can do that and see if everyone just happens to pass already - and if so then we can fix it in place.

Even if we do defer the "proper" fix til a maintenance extension, we'd still update the spec so it's not contradicting itself; it will just note the issue and how best to work around it.

PS: It is bit of a metaissue that none of those implementers of shipped products found it weird. That indicates some kind of dysfunction. Not sure of what nature though. Perhaps things are settled too much amongst insiders personally, rather than through the literal reading of the specification?

...or maybe it is just a bit of a corner case that nobody happened to notice at because synchronization is a complex topic? Sometimes humans make mistakes. This issue only just got noticed and sync2 is about 3 years old at this point. We're constantly improving our process to catch this kind of thing, but sometimes things are going to slip through.

krOoze commented 7 months ago

@Tobski

we'd still update the spec so it's not contradicting itself;

Yes, thank you. That is what I mean. Currently the specification gives zero ways to do it, therefore if we follow the letter of the specification those implementations you worry about are already in undefined behavior land. If there is supposed to actually be automatic synchronization, then that is kinda big change.

The reason I mention it is because this correction didn't happen in the other Issue I mentioned. Khronos "fixed" it in extension, but still left zero paths to do things in the core spec. We were thus devolved from rule-of-law of the specification, to "I guess it works on my particular PC, so the code must be ok".

It is not a corner case really. It is the main case, assuming you want to use sync2 and more queues. You are having it kinda both ways: either it is obscure corner case, or there are many mainstream implementations needing that case.

Meta:

I do hope the process is improving. There's a risk the group might be blind to the actual underlying cause though, IDK.

Someone put something incomplete in the spec regarding mainstream feature, and nobody noticed, including the driver makers that are supposed to put it in code (of which there are many independent teams). Many apps you say gone with very specific way to implement things not supported by the specification, and the way is contrary to the way things were supposed to be done previously. There's some larger pattern to why this is, than trivial oopsie.

The problem is if I imagine myself a driver developer, I would be like "I don't know how to implement this". If I was a validation layer implementer I would be like "I don't know how to check this". If I was app developer I would be like "I don't know how to conform to this". So to me it looks like under normal circumstances the problem is nearly impossible to slip in and stay unaddressed. So the question is why would it slip in anyway. I don't pretend to know exactly what the problem is, but one option is things are endemically implemented by word of mouth of the insiders, and the specification is only rarely read in the process?

Tobski commented 7 months ago

@krOoze the spec gets followed as best it can be, but generally when a new feature is implemented, pre-existing text in a place that we didn't notice can get left unedited, and implementers will generally focus on the changes. Reviewers do their best to consider interactions, but sometimes things get missed. If this had been easy to spot it's kind of wild that it took 3 years to surface. The reality is that most apps have treated semaphore signalling as a full pipeline flush, and most vendors are still implementing it that way (on Windows it's more or less forced by the OS) due to that; which is probably why nobody noticed the issue.

Meta: I assure you there's no grand "word of mouth" conspiracy here, we don't write the spec just to confuse people, it's actually what we write the implementations to. We make a significant investment into testing, validation, and process improvements so that these things actually stick. You can see evidence of this if you look at the fact that we now have actual public design documents, in the amount of tests written, and in the continued improvements in the Validation layers.

...although I am an insider so who knows, maybe I'm lying to maintain the status quo 😉

krOoze commented 7 months ago

@Tobski Am I missing some other added language? The newly added text in no way implies it is automatically synchronized. Quite the opposite, says less synchronization is being made by the barrier. There's nothing to follow, "bestly" or otherwise...

If I understand it, you want sync2 to perform automatic\implicit synchronization, but the specification does not seem to indicate in any form that it newly adds any such guarantee. The problem doesn't seem to be with the pre-existing text. The problem is the newly added text does not seem to align with authors intent then.

sync2 offers more semaphore parametrization, not less. The stage parametrization is there all the way from 1.0. So I don't think that is how that could make 99 % people to arrive at different interpretation and have no doubts\questions.

I don't mean conspiracy. If you think about it though, how would large amount of people arrive with utmost confidence at the conclusion there is automatic synchronization, when the specification does not say even remotely any such thing? If it is not in the specification, then the thought must have come from somewhere else (reference implementation, conference calls, insider consultants, groupthink, etc).

devshgraphicsprogramming commented 7 months ago

If I was app developer I would be like "I don't know how to conform to this".

Sometimes it seems like we're the in the minority of those willing to admit "I don't know" publicly and raise issues (they do take time out of your working day), most Vulkan devs I know just "code and pray it works" when it comes to complex niche topics. Most people I talked about this to, pretty much said "just use ALL_COMMANDS_BIT then" and didn't want to investigate further.

Another example I know is "can I delete an exported object via a Win32 HANDLE in the parent API and still use the imported one validly if I incremented the Shared Handle refcount? " nobody knows the answer really here. But thats a Win32 issue + lack of any spec, rather than Vulkan spec stuff.

Tobski commented 6 months ago

@Tobski Am I missing some other added language? The newly added text in no way implies it is automatically synchronized. Quite the opposite, says less synchronization is being made by the barrier. There's nothing to follow, "bestly" or otherwise...

I'm... not sure why you think that's what I was saying; I was commenting on how we got to the current inconsistency. Not on any supposed changes we'd made yet.

However, discussion has progressed on this issue in the WG, and an update should go out this week. We've ended up having to make things consistent between sync1 and sync2 in the direction of the extra sync scope not applying to either. We never tested anything other than ALL_COMMANDS for sync1, so we have no idea if other stages work (which means it probably doesn't). We'll keep this issue open until we "properly" resolve this in a future maintenance extension (which should make it into the roadmap and core eventually).

Trider12 commented 5 months ago

If I was app developer I would be like "I don't know how to conform to this".

It's frustrating that this stayed unnoticed for 3 years and I had to go into a deep rabbit hole of issues (leading to this issue) to figure this out.

KhronosGroup / Vulkan-Docs