KhronosGroup / Vulkan-ValidationLayers

Vulkan Validation Layers (VVL)
https://vulkan.lunarg.com/doc/sdk/latest/linux/khronos_validation_layer.html
Other
754 stars 404 forks source link

Timeline semaphores and resource state tracking issues #2441

Open mbechard opened 3 years ago

mbechard commented 3 years ago

Describe the Issue Hey, I'm having a validation error show up that I think is incorrect. I have a texture that is updated from the transfer queue and then used as a sampler in the graphics queue. The transfer queue releases the texture and does a TransferDst->ShaderRead layout transition. The graphics queue acquires it and records a usage of it, and then gives the texture pointer back to the transfer queue thread for future updates. The way the engine is setup the transfer queue will submit commands often, while the graphics queue will do it once or twice per frame. What often happens is that the transfer queue will re-use and submit the new usage of that texture before the graphics queue has submitted it's usage. This is all protected by timeline semaphores to ensure the actual usage isn't occurring in the transfer queue until the graphics queue is actually done with it. But the submission is out of order (in different queues).

When submitting the command buffer for the graphics queue, the error states: VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL--instead, current layout is VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL. Which I think is occurring because the transfer queue has already submitted a new command buffer that is doing another release operation with a TransferDst->ShaderRead transition included, and this submission is changing the state tracking in the validation layer too soon.

It seems like the layout changes are recorded/validated at submission time, ignoring the fact that the timeline semaphore should be waited on before recording/validating the layout change in the validation layers. Likely since previous to timeline semaphores the binary semaphore submit-before-wait rule allowed for this logic.

I think with timeline semaphores the validation now should wait until any timeline semaphores it's waiting on have been submitted so this state tracking occurs in the correct order.

Am I correct that this is the current behavior? I suppose the work around for now is to do a CPU wait on the timeline semaphore before submission when using validation layers.

Valid Usage ID UNASSIGNED-CoreValidation-DrawState-InvalidImageLayout

Environment:

mbechard commented 3 years ago

Can anyone confirm this behavior is occurring?

jeremyg-lunarg commented 3 years ago

I apologize for the delay in our response to this issue. I think you are correct about the current behavior of validation in this case. Handling this in validation without ruining performance or introducing potential deadlock could be tricky, so this will most likely not be a trivial fix.

Would it be easy for you to provide sample code that demonstrates the problem?

Also, another workaround could be to disable UNASSIGNED-CoreValidation-DrawState-InvalidImageLayout by adding it to khronos_validation.message_id_filter in vk_layer_settings.txt (see https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/master/layers/vk_layer_settings.txt#L50).

mbechard commented 3 years ago

Sure, attached is an edited example from https://github.com/KhronosGroup/Vulkan-Samples/tree/master/samples/api/texture_loading

texture_loading.zip

I'm just transferring ownership from transfer queue -> gfx -> transfer queue while doing layout transitions. Search for 'BUGPOI' to find the line of interest. If those queue submits are reversed in order then things are fine, in that order the validation layers raise errors.

Yeah I can see how it'd be super tricky, but I think slowness better than false positives when it comes to using the validation layers. The ability to turn off these checks to improve performance should be an option, but the default behavior should be to report correctly. Possibly the validation can be deferred until the appropriate signal operation is submitted. That would avoid any stalls or deadlocks.

mbechard commented 3 years ago

This is coming up as a bigger issue for me. Using multiple queues with out-of-order submission via timeline semaphores with the debug layers is super prone to false positives. I've worked around it by manually doing wait() operations to serialize my queue submission to be in order of dependency, however this causes deadlocks if the multi-queue submission occurs within the same thread. I would assume a solution to this would be, if a queue is dependent on a timeline semaphore, defer the validation until a queue submit that will signal that semaphore, or a manual signal operation, occurs.

martty commented 2 years ago

I have ran into the same issue, I am attaching a reasonably minimal API dump: dump.txt

It contains the recording of two command buffers, but only the second is submitted - the first would be submitted subsequently. This submission however waits on the future submission of the first.

During the last command in the dump (the submit), the following validation false positives occur:

Validation Error: [ UNASSIGNED-CoreValidation-DrawState-InvalidImageLayout ] Object 0: handle = 0x2e4077a7ed0, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x4dae5635 | vkQueueSubmit2KHR(): pSubmits[0].pCommandBufferInfos[0].commandBuffer command buffer VkCommandBuffer 0x2e4077a7ed0[] expects VkImage 0xd897d90000000016[] (subresource: aspectMask 0x1 array layer 0, mip level 0) to be in layout VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL--instead, current layout is VK_IMAGE_LAYOUT_UNDEFINED.

Validation Error: [ UNASSIGNED-VkImageMemoryBarrier-image-00004 ] Object 0: handle = 0x2e4077a7ed0, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x4acfa767 | vkQueueSubmit(): in submitted command buffer VkImageMemoryBarrier acquiring ownership of VkImage (VkImage 0xd897d90000000016[]), from srcQueueFamilyIndex 2 to dstQueueFamilyIndex 0 has no matching release barrier queued for execution.

jeremyg-lunarg commented 2 years ago

I recently merged a PR that made a lot of improvements to timeline semaphore state tracking. I'm not sure yet if it will fix this problem described in the issue. Are you able to update to the latest Vulkan-ValidationLayers code and see if it is fixed? Otherwise I'll try to make a test case that shows the error (and if it is fixed), but it might be next week or so before I get to it.

martty commented 2 years ago

Hi, thank you for looking into it.

I built latest VVL (ad20b97), but unfortunately I still get the same validation errors.

jeremyg-lunarg commented 2 years ago

We're both tracking (CoreChecks::PostCallRecordQueueSubmit) and validating image layouts transitions at submit time but we shouldn't be doing this work until any previous wait semaphores have signaled. The PR I mentioned above gets us closer to making this possible but further reorg is required.

mbechard commented 2 years ago

You should be able to validate when the command buffer that will signal >= the timeline value is submitted, no need to wait for actual signal, I think?

jeremyg-lunarg commented 2 years ago

You should be able to validate when the command buffer that will signal >= the timeline value is submitted, no need to wait for actual signal, I think?

Good point and I think we're tracking enough information now in SEMAPHORESTATE::operations to do that.