KhronosGroup / Vulkan-ValidationLayers

Vulkan Validation Layers (VVL)
https://vulkan.lunarg.com/doc/sdk/latest/linux/khronos_validation_layer.html

Bug in VUID-VkDescriptorImageInfo-imageLayout-00344 #4646

Open forenoonwatch opened 2 years ago

forenoonwatch commented 2 years ago

Describe the Issue

The validation layers report an error at vkCmdDrawIndexed when the bound descriptor set expects an image in a layout other than VK_IMAGE_LAYOUT_UNDEFINED and the barrier that transitions the image was recorded in a different command buffer, even when both command buffers are submitted together in the correct order.

To replicate, create two command buffers. On command buffer 1, record a barrier that transitions an image. On command buffer 2, bind a descriptor set that references this image in its new layout and call vkCmdDrawIndexed. Submit both command buffers, in order, as the pCommandBuffers of a single VkSubmitInfo.
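The expectation behind these steps can be sketched with a small standalone model. It is not how the validation layers are implemented and it uses no Vulkan headers: `Layout`, `ImageUse`, `CommandBuffer`, `ValidateSubmit` and `ReplayReport` are all hypothetical names chosen for illustration. The model assumes that each command buffer records the layout its first use of an image requires and the layout it leaves the image in, and that these records are resolved in submission order.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for VkImageLayout and VkImage; not the real Vulkan types.
enum class Layout { Undefined, ShaderReadOnly };
using ImageId = std::uint64_t;

// What one recorded command buffer knows about an image.
struct ImageUse {
    bool has_expectation;   // false when the first use is a barrier with oldLayout UNDEFINED
    Layout first_expected;  // layout required by the first use (ignored if !has_expectation)
    Layout final_layout;    // layout the image is left in when the buffer finishes
};

struct CommandBuffer {
    std::string name;
    std::map<ImageId, ImageUse> uses;
};

// Walks the command buffers of one submission in order. Each buffer's first-use
// expectation is checked against the layout left by everything before it.
std::vector<std::string> ValidateSubmit(std::map<ImageId, Layout>& global,
                                        const std::vector<CommandBuffer>& cbs) {
    assert(!cbs.empty());  // a submission carries at least one command buffer here
    std::vector<std::string> errors;
    for (const CommandBuffer& cb : cbs) {
        for (const auto& entry : cb.uses) {
            const auto it = global.find(entry.first);
            const Layout current = (it == global.end()) ? Layout::Undefined : it->second;
            if (entry.second.has_expectation && entry.second.first_expected != current) {
                errors.push_back(cb.name + ": image layout mismatch");
            }
            global[entry.first] = entry.second.final_layout;
        }
    }
    return errors;
}

// Replays the report: cb1 transitions the image, cb2 samples it.
std::vector<std::string> ReplayReport(bool correct_order) {
    const ImageId image = 1;
    CommandBuffer cb1{"cb1", {}};
    cb1.uses[image] = ImageUse{false, Layout::Undefined, Layout::ShaderReadOnly};
    CommandBuffer cb2{"cb2", {}};
    cb2.uses[image] = ImageUse{true, Layout::ShaderReadOnly, Layout::ShaderReadOnly};

    std::map<ImageId, Layout> global;  // the image starts out UNDEFINED
    return correct_order ? ValidateSubmit(global, {cb1, cb2})
                         : ValidateSubmit(global, {cb2, cb1});
}
```

Under this model `ReplayReport(true)` yields no errors, which is the behaviour the report expects from the layers. Only the reversed order, `ReplayReport(false)`, produces a mismatch.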

Valid Usage ID

VUID-VkDescriptorImageInfo-imageLayout-00344(ERROR / SPEC): msgNum: -564812795 - Validation Error: [ VUID-VkDescriptorImageInfo-imageLayout-00344 ] Object 0: handle = 0x29359b01db8, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0xde55a405 | vkCmdDrawIndexed: Cannot use VkImage 0x5c59a0000000056[] (layer=0 mip=0) with specific layout VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL that doesn't match the previously used layout VK_IMAGE_LAYOUT_UNDEFINED. The Vulkan spec states: imageLayout must match the actual VkImageLayout of each subresource accessible from imageView at the time this descriptor is accessed as defined by the image layout matching rules (https://vulkan.lunarg.com/doc/view/1.3.224.1/windows/1.3-extensions/vkspec.html#VUID-VkDescriptorImageInfo-imageLayout-00344)
    Objects: 1
        [0] 0x29359b01db8, type: 6, name: NULL
[ERROR: Validation]

These validation errors appear for the first few frames of execution and then cease.

Environment:

Additional context

In my own code that hits this issue, the image that triggers the error is transitioned from VK_IMAGE_LAYOUT_UNDEFINED to VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL via a render pass. It is then transitioned to the desired VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL with a pipeline barrier after executing compute shaders. After this, the image is bound in a descriptor set and used in another render pass. If these operations are recorded into a single command buffer, there are no validation errors. If they are split into two command buffers just after the barriers, the above validation errors are emitted. All command buffers are primary and one-time-submit.

forenoonwatch commented 2 years ago

Update: I believe this bug only happens if the above conditions exist and the VkImage resource is given as a color attachment on a render pass, but never referenced in any subpass.

OnePride commented 9 months ago

I've just got the same bug, but with a depth-stencil image. Two command buffers, the first one does the layout transition: UNDEFINED -> DEPTH_STENCIL_ATTACHMENT_OPTIMAL, then submits. The second one does ATTACHMENT_OPTIMAL -> READ_ONLY_OPTIMAL. After submit I get:

Validation Error: [ UNASSIGNED-CoreValidation-DrawState-InvalidImageLayout ] Object 0: handle = 0x21e1dc6cf50, name = Stream0_GraphCommandBuffer1, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x4995200000000d81, name = DummyShadowCascade, type = VK_OBJECT_TYPE_IMAGE; | MessageID = 0x4dae5635 | vkQueueSubmit2(): pSubmits[0].pCommandBufferInfos[0].commandBuffer command buffer VkCommandBuffer 0x21e1dc6cf50[Stream0_GraphCommandBuffer1] expects VkImage 0x4995200000000d81[DummyShadowCascade] (subresource: aspectMask 0x2 array layer 1, mip level 0) to be in layout VK_IMAGE_LAYOUT_DEPTH_ATTACHMENT_OPTIMAL--instead, current layout is VK_IMAGE_LAYOUT_UNDEFINED.

Win 10, NVIDIA GeForce RTX 2060, Vulkan SDK 1.3.275.0

glebov-andrey commented 9 months ago

@OnePride Was your case by any chance with GPU-Assisted validation enabled? I'm seeing similar issues but specifically with GPU-AV and not in any other configuration.

jeremyg-lunarg commented 9 months ago

> @OnePride Was your case by any chance with GPU-Assisted validation enabled? I'm seeing similar issues but specifically with GPU-AV and not in any other configuration.

If you're using descriptor indexing, this VUID will be checked by GPU-AV. If not, it'll be checked with core validation. But it would be good to know if this problem is only happening on the GPU-AV path or on both.

glebov-andrey commented 9 months ago

@jeremyg-lunarg OK, that's really odd then. Because in my case GPU-AV is reporting these errors for images which are never used in descriptor sets with descriptor indexing. Although other shaders definitely use descriptor indexing for other resources - is the check global for the device?

Another thing I've noticed is that both the expected and current layouts are often incorrect. For example, here the shader in question actually uses the image in a COMBINED_IMAGE_SAMPLER descriptor that expects the SHADER_READ_ONLY_OPTIMAL layout:

Validation Error: [ UNASSIGNED-CoreValidation-DrawState-InvalidImageLayout ]
Object 0: handle = 0x7ff3b023e510, name = CommandBuffer | Frame 0 | pass_name, type = VK_OBJECT_TYPE_COMMAND_BUFFER;
Object 1: handle = 0x73a850000000004d, name = graph_res#0_image, type = VK_OBJECT_TYPE_IMAGE; |
MessageID = 0x4dae5635 | vkCmdDraw(): 
command buffer VkCommandBuffer 0x7ff3b023e510[CommandBuffer | Frame 0 | pass_name] expects VkImage 0x73a850000000004d[graph_res#0_image]
(subresource: aspectMask 0x1 array layer 0, mip level 0) to be in layout VK_IMAGE_LAYOUT_GENERAL--instead, current layout is VK_IMAGE_LAYOUT_UNDEFINED.

And the sequence of commands leading up to here is pretty simple:

  1. vkCmdPipelineBarrier2:
    srcStageMask VK_PIPELINE_STAGE_2_NONE (OK because this is frame 0)
    srcAccessMask VK_ACCESS_2_NONE
    dstStageMask VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT
    dstAccessMask VK_ACCESS_2_SHADER_READ_BIT | VK_ACCESS_2_SHADER_WRITE_BIT
    oldLayout   VK_IMAGE_LAYOUT_UNDEFINED
    newLayout   VK_IMAGE_LAYOUT_GENERAL
  2. vkCmdDispatch: STORAGE_IMAGE
  3. vkCmdPipelineBarrier2:
    srcStageMask VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT
    srcAccessMask VK_ACCESS_2_SHADER_WRITE_BIT
    dstStageMask VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT
    dstAccessMask VK_ACCESS_2_SHADER_READ_BIT
    oldLayout   VK_IMAGE_LAYOUT_GENERAL
    newLayout   VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
  4. vkCmdDispatch: COMBINED_IMAGE_SAMPLER - ERROR HERE with expected layout GENERAL
  5. vkCmdDispatch: COMBINED_IMAGE_SAMPLER - ANOTHER ERROR HERE. In this case the message has the correct expected layout (SHADER_READ_ONLY_OPTIMAL) but still reports the current layout as UNDEFINED
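As a sanity check on the sequence above, the five steps can be replayed through a tiny standalone layout tracker. This is only a sketch and makes several assumptions: `Layout`, `Step`, `Barrier`, `Dispatch`, `FindMismatches` and `ReportedSequence` are hypothetical names, no Vulkan headers are involved, and the storage-image descriptor in step 2 is assumed to have been written with the GENERAL layout. It says nothing about how GPU-AV tracks layouts internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for VkImageLayout.
enum class Layout { Undefined, General, ShaderReadOnly };

struct Step {
    bool is_barrier;
    Layout old_layout;  // barriers only
    Layout layout;      // barrier: newLayout; dispatch: layout the descriptor was written with
};

Step Barrier(Layout old_layout, Layout new_layout) {
    // Vulkan does not allow a transition *to* UNDEFINED.
    assert(new_layout != Layout::Undefined);
    return Step{true, old_layout, new_layout};
}

Step Dispatch(Layout descriptor_layout) {
    return Step{false, Layout::Undefined, descriptor_layout};
}

// Returns the 1-based numbers of the steps whose layout does not match the tracked one.
std::vector<std::size_t> FindMismatches(const std::vector<Step>& steps) {
    Layout current = Layout::Undefined;
    std::vector<std::size_t> bad;
    for (std::size_t i = 0; i < steps.size(); ++i) {
        const Step& s = steps[i];
        if (s.is_barrier) {
            // oldLayout UNDEFINED is always accepted; it only discards the contents.
            if (s.old_layout != Layout::Undefined && s.old_layout != current) {
                bad.push_back(i + 1);
            }
            current = s.layout;
        } else if (s.layout != current) {
            bad.push_back(i + 1);
        }
    }
    return bad;
}

// Steps 1-5 as listed above.
std::vector<Step> ReportedSequence() {
    return {
        Barrier(Layout::Undefined, Layout::General),      // 1
        Dispatch(Layout::General),                        // 2: STORAGE_IMAGE
        Barrier(Layout::General, Layout::ShaderReadOnly), // 3
        Dispatch(Layout::ShaderReadOnly),                 // 4: COMBINED_IMAGE_SAMPLER
        Dispatch(Layout::ShaderReadOnly),                 // 5: COMBINED_IMAGE_SAMPLER
    };
}
```

In this model `FindMismatches(ReportedSequence())` comes back empty, which matches the view that the errors at steps 4 and 5 are false positives. Dropping step 3 from the list would flag both sampled dispatches.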

In fact I'm getting over 1300 such errors per frame for different resources (often one error per subresource). Some are particularly interesting because they expect COLOR_ATTACHMENT_OPTIMAL for a sampled image in a compute shader, while also reporting the command as vkCmdDraw.

Another problem is that I can't seem to create a minimal reproducer. I've tried swapping the order of command buffer recording, adding an extra command buffer between them, but so far no luck.

Should I create a new issue for this?

jeremyg-lunarg commented 9 months ago

Thanks for the details. If you can capture anything that reproduces the error with gfxreconstruct I should be able to take a look and/or whittle it down to a manageable size.

> Should I create a new issue for this?

I don't think that's needed at this point.

glebov-andrey commented 9 months ago

@jeremyg-lunarg I've managed to create a gfxreconstruct capture which reproduces the issue. Would it be OK if I sent it via email?

I had to modify gfxreconstruct to get it to play so here's the patch for that: Work_around_crash_with_imageless_framebuffer.patch.

jeremyg-lunarg commented 9 months ago

> @jeremyg-lunarg I've managed to create a gfxreconstruct capture which reproduces the issue. Would it be OK if I sent it via email?

Yes, my email is jeremyg@lunarg.com

> I had to modify gfxreconstruct to get it to play so here's the patch for that: Work_around_crash_with_imageless_framebuffer.patch.

Thanks for that. Do you think this patch is worthy of a PR to gfxreconstruct?

glebov-andrey commented 9 months ago

@jeremyg-lunarg I've sent you the email with the capture

> Do you think this patch is worthy of a PR to gfxreconstruct?

I'm actually pretty sure this is not the correct way to fix the issue (it's just a workaround). I'm planning to report the issue properly when I get the chance. There are actually more like 3 bugs to report to gfxreconstruct.

jeremyg-lunarg commented 8 months ago

@glebov-andrey Sorry it has taken so long for me to investigate this. Running your trace with Core validation, I get several instances of VUID-vkBeginCommandBuffer-commandBuffer-00049; one of these is for the command buffer submission containing the pipeline barriers for the image layout transitions to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL. When running GPU-AV, I suspect the premature vkBeginCommandBuffer() causes the related data to be 'lost', which later causes the false image layout errors you reported. It seems there may be a missing vkQueueWaitIdle() or vkWaitSemaphores() call in your application before it tries to reuse these command buffers.
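For readers unfamiliar with that VUID, the rule can be sketched as a small state machine. VUID-vkBeginCommandBuffer-commandBuffer-00049 requires that the command buffer is not in the recording or pending state when vkBeginCommandBuffer is called. The state names below follow the command buffer lifecycle in the Vulkan specification. `CommandBufferModel` and `BeginAfterSubmit` are hypothetical, and the model ignores the separate requirement that implicitly resetting a buffer needs a pool created with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT.

```cpp
#include <cassert>

// Command buffer lifecycle states as named in the Vulkan specification.
enum class State { Initial, Recording, Executable, Pending, Invalid };

// Hypothetical model of one primary command buffer; not a VVL or driver type.
struct CommandBufferModel {
    State state = State::Initial;
    bool one_time_submit = true;

    // Mirrors VUID-vkBeginCommandBuffer-commandBuffer-00049: begin is rejected
    // while the buffer is still recording or pending.
    bool Begin() {
        if (state == State::Recording || state == State::Pending) return false;
        state = State::Recording;
        return true;
    }

    void End() {
        assert(state == State::Recording);
        state = State::Executable;
    }

    void Submit() {
        assert(state == State::Executable);
        state = State::Pending;
    }

    // Called once the application has observed completion, for example after
    // waiting on a fence or a timeline semaphore value.
    void OnExecutionComplete() {
        assert(state == State::Pending);
        state = one_time_submit ? State::Invalid : State::Executable;
    }
};

// Records, submits, optionally waits, then tries to begin the same buffer again.
bool BeginAfterSubmit(bool wait_for_completion) {
    CommandBufferModel cb;
    cb.Begin();
    cb.End();
    cb.Submit();
    if (wait_for_completion) cb.OnExecutionComplete();
    return cb.Begin();
}
```

Here `BeginAfterSubmit(false)` is rejected because the buffer is still pending, while `BeginAfterSubmit(true)` succeeds. That is the difference a missing wait would make in the trace.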

jeremyg-lunarg commented 8 months ago

@forenoonwatch or @OnePride, https://github.com/KhronosGroup/Vulkan-ValidationLayers/pull/7669 is my attempt to make a test case that replicates the original problem in this Issue. But it seems like I'm missing something as the test currently passes. Could either of you tell me what needs to change to match your code that hit the error?

glebov-andrey commented 8 months ago

> Running your trace with Core validation, I get several instances of VUID-vkBeginCommandBuffer-commandBuffer-00049; one of these is for the command buffer submission containing the pipeline barriers for the image layout transitions to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL. When running GPU-AV, I suspect the premature vkBeginCommandBuffer() causes the related data to be 'lost', which later causes the false image layout errors you reported. It seems there may be a missing vkQueueWaitIdle() or vkWaitSemaphores() call in your application before it tries to reuse these command buffers.

@jeremyg-lunarg Thanks for looking into this! Regarding VUID-vkBeginCommandBuffer-commandBuffer-00049 - I can't get anything like that to happen in the real application, so I think this is an issue with gfxreconstruct incorrectly handling the inter-thread usage of timeline semaphores (calls to vkGetSemaphoreCounterValue and vkWaitSemaphores can affect resumption of work in other threads). That being said, after updating to the current main branch of VVL, I can no longer reproduce the original image layout errors.

There is, however, a new problem which might be related: a race (reported by TSAN) between an image being destroyed and its layout being validated. Sometimes this results in an assertion failure here: https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/441b6a90d2320dcc962622d1f3cd5bea71423c7b/layers/gpu_validation/gpu_image_layout.cpp#L766 The call to destroy comes immediately after observing, via vkGetSemaphoreCounterValue, that the command buffer has completed execution, and the assertion is triggered in a VVL-internal thread. Perhaps vkGetSemaphoreCounterValue returns values for not-yet-validated submissions?