KhronosGroup / Vulkan-ValidationLayers

Vulkan Validation Layers (VVL)
https://vulkan.lunarg.com/doc/sdk/latest/linux/khronos_validation_layer.html
Other
763 stars 403 forks source link

Incorrect SYNC-HAZARD-WRITE-AFTER-READ error with vkCmdPipelineBarrier2 applied to suballocated buffers #8502

Closed HuYuxin closed 1 week ago

HuYuxin commented 1 month ago

Environment:

Describe the Issue

To reproduce:

  1. Sync latest ANGLE: https://github.com/google/angle
  2. Cherry-pick these two changes: https://chromium-review.googlesource.com/c/angle/angle/+/5832709/4 (patchset 4) https://chromium-review.googlesource.com/c/angle/angle/+/5832710/10 (patchset 10)
  3. Set below gn args gn args out/LinuxDebug
    dcheck_always_on = true
    is_component_build = true
    is_debug = false
    symbol_level = 1
    angle_enable_cl=true
  4. Compile with command autoninja -C out/LinuxDebug angle_end2end_tests
  5. Run test with command out/LinuxDebug/angle_end2end_tests --gtest_filter="LineLoopIndirectTest.UShortIndexIndirectBuffer/ES3_1_Vulkan" --verbose --local-output

Observe that the test failed with this validation error:

[ SYNC-HAZARD-WRITE-AFTER-READ ] Validation Error: [ SYNC-HAZARD-WRITE-AFTER-READ ] Object 0: handle = 0x5557db026030, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x200000000020, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0x376bc9df | vkCmdCopyImageToBuffer():  Hazard WRITE_AFTER_READ for dstBuffer VkBuffer 0x200000000020[], region 0. Access info (usage: SYNC_COPY_TRANSFER_WRITE, prior_usage: SYNC_INDEX_INPUT_INDEX_READ, read_barriers: VK_PIPELINE_STAGE_2_BOTTOM_OF_PIPE_BIT, command: vkCmdDrawIndexedIndirect, seq_no: 8, reset_no: 3, resource: VkBuffer 0x200000000020[]).

Expected behavior

Test should pass without the SYNC-HAZARD-WRITE-AFTER-READ validation error.

Valid Usage ID If applicable, please include the validation messages encountered leading up to the issue

[ SYNC-HAZARD-WRITE-AFTER-READ ] Validation Error: [ SYNC-HAZARD-WRITE-AFTER-READ ] Object 0: handle = 0x5557db026030, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x200000000020, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0x376bc9df | vkCmdCopyImageToBuffer():  Hazard WRITE_AFTER_READ for dstBuffer VkBuffer 0x200000000020[], region 0. Access info (usage: SYNC_COPY_TRANSFER_WRITE, prior_usage: SYNC_INDEX_INPUT_INDEX_READ, read_barriers: VK_PIPELINE_STAGE_2_BOTTOM_OF_PIPE_BIT, command: vkCmdDrawIndexedIndirect, seq_no: 8, reset_no: 3, resource: VkBuffer 0x200000000020[]).

Additional context

The renderdoc capture shows that VVL thinks there is a missing execution barrier on the VkBuffer:

VkBuffer used as index buffer for vkCmdDrawIndexed() --- VVL thinks there is a missing execution barrier--- VkBuffer used as dest buffer for vkCmdCopyImageToBuffer()

However, the two buffers are suballocated. Even they belong to the same big VkBuffer, they are actually pointing to different memories.

We think that using VK_KHR_Synchronization2 exposed this issue in VVL. Before we use VK_KHR_Synchronization2, the srcStageMask and dstStageMask are shared between VkMemoryBarrier, VkBufferMemoryBarier, and VkImageMemoryBarrier. in the vkCmdPipelineBarrier () call. Between the vkCmdDrawIndex() and vkCmdCopyImageToBuffer(), even if only image barrier is needed, the srcStageMask and dstStageMask values also apply to the VkMemoryBarrier and VkBufferMemoryBarrier, and it fulfills VVL's validation on buffer execution barrier (even there is no execution barrier needed on buffer because they are suballocated and point to different memories). After switching to VK_KHR_Synchronization2, the srcStageMask and dstStageMask are packed within the VkMemoryBarrier2, VkBufferMemoryBarier2, VkImageMemoryBarrier2. So now if VkMemoryBarrier2 is null, there is no execution barrier applied to VkBuffer, and VVL incorrectly thinks that there is a missing execution barrier on the buffer.

ShabbyX commented 1 month ago

There's likely some relationship with https://github.com/KhronosGroup/Vulkan-ValidationLayers/issues/3605

Basically, the range of a buffer resource is not taken into account in syncval (it seems)

artem-lunarg commented 1 month ago

I wonder if https://github.com/KhronosGroup/Vulkan-ValidationLayers/issues/3605 was fixed by https://github.com/KhronosGroup/Vulkan-ValidationLayers/pull/8141, we indeed did not compute offsets correctly for dynamic descriptors (but I did not look closely at these issues yet, so could be a wrong suggestion). This issue should be something different because it is based on the latest code.

ShabbyX commented 1 month ago

I wonder if #3605 was fixed by #8141, we indeed did not compute offsets correctly for dynamic descriptors (but I did not look closely at these issues yet, so could be a wrong suggestion). This issue should be something different because it is based on the latest code.

Yes, looks like the dynamic offset part is fixed: https://chromium-review.googlesource.com/c/angle/angle/+/5873529

We're still getting false positives because the vertex buffer ranges are not taken into account (it seems)

artem-lunarg commented 3 weeks ago

Should be fixed by https://github.com/KhronosGroup/Vulkan-ValidationLayers/pull/8674

ShabbyX commented 2 weeks ago

It doesn't look like this was about indirect buffers only though. I tried removing this suppression and we still get VVL failures: https://chromium-review.googlesource.com/c/angle/angle/+/5937422

Typically, the failures we see are in the form of vertex-attribute usage reported not synced with transfer, but those are happening to different ranges of the buffer.

artem-lunarg commented 2 weeks ago

It doesn't look like this was about indirect buffers only though

The error message in this issue is exactly what did not work (indirect command that accesses index or vertex buffer which are suballocated) but I might miss something.

[ SYNC-HAZARD-WRITE-AFTER-READ ] Validation Error: [ SYNC-HAZARD-WRITE-AFTER-READ ] Object 0: handle = 0x5557db026030, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x200000000020, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0x376bc9df | vkCmdCopyImageToBuffer(): Hazard WRITE_AFTER_READ for dstBuffer VkBuffer 0x200000000020[], region 0. Access info (usage: SYNC_COPY_TRANSFER_WRITE, prior_usage: SYNC_INDEX_INPUT_INDEX_READ, read_barriers: VK_PIPELINE_STAGE_2_BOTTOM_OF_PIPE_BIT, command: vkCmdDrawIndexedIndirect, seq_no: 8, reset_no: 3, resource: VkBuffer 0x200000000020[]).

@ShabbyX I'm not sure where I need to look for the errors in the above link, could you help to locate it?

artem-lunarg commented 2 weeks ago

I'm going to use angle directly for testing to be sure. Will be looking into this tomorrow, hopefully we will fix this.

artem-lunarg commented 2 weeks ago

@ShabbyX could you confirm that out/LinuxDebug/angle_end2end_tests --gtest_filter="LineLoopIndirectTest.UShortIndexIndirectBuffer/ES3_1_Vulkan" --verbose --local-output command line still produces the errors reported by this issue when the corresponding section in vk_renderer.cpp is uncommented?

Also should I apply any patches to the latest angle code? My impression that the first patch (https://chromium-review.googlesource.com/c/angle/angle/+/5832709/4) is already merged but not the second (https://chromium-review.googlesource.com/c/angle/angle/+/5832710/10).

ShabbyX commented 2 weeks ago

That test is fixed, I can confirm. Unsuppressing related VVL errors in https://chromium-review.googlesource.com/c/angle/angle/+/5937422, there are other failures, not related to indirect buffers. You don't need to cherry-pick anything, it has nothing to do with https://chromium-review.googlesource.com/c/angle/angle/+/5832710 and the issue is in fact much older than that.

Try taking that first change (unsuppresses VVL errors) and run this test: --gtest_filter=BufferDataTestES3.CopyBufferSubDataDraw/ES3_Vulkan. I get this:

vk_renderer.cpp:777 (DebugUtilsMessenger): [ SYNC-HAZARD-READ-AFTER-WRITE ] Validation Error: [ SYNC-HAZARD-READ-AFTER-WRITE ] Object 0: handle = 0x240000000024, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xe4d96472 | vkCmdDrawIndexed():  Hazard READ_AFTER_WRITE for vertex VkBuffer 0x240000000024[] in VkCommandBuffer 0x55df9d7ffb90[]. Access info (usage: SYNC_VERTEX_ATTRIBUTE_INPUT_VERTEX_ATTRIBUTE_READ, prior_usage: SYNC_COPY_TRANSFER_WRITE, write_barriers: 0, command: vkCmdCopyBuffer, seq_no: 2, reset_no: 7, resource: VkBuffer 0x240000000024[]).
                            Object: 0x240000000024 (type = Buffer(9))
ShabbyX commented 2 weeks ago

Ooooh you know what, I think actually I know where this is coming from. vkCmdBindVertexBuffers doesn't take the size of the buffer, only the offset. So syncval has no way to know how much of the buffer is actually used as vertex input (well, with some gymnastics it might, at least when geometry/tessellation/indirect are not involved).

If you're confident that syncval is taking buffer ranges into account, the VVL issue probably goes away for us if we use vkCmdBindVertexBuffers2 and pass in the sizes.

I remember now we had the same problem with robustness, where we couldn't rely on robustness with vertex data in conjunction with suballocation because the driver couldn't tell where the end of the data is.

Feel free to re-close for now.

artem-lunarg commented 1 week ago

I can reproduce error with --gtest_filter=BufferDataTestES3.CopyBufferSubDataDraw/ES3_Vulkan on my machine it's about index buffer:

[ SYNC-HAZARD-READ-AFTER-WRITE ] Validation Error: [ SYNC-HAZARD-READ-AFTER-WRITE ] Object 0: handle = 0x280000000028, type = VK_OBJECT_TYPE_BUFFER; | MessageID = 0xe4d96472 | vkCmdDrawIndexed(): Hazard READ_AFTER_WRITE for vertex VkBuffer 0x280000000028[] in VkCommandBuffer 0x2309d9c33a0[]. Access info (usage: SYNC_VERTEX_ATTRIBUTE_INPUT_VERTEX_ATTRIBUTE_READ, prior_usage: SYNC_COPY_TRANSFER_WRITE, write_barriers: 0, command: vkCmdCopyBuffer, seq_no: 2, reset_no: 9, resource: VkBuffer 0x280000000028[]).

I can see a potential problem here (and the same with vertex attributes access). The error says data was written by a copy command and then read by the vertex pipeline but sync validation did not detect a barrier that ensures writes can be used by the following read access (write_barriers: 0). Could it be a missing/incomplete barrier in angle? If you think everything is okay then I can debug sync validation to confirm false positive and if that's the case then it should be a separate issue.

ShabbyX commented 1 week ago

No like I said I'm pretty sure this is just a matter of vkCmdBindVertexBuffers implicitly assuming the vertex data is read from the offset to the end of the buffer (because there's no size). So in a way it's a false positive, but it's Vulkan's fault (vkCmdBindVertexBuffers2 fixes it).

You can close this, when we get a chance to use vkCmdBindVertexBuffers2 and give the size, we can check if this continues to be reported.

artem-lunarg commented 1 week ago

So in a way it's a false positive

We are trying to follow the direction now that syncval does not report false positives. Ether to disable specific validation or to make it optional (disabled by default). Something I need to look at.

artem-lunarg commented 1 week ago

No like I said I'm pretty sure this is just a matter of vkCmdBindVertexBuffers implicitly assuming the vertex data is read from the offset to the end of the buffer

Makes sense, my brain was somewhere else

ShabbyX commented 1 week ago

We are trying to follow the direction now that syncval does not report false positives. Ether to disable specific validation or to make it optional (disabled by default). Something I need to look at.

Sounds good. I guess an easy way to still have some validation for vertex attributes but also remove false positives is to implicitly assume size == 1 when vkCmdBindVertexBuffers is used (or vkCmdBindVertexBuffers2 with pSizes == nullptr).

You won't catch bugs with overlapping areas used for different purposes, but that should be very rare. You would correctly catch syncval errors for the usual usage of a buffer range being used as one entity (because one byte represents the entire thing)