KhronosGroup / Vulkan-ValidationLayers

Vulkan Validation Layers (VVL)
https://vulkan.lunarg.com/doc/sdk/latest/linux/khronos_validation_layer.html

VVL miss a vulkan device lost check when reporting VUID-vkDestroy* error messages #6310

Closed HuYuxin closed 6 months ago

HuYuxin commented 1 year ago

Describe the Issue

When VVL processes vkDestroy*() calls, it does not appear to account for the case where the Vulkan device is lost and the submitted work that references those resources can never finish executing on the GPU.

To reproduce:

  1. Follow ANGLE Development Setup to get ANGLE source code.

  2. Follow Setting up the ANGLE build for Android to download the Android build dependencies and set up the GN args for building the Android target. Make sure the Vulkan validation layers are enabled by adding the line below to the GN args:

angle_enable_vulkan_validation_layers = true
  3. Run this dEQP test from the ANGLE repo root on an Android device running Android 13 or above:
out/Android/angle_deqp_egl_tests --gtest_filter=dEQP-EGL.functional.robustness.reset_context.shaders.out_of_bounds_non_robust.reset_status.writes.local_array.fragment --verbose --local-output --num-retries=0 --skip-clear-data
  4. Observe that the test fails with the VVL error messages below:
[ VUID-vkDestroyFence-fence-01120 ] Validation Error: [ VUID-vkDestroyFence-fence-01120 ] Object 0: handle = 0x280000000028, type = VK_OBJECT_TYPE_FENCE; | MessageID = 0x5d296248 | VkFence 0x280000000028[] is in use. The Vulkan spec states: All queue submission commands that refer to fence must have completed execution

[ VUID-vkDestroyPipeline-pipeline-00765 ] Validation Error: [ VUID-vkDestroyPipeline-pipeline-00765 ] | MessageID = 0x6bdce5fd | Cannot call vkDestroyPipeline on VkPipeline 0x200000000020[] that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to pipeline must have completed execution

[ VUID-vkDestroyBuffer-buffer-00922 ] Validation Error: [ VUID-vkDestroyBuffer-buffer-00922 ] | MessageID = 0xe4549c11 | Cannot call vkDestroyBuffer on VkBuffer 0x240000000024[] that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to buffer, either directly or via a VkBufferView, must have completed execution

[ VUID-vkDestroyRenderPass-renderPass-00873 ] Validation Error: [ VUID-vkDestroyRenderPass-renderPass-00873 ] | MessageID = 0x473619ad | Cannot call vkDestroyRenderPass on VkRenderPass 0x270000000027[] that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to renderPass must have completed execution

[ VUID-vkDestroyBuffer-buffer-00922 ] Validation Error: [ VUID-vkDestroyBuffer-buffer-00922 ] | MessageID = 0xe4549c11 | Cannot call vkDestroyBuffer on VkBuffer 0x180000000018[] that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to buffer, either directly or via a VkBufferView, must have completed execution

[ VUID-vkDestroyImageView-imageView-01026 ] Validation Error: [ VUID-vkDestroyImageView-imageView-01026 ] | MessageID = 0x63ac21f0 | Cannot call vkDestroyImageView on VkImageView 0x210000000021[] that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to imageView must have completed execution

[ VUID-vkDestroyCommandPool-commandPool-00041 ] Validation Error: [ VUID-vkDestroyCommandPool-commandPool-00041 ] Object 0: handle = 0x6dbb9fa3d0, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0xad474cda | Attempt to destroy command pool with VkCommandBuffer 0x6dbb9fa3d0[] which is in use. The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state

Expected behavior

The test passes without any VVL error messages.

Valid Usage ID

VUID-vkDestroyFence-fence-01120, VUID-vkDestroyPipeline-pipeline-00765, VUID-vkDestroyBuffer-buffer-00922, VUID-vkDestroyRenderPass-renderPass-00873, VUID-vkDestroyImageView-imageView-01026, VUID-vkDestroyCommandPool-commandPool-00041

Additional context

In the reproduction case, the application calls vkDestroy*() to clean up all of its resources after the Vulkan device is lost. According to the spec: "When a device is lost, its child objects are not implicitly destroyed and their handles are still valid. Those objects must still be destroyed before their parents or the device can be destroyed (see the Object Lifetime section)." This means that once the device is lost, the application should still be able to destroy its Vulkan objects, even if the submitted commands never finished executing because of the device loss.

In short, can we add a vulkan device lost check when processing vkDestroy*() calls, and not throw the VUID-vkDestroy* errors if the vulkan device is already lost?
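
To make the request concrete, here is a minimal sketch of the check being asked for. It is not VVL code: `DeviceState`, `TrackedObject`, and `ShouldReportInUse` are hypothetical names invented for illustration, and the real layer tracks object state quite differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified state; these are not the real VVL data structures.
struct DeviceState {
    bool lost = false;  // set once any call has returned VK_ERROR_DEVICE_LOST
};

struct TrackedObject {
    std::uint64_t handle = 0;
    int pending_uses = 0;  // submitted work that still references the object
};

// Returns true if a VUID-vkDestroy* "in use" error should be reported
// for a vkDestroy*() call on `object`.
bool ShouldReportInUse(const DeviceState& device, const TrackedObject& object) {
    assert(object.pending_uses >= 0);
    if (device.lost) {
        // Pending work can never complete on a lost device, and the spec
        // still requires child objects to be destroyed, so skip the check.
        return false;
    }
    return object.pending_uses > 0;
}
```

With this shape, the same fence that triggers VUID-vkDestroyFence-fence-01120 on a healthy device would be destroyed silently once the device has been marked as lost.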

HuYuxin commented 10 months ago

Is there any plan to address this in the near future?

spencer-lunarg commented 10 months ago

Sorry, I can add this to my plate for the week.

HuYuxin commented 10 months ago

Thank you! No worries, just want to follow-up so that we can plan accordingly on our side.

spencer-lunarg commented 10 months ago

(making notes) I think the way forward on this is to track every return of VK_ERROR_DEVICE_LOST and, from there, un-mark all the objects as "being used" in the object tracker.
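
A rough sketch of that idea, under stated assumptions: `InUseTracker` and its methods are hypothetical and do not correspond to the actual object tracker in the layers. The `Result` enum is a stand-in for `VkResult`; only the numeric value of `VK_ERROR_DEVICE_LOST` (-4) is taken from the Vulkan headers.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Stand-in for VkResult so the sketch builds without the Vulkan headers.
enum Result : std::int32_t { kSuccess = 0, kErrorDeviceLost = -4 };

// Hypothetical tracker: counts pending uses per object handle.
class InUseTracker {
  public:
    // Called when submitted work starts referencing an object.
    void MarkInUse(std::uint64_t handle) { ++in_use_[handle]; }

    // Called when one piece of submitted work referencing the object retires.
    void MarkCompleted(std::uint64_t handle) {
        auto it = in_use_.find(handle);
        assert(it != in_use_.end() && it->second > 0);
        --it->second;
    }

    // Called with the return value of every command that can report device loss.
    void RecordResult(Result result) {
        if (result == kErrorDeviceLost) {
            device_lost_ = true;
            in_use_.clear();  // nothing can still be executing on a lost device
        }
    }

    bool IsInUse(std::uint64_t handle) const {
        auto it = in_use_.find(handle);
        return it != in_use_.end() && it->second > 0;
    }

    bool device_lost() const { return device_lost_; }

  private:
    std::unordered_map<std::uint64_t, int> in_use_;
    bool device_lost_ = false;
};
```

Clearing the state at the point where the error is returned means the individual vkDestroy*() checks need no special case; they simply stop seeing the objects as in use.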

spencer-lunarg commented 10 months ago

@HuYuxin I tried looking at this a bit more, but it is hard for me to reproduce. I wasn't able to get ANGLE built for Android locally; I will try again tomorrow.

HuYuxin commented 9 months ago

Thank you @spencer-lunarg for the work! Please let me know if you need help with building ANGLE for Android.

Is the change the solution to the issue? I tried applying the change, but the test dEQP-EGL.functional.robustness.reset_context.shaders.out_of_bounds_non_robust.reset_status.writes.local_array.fragment still failed with the same VVL error. Example error message: [ VUID-vkDestroyBuffer-buffer-00922 ] Validation Error: [ VUID-vkDestroyBuffer-buffer-00922 ] | MessageID = 0xe4549c11 | vkDestroyBuffer(): can't be called on VkBuffer 0x1a000000001a[] that is currently in use by VkCommandBuffer 0x780cab6ad0[].

HuYuxin commented 6 months ago

Hi @spencer-lunarg, can I ask for an update on this ticket? I just want to check whether you need any help from us (getting the ANGLE build for Android running, reproducing the issue, etc.) to push this ticket forward.

spencer-lunarg commented 6 months ago

First, my apologies. We just got the SDK branch done, so I have some time to look at this now.

I tried running ./angle_deqp_egl_tests --gtest_filter=dEQP-EGL.functional.robustness.reset_context.shaders.out_of_bounds_non_robust.reset_status.writes.local_array.fragment --verbose --local-output --num-retries=0 --skip-clear-data on my Linux RADV Mesa machine but don't see the issue. I have an Android 13 Pixel device; it will just take some time to build and get that whole environment set up and working.

In the meantime, I think I was able to reproduce this with the MockICD on Linux by having a way to "force" a device lost... let me try that first, but the "core" issue is just adding better "Device Lost" support to the Validation Layers.

spencer-lunarg commented 6 months ago

@HuYuxin so I think I got it working... I just merged #7715

Can you confirm this fixes everything?

HuYuxin commented 6 months ago

Thank you @spencer-lunarg for working on this despite your busy schedule.

I applied your change, and most of the original VVL errors are gone, but one is still being thrown:

[ VUID-vkDestroyCommandPool-commandPool-00041 ] Validation Error: [ VUID-vkDestroyCommandPool-commandPool-00041 ] Object 0: handle = 0x7a58aba050, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x20000000002, type = VK_OBJECT_TYPE_COMMAND_POOL; | MessageID = 0xad474cda | vkDestroyCommandPool():  (VkCommandBuffer 0x7a58aba050[]) is in use. The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-vkDestroyCommandPool-commandPool-00041)

Can this be fixed with a follow-up change?

Regarding reproducing the problem: not every Vulkan driver ends up with a device lost when there is an out-of-bounds write in the fragment shader, which is probably why you don't see it on your Linux RADV Mesa machine. Which Pixel device do you have? Would you be able to download and flash the Android image for it from https://developers.google.com/android/images? As for ANGLE, I can provide you with an ANGLE test APK that you can use directly without building ANGLE from scratch.

spencer-lunarg commented 6 months ago

@HuYuxin I see I missed the VkCommandPool... I can quickly fix that now

for the "reproduce case", while it is nice to have something on Android, I really want something that we will test in CI

Our CI runs with a MockICD driver and I added logic to have it return a DEVICE_LOST when I want, that way I can catch regressions
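
As an illustration of that testing approach only: the sketch below shows a mock driver that can be told to report device loss on demand, plus a regression-style check. `MockDriver`, `ForceDeviceLostOnNextSubmit`, and `CheckForcedDeviceLost` are hypothetical names; the actual MockICD hook is not shown in this thread and may look quite different.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for VkResult so the sketch builds without the Vulkan headers.
enum Result : std::int32_t { kSuccess = 0, kErrorDeviceLost = -4 };

// Hypothetical mock driver that can be forced into the "device lost" state.
class MockDriver {
  public:
    // Test-only switch: the next submit fails and the device stays lost.
    void ForceDeviceLostOnNextSubmit() { force_lost_ = true; }

    Result QueueSubmit() {
        if (force_lost_) {
            lost_ = true;
            force_lost_ = false;
        }
        if (lost_) return kErrorDeviceLost;
        ++completed_submits_;
        return kSuccess;
    }

    // Device loss is sticky: later calls keep reporting it.
    Result WaitIdle() const { return lost_ ? kErrorDeviceLost : kSuccess; }

    int completed_submits() const { return completed_submits_; }

  private:
    bool force_lost_ = false;
    bool lost_ = false;
    int completed_submits_ = 0;
};

// Regression-style check: after a forced loss, every later call reports it.
void CheckForcedDeviceLost() {
    MockDriver driver;
    assert(driver.QueueSubmit() == kSuccess);
    driver.ForceDeviceLostOnNextSubmit();
    assert(driver.QueueSubmit() == kErrorDeviceLost);
    assert(driver.WaitIdle() == kErrorDeviceLost);
    assert(driver.completed_submits() == 1);
}
```

Because the loss is triggered deterministically, a check like this does not depend on any particular GPU or driver actually faulting, which is what makes it suitable for CI.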

HuYuxin commented 6 months ago

Thank you @spencer-lunarg! I verified that with both changes (commit 1, commit 2) we no longer see any VVL errors. Thank you for helping fix this.