GPUOpen-Drivers / AMD-Gfx-Drivers

Forum for AMD OpenGL and Vulkan graphics drivers

Issue with vkCmdPipelineBarrier and depth buffers. #8

Open Onhi opened 3 weeks ago

Onhi commented 3 weeks ago

Something is up with vkCmdPipelineBarrier... It's messing up the content of depth buffers.

GPU: RX 7900XTX Driver Version: 31.0.24033.1003 OS: Windows 11 Pro (10.0.22000 Build 22000)

Here's an example:

srcStageMask         VK_PIPELINE_STAGE_2_EARLY_FRAGMENT_TESTS_BIT | VK_PIPELINE_STAGE_2_LATE_FRAGMENT_TESTS_BIT
srcAccessMask        VK_ACCESS_2_MEMORY_READ_BIT | VK_ACCESS_2_MEMORY_WRITE_BIT
dstStageMask         VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT
dstAccessMask        VK_ACCESS_2_MEMORY_READ_BIT
oldLayout            VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
newLayout            VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL

Before: image

After: image

I know this looks like a problem with the application, but I am almost certain it is not. The Vulkan validation layers are not showing any warnings or errors, and RenderDoc can capture and replay the issue. The same code also runs flawlessly on team green hardware.

I can help find the issue by providing time, RenderDoc captures, Vulkan API dumps, code (possible but tricky), etc.

Thanks!

owenzhangzhengzhong commented 3 weeks ago

Hi @Onhi, Can you provide compilable source code that reproduces this problem? Anything else you can provide would also be helpful. Thanks, Owen

Onhi commented 3 weeks ago

I'll get to reducing the engine to a simple case demonstrating the issue. In the meantime, here's a link to a renderdoc capture (compatible with an RX 7900XTX) and a reduced API dump of the issue. In the following, the resource "camera_depth" is the one showcasing the issue. https://drive.google.com/drive/folders/13M_uPHq1WdBmEf0FVYVS0QKMV86O7pkb?usp=drive_link

Onhi commented 3 weeks ago

I got a version of the code ready, where should I put it?

jinjianrong commented 3 weeks ago

@Onhi You may create a repository in your github space https://github.com/Onhi?tab=repositories to put your code

Onhi commented 3 weeks ago

I can't put the code on GitHub. I've created a rar file in the Google Drive folder I linked just above so you can retrieve the code.

owenzhangzhengzhong commented 3 weeks ago

Thanks @Onhi, once I get access to the folder I will create an internal ticket for this issue.

Onhi commented 3 weeks ago

@owenzhangzhengzhong you should be able to access the folder now.

owenzhangzhengzhong commented 3 weeks ago

Hey @Onhi, I was able to access the folder, compile the code, and see the issue in the editor on my local system. I've created an internal ticket to track this issue, and we'll look into it. In the meantime, is it possible for you to simplify the test case? There's a lot of source to go through. If you can reduce it as much as possible, to just the rendering with the vkCmdPipelineBarrier that's causing the corruption, that would make it much easier for us to investigate. Thanks, Owen

Onhi commented 3 weeks ago

Cool, thanks Owen, I'll do my best to reduce the code to a minimum. Might take some time but I'll get to it during the weekend.

Onhi commented 2 weeks ago

Hi, I'm having a hard time reducing the engine further than what I have provided while still keeping a solid repro case (we need to move a camera to change the range of captured information and/or resize images, so it's not trivial to isolate).

Doing more tests, I see irregular behavior around depth/stencil images and layout transition barriers. On some captures it's the depth that is blocky and bugged (like in the rdc capture I provided); other times it's the stencil aspect of the image that is destroyed.

Full disclosure: in my effort to reduce the size of the code, I noticed an issue with one of the barrier pairs in the code I provided. Around the OIT pass, I wrongfully removed the pair of barriers that were converting from shader read to depth stencil and vice versa. Still, fixing it had no impact on the issue.

I understand that this is a foreign code base so I would like to offer help debugging the issue if needed. (via video conf. or otherwise).

owenzhangzhengzhong commented 2 weeks ago

Hi @Onhi, A cursory look with synchronization validation enabled shows some errors:

SYNC-HAZARD-READ-AFTER-WRITE
SYNC-HAZARD-WRITE-AFTER-WRITE

They are associated with camera_depth and camera_image_view. Can you look at how you're setting up the pipeline barriers associated with those command submissions? See the attached full log of validation errors: Editor.txt

Meanwhile we'll look further as well, Owen

Onhi commented 2 weeks ago

Fixed the validation errors using store op none (instead of store) on the read only usage of the camera_depth buffer in the OIT pass.

Sadly it didn't fix the main issue.

owenzhangzhengzhong commented 2 weeks ago

Did you fix all the validation issues? Can you post that code as well? With details on which lines you updated?

Onhi commented 2 weeks ago

Hi, I've uploaded a v2 of the reduced engine with fixes for validation errors and details of lines changed.

Onhi commented 2 weeks ago

I've updated the version again with more fixes (V3). This version passes queue synchronization validation and the synchronization2 layer without errors or warnings.

The problem with depth is still unresolved. :(

owenzhangzhengzhong commented 1 week ago

Hi @Onhi, Brief update: the issue no longer appears when I trigger a CmdBarrier call after every command on the driver side.

Before: image

After: image (this is after resizing the window for a bit)

After narrowing it down a bit it seems related to missing barriers after commands associated with direct dispatch and transfer copies involving an image. Still narrowing it down further.

Onhi commented 4 days ago

Wow! Very nice progress! Thanks for the update.

owenzhangzhengzhong commented 1 day ago

Hi @Onhi,

After narrowing it down further I see the issue resolved by adding this barrier:

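// Global memory barrier used as a workaround.
// Note: VkMemoryBarrier and vkCmdPipelineBarrier are the original (non-synchronization2) API;
// the VK_ACCESS_2_* and VK_PIPELINE_STAGE_2_* constants below have the same numeric values
// as their non-_2 counterparts.
VkMemoryBarrier mem_barrier{};
mem_barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
mem_barrier.srcAccessMask = VK_ACCESS_2_COLOR_ATTACHMENT_READ_BIT;
mem_barrier.dstAccessMask = VK_ACCESS_2_NONE;

vkCmdPipelineBarrier(command_buffer,
                     VK_PIPELINE_STAGE_NONE,                  // srcStageMask
                     VK_PIPELINE_STAGE_2_TOP_OF_PIPE_BIT_KHR, // dstStageMask
                     0,                                       // dependencyFlags
                     1,                                       // memoryBarrierCount
                     &mem_barrier,                            // pMemoryBarriers
                     0,                                       // bufferMemoryBarrierCount
                     nullptr,                                 // pBufferMemoryBarriers
                     0,                                       // imageMemoryBarrierCount
                     nullptr);                                // pImageMemoryBarriers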

In the file AMDReproEngine\Oasis\code\Engine\Vulkan\Vulkan.Interface.h, in the following functions, after these Vulkan API calls:

void copyImage(... After: vkCmdCopyImage2(command_buffer, &copy_image_info);

void copyBufferToImage(... After: vkCmdCopyBufferToImage2(command_buffer, &copy_buffer_to_image_info);

void copyImageToBuffer(... After: vkCmdCopyImageToBuffer2(command_buffer, &copy_image_to_buffer_info);

void blitImage(... After: vkCmdBlitImage2(command_buffer, &blit_image_info);

void dispatch(... After: vkCmdDispatch(command_buffer, p_group_count_x, p_group_count_y, p_group_count_z);

I see the issue mostly resolved after adding the barrier in copyImage and dispatch. I suspect further optimizations can be made.