GPUOpen-Drivers / AMD-Gfx-Drivers

Forum for AMD OpenGL and Vulkan graphics drivers

Vulkan, Poor performance due to barrier REGION_BIT being ignored causing full flush #10

Closed owenzhangzhengzhong closed 1 week ago

owenzhangzhengzhong commented 1 week ago

This happens when using a pipeline barrier from COLOR_ATTACHMENT_OUTPUT -> FRAGMENT_SHADER whose access masks are COLOR_ATTACHMENT (read/write) -> INPUT_ATTACHMENT_READ. The dependency is framebuffer-local, so the barrier is issued with VK_DEPENDENCY_BY_REGION_BIT.

Regardless of the flags passed, the driver performs a full barrier and invalidates everything, which probably causes everything to be written back to VRAM. This is VERY slow. Nvidia doesn't suffer from the same problem.

static void ColorBufferBarrier(GSTexture* rt)
{
    const VkImageMemoryBarrier barrier = {
        VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER, // sType
        nullptr,                                // pNext
        VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT, // srcAccessMask
        VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,    // dstAccessMask
        VK_IMAGE_LAYOUT_GENERAL,                // oldLayout
        VK_IMAGE_LAYOUT_GENERAL,                // newLayout (no transition)
        VK_QUEUE_FAMILY_IGNORED,                // srcQueueFamilyIndex
        VK_QUEUE_FAMILY_IGNORED,                // dstQueueFamilyIndex
        static_cast<GSTextureVK*>(rt)->GetTexture().GetImage(), // image
        {VK_IMAGE_ASPECT_COLOR_BIT, 0u, 1u, 0u, 1u}};           // subresourceRange

    // Framebuffer-local dependency: color output -> fragment shader read.
    vkCmdPipelineBarrier(g_vulkan_context->GetCurrentCommandBuffer(),
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
        VK_DEPENDENCY_BY_REGION_BIT, 0, nullptr, 0, nullptr, 1, &barrier);
}


owenzhangzhengzhong commented 1 week ago

Comment from Tyler Schneider:

For the original issue (VK_DEPENDENCY_BY_REGION_BIT), I tested pcsx2 with and without the bit set on 4 different configurations.

NVIDIA system: 3060ti
Build: Release AVX2

barrier, region bit: ~5ms frametime
barrier, no region bit: ~5ms frametime
no barrier: ~3.8ms frametime
full barrier: 6.81ms frametime

Intel integrated graphics:
Build: Release AVX2

barrier, region bit: ~12ms frametime
barrier, no region bit: ~12ms frametime
no barrier: ~6ms frametime
full barrier: ~10ms frametime

AMDVLK system: 7900XT
Build: Linux default (release)

barrier, region bit: ~10ms frametime
barrier, no region bit: ~10ms frametime
no barrier: ~4ms frametime
full barrier: ~20ms frametime

RADV system: 7900XT
Build: Linux default (release)

barrier, region bit: ~14ms frametime
barrier, no region bit: ~14ms frametime
no barrier: ~6.4ms frametime
full barrier: ~14ms frametime
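To compare the configurations on one scale, the figures above can be turned into a slowdown factor: frametime with the barrier divided by frametime without it. The following is only a small arithmetic sketch. The names `Result`, `kResults`, `BarrierSlowdown` and `PrintSlowdowns` are made up for illustration, and the inputs are the approximate numbers quoted above, so the ratios are approximate too.

```cpp
#include <cassert>
#include <cstdio>

// One row of the measurements quoted in this comment (values in ms).
struct Result {
    const char* config;
    double barrier_ms;     // "barrier, region bit" (identical without the bit)
    double no_barrier_ms;  // "no barrier"
};

// Approximate frametimes copied from the list above.
const Result kResults[] = {
    {"NVIDIA 3060ti", 5.0, 3.8},
    {"Intel integrated", 12.0, 6.0},
    {"AMDVLK 7900XT", 10.0, 4.0},
    {"RADV 7900XT", 14.0, 6.4},
};

// Ratio of frametime with the barrier to frametime without it.
double BarrierSlowdown(const Result& r) {
    assert(r.no_barrier_ms > 0.0);
    return r.barrier_ms / r.no_barrier_ms;
}

// Prints one line per configuration, e.g. "AMDVLK 7900XT  2.50x".
void PrintSlowdowns() {
    for (const Result& r : kResults)
        std::printf("%-18s %.2fx\n", r.config, BarrierSlowdown(r));
}
```

By this measure the barrier costs roughly 1.3x on the NVIDIA system, 2x on Intel, 2.5x on AMDVLK and 2.2x on RADV, with the region bit making no difference in any of the four cases.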

Based on this data, it seems reasonable to conclude that VK_DEPENDENCY_BY_REGION_BIT has no measurable effect for any of the main vendors (as we suspected, but now we have data for it).

I analyzed the barrier itself, and based on the source, it looks like it's stalling rendering after each draw command, probably because of a feedback loop. I remember DXVK having issues with feedback loops on our hardware a while ago, so maybe this barrier was inserted to fix artifacts? It's a super heavy barrier, so it's no surprise to me that it hurts performance this badly.

I think the issue could be related to metadata. I looked briefly at what VK_EXT_attachment_feedback_loop_layout does: it disables metadata for images we know are written to and sampled from in the same render pass. Another interesting data point would be how the barrier affects frametime when the driver knows it's a feedback loop (and so disables metadata).

Anyway, compared to RADV, AMDVLK actually seems to perform better in this case. Maybe RADV doesn't bother optimizing for feedback loops that aren't declared as feedback loops (using VK_EXT_attachment_feedback_loop_layout), since its developers were the main authors of the extension.

Additionally, I modded the feedback loop extension into the pcsx2 source code just to see how it affected performance.

The 7900xt tests on AMDVLK (originally ~10ms) went down to ~6ms, which is roughly 1.7x the frame rate. I let the pcsx2 team know that the cause of their slow barriers is the feedback loop not being declared as a feedback loop.

Interestingly, the extension seems to have no effect on RADV on my test machine. Considering the extension was proposed by Valve (probably for RADV), maybe the RADV version shipped with Ubuntu 22/23 is too old (especially since 24.04 is out now).

owenzhangzhengzhong commented 1 week ago

The workaround is to use the feedback loop extension (VK_EXT_attachment_feedback_loop_layout).