RPCS3 / rpcs3

PlayStation 3 emulator and debugger
https://rpcs3.net/
GNU General Public License v2.0

Consider switching from VkEvent polling to VkFence for dma_fence #7046

Closed · linkmauve closed this issue 4 years ago

linkmauve commented 4 years ago

I don’t have a very good understanding of your Vulkan code yet, but this seems to be one of the biggest performance issues of the rsx thread (related to #6882).

VkEvent should only be used for host→device or device→device synchronisation, but the VKTextureCache uses it for device→host synchronisation, by polling on it in vk::wait_for_event().

I tried to make that change myself but so far I’ve failed.

kd-11 commented 4 years ago

Sorry, that is not how fences work on Vulkan.

kd-11 commented 4 years ago

I'll summarize the problems here for the discussion. Fences are heavyweight synchronization objects, kind of like a full mutex. At least on Linux, this involves flushing the command stream and setting up a DMA fence which will signal via interrupt. On Vulkan, this would break the command buffer approach, so fences are only used to signal CB completion (i.e. you cannot embed a fence between other instructions; it signals the end of a command sequence). This is not what we want, for other complicated reasons: we just need to know how far the GPU has gotten.

This is where an event comes in. It's not a sync object; it's more like an atomic/volatile variable, and you can build your own sync primitive from it with much better granularity than a fence. So why is this important? Consider a command stream with 4 tasks ABCD. Assuming A is the actual DMA flush request, we do not need to wait for D to finish, and we shouldn't have to. We can easily set up an event and watch for it to change, so we queue up A-Event-BCD.
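A minimal sketch of that A-Event-BCD ordering, for illustration only. The function and its parameters (cmd, dma_image, readback_buffer, copy_region, dma_event) are hypothetical placeholders, not rpcs3's actual code; the point is that the event is recorded immediately after the DMA copy, so the host can detect completion of A without waiting for the rest of the command buffer.

```cpp
#include <vulkan/vulkan.h>

// Hypothetical sketch: record the DMA flush (A), set an event right after it,
// then record the unrelated work (B, C, D) in the same command buffer.
void record_dma_flush_with_event(VkCommandBuffer cmd,
                                 VkImage dma_image,
                                 VkBuffer readback_buffer,
                                 const VkBufferImageCopy& copy_region,
                                 VkEvent dma_event)
{
    // A: the DMA flush the CPU is actually waiting on
    vkCmdCopyImageToBuffer(cmd, dma_image, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                           readback_buffer, 1, &copy_region);

    // Event: the GPU signals this as soon as the transfer stage has passed,
    // long before the remainder of the command buffer retires
    vkCmdSetEvent(cmd, dma_event, VK_PIPELINE_STAGE_TRANSFER_BIT);

    // B, C, D: other work recorded afterwards; the host does not need to
    // wait for any of it before reading readback_buffer
}
```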

Now onto the claims of performance issues from waiting on the event: this is untrue. Unless the driver is completely broken, the event for a command stream will always fire long before the fence is ready. Also, the interrupt-driven nature of fences implicitly means you get yield-like behavior again, which is not what we want, just like all the other issues where OS scheduling bogs performance down for us. A fence is basically a timed condition variable, and you saw what that did to performance last time.

It's also worth noting that the path taken to reach this code is in the most critical section of all: a page fault. This path is often taken by SPU jobs doing massive data processing, and time saved here matters because DMA jobs often hold up resources from other parts of the emulator. TL;DR: the 'profiler' will just show the new bottleneck as waiting for fences. The time lost here depends on how quickly your GPU can execute commands, and this is obviously not good for IGPs (which is what I assume you're using).
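A hedged host-side sketch of the two approaches being compared. The names are hypothetical; the event poll can return as soon as the GPU has passed the vkCmdSetEvent recorded after the DMA copy, while a fence cannot signal until the whole submission has retired, and vkWaitForFences may put the thread to sleep until the driver's interrupt fires.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Spin on the event: fine-grained, stays on the CPU with no kernel wait.
// (rpcs3's vk::wait_for_event wraps a similar loop with a timeout.)
void wait_dma_event(VkDevice device, VkEvent dma_event)
{
    while (vkGetEventStatus(device, dma_event) != VK_EVENT_SET)
    {
        // busy-wait until the GPU reaches the vkCmdSetEvent
    }
    vkResetEvent(device, dma_event);
}

// Wait on a fence: only signals once the entire command buffer is done,
// and the wait may yield the thread to the OS scheduler.
void wait_dma_fence(VkDevice device, VkFence cb_fence)
{
    vkWaitForFences(device, 1, &cb_fence, VK_TRUE, UINT64_MAX);
}
```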

Btw, this is another reason you really need to communicate more on Discord; it's the third time you have tried to reintroduce code that was already superseded. See https://github.com/RPCS3/rpcs3/commit/a49a0f2a86d2e4cdeff6e1d65d2cfdc76e8ed6ad

Xcedf commented 4 years ago

I was searching through some old code a few days ago and stumbled on commit a49a0f2; yesterday I found this issue. Here's my experiment: I commented out these lines:

```cpp
verify(HERE), vkGetEventStatus(m_device, dma_fence) == VK_EVENT_RESET;
```

and

```cpp
// Synchronize, reset dma_fence after waiting
vk::wait_for_event(dma_fence, GENERAL_WAIT_TIMEOUT);
vkResetEvent(m_device, dma_fence);
```

Normal build: 34 FPS. Hacked one: 36 FPS.

kd-11 commented 4 years ago

It's fast because you're skipping the wait for the job to actually finish, which introduces a race condition. Some games do not like that and will be very flickery if you do that.

Xcedf commented 4 years ago

Yeah, it's true: some games do fine, some have desyncs, and others don't even care. But the main thing is that it explains to me about half of the FPS lost in GoW since the texture cache rewrite.