felixdoerre / primus_vk

Vulkan GPU-offloading layer
BSD 2-Clause "Simplified" License

Performance improvements by using VK_MEMORY_PROPERTY_HOST_CACHED_BIT for render_host_mem #78

KoykL closed this 3 years ago

KoykL commented 3 years ago

When I run theHunter: Call of the Wild, the FPS is stuck at around 10, and GPU utilization reported by nvidia-smi is about 30%. I profiled it and found that most of the time is spent in memcpy (maybe because I am using a 4K screen?).

I read online that reading from a mapped Vulkan memory object created with VK_MEMORY_PROPERTY_HOST_COHERENT_BIT is very slow. Since render_host_mem is only ever read from the host, maybe VK_MEMORY_PROPERTY_HOST_COHERENT_BIT is not needed.

I tried changing VK_MEMORY_PROPERTY_HOST_COHERENT_BIT to VK_MEMORY_PROPERTY_HOST_CACHED_BIT. That greatly reduced the time spent in memcpy. (I forgot the exact number, but the reduction is several orders of magnitude.) memcpy is no longer the bottleneck, my FPS jumps to ~30, and my GPU utilization goes up to 100%.
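A minimal sketch of the kind of memory-type selection this change implies. It is not primus_vk's actual code. The flag constants use the same values as Vulkan's VkMemoryPropertyFlagBits, but the `MemoryType` struct and the `find_memory_type` and `pick_readback_type` helpers are hypothetical stand-ins, so the example compiles without the Vulkan headers. It prefers a host-visible type that is also host-cached and falls back to a host-coherent one if no cached type is allowed:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins with the same values as VkMemoryPropertyFlagBits.
constexpr uint32_t DEVICE_LOCAL  = 0x1;
constexpr uint32_t HOST_VISIBLE  = 0x2;
constexpr uint32_t HOST_COHERENT = 0x4;
constexpr uint32_t HOST_CACHED   = 0x8;

// Hypothetical stand-in for VkMemoryType (only the field used here).
struct MemoryType { uint32_t propertyFlags; };

// Returns the index of the first memory type that is allowed by type_bits
// (as in VkMemoryRequirements::memoryTypeBits) and has all wanted flags,
// or -1 if there is none.
int find_memory_type(const std::vector<MemoryType>& types,
                     uint32_t type_bits, uint32_t wanted) {
  for (std::size_t i = 0; i < types.size() && i < 32; ++i) {
    const bool allowed = ((type_bits >> i) & 1u) != 0;
    if (allowed && (types[i].propertyFlags & wanted) == wanted) {
      return static_cast<int>(i);
    }
  }
  return -1;
}

// Prefer cached host memory for buffers the host only reads back;
// fall back to coherent host memory otherwise.
int pick_readback_type(const std::vector<MemoryType>& types,
                       uint32_t type_bits) {
  const int cached = find_memory_type(types, type_bits,
                                      HOST_VISIBLE | HOST_CACHED);
  if (cached >= 0) return cached;
  return find_memory_type(types, type_bits, HOST_VISIBLE | HOST_COHERENT);
}
```

With real Vulkan, the list of types would come from vkGetPhysicalDeviceMemoryProperties and the allowed bits from vkGetImageMemoryRequirements or vkGetBufferMemoryRequirements.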

I am new to Vulkan, and I am not sure if this is the correct thing to do. It does provide a significant performance improvement though, and the game still runs correctly.

Hardware:
Display GPU: Radeon Pro WX 5100
Render GPU: GeForce RTX 2080 Ti

felixdoerre commented 3 years ago

Wow, that looks impressive, and I think this change is OK. Quoting the relevant parts of the Vulkan spec:

VK_MEMORY_PROPERTY_HOST_COHERENT_BIT bit specifies that the host cache management commands vkFlushMappedMemoryRanges and vkInvalidateMappedMemoryRanges are not needed to flush host writes to the device or make device writes visible to the host, respectively.

VK_MEMORY_PROPERTY_HOST_CACHED_BIT bit specifies that memory allocated with this type is cached on the host. Host memory accesses to uncached memory are slower than to cached memory, however uncached memory is always host coherent.

Since we only read from the memory allocated with render_host_mem_flags, this change seems to be OK.
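One consequence of the quoted spec text is worth sketching: if the memory type that ends up being chosen is host-cached but not host-coherent, device writes have to be made visible with vkInvalidateMappedMemoryRanges before the host reads them, and the invalidated range has to respect the device's nonCoherentAtomSize limit. The helper below is a hedged illustration, not code from primus_vk. `MappedRange`, `needs_invalidate`, and `align_range` are hypothetical names, and the flag constant is a stand-in with the same value as the Vulkan one so the example compiles without the Vulkan headers:

```cpp
#include <cstdint>

// Stand-in with the same value as VK_MEMORY_PROPERTY_HOST_COHERENT_BIT.
constexpr uint32_t HOST_COHERENT = 0x4;

// Hypothetical stand-in for the offset/size pair of VkMappedMemoryRange.
struct MappedRange {
  uint64_t offset;
  uint64_t size;
};

// Non-coherent memory needs an explicit invalidate before host reads.
bool needs_invalidate(uint32_t property_flags) {
  return (property_flags & HOST_COHERENT) == 0;
}

// Expands [offset, offset + size) so that the offset is a multiple of
// atom (nonCoherentAtomSize) and the end is either a multiple of atom
// or clamped to the end of the allocation. Assumes atom > 0.
MappedRange align_range(uint64_t offset, uint64_t size,
                        uint64_t atom, uint64_t alloc_size) {
  const uint64_t begin = (offset / atom) * atom;
  uint64_t end = ((offset + size + atom - 1) / atom) * atom;
  if (end > alloc_size) end = alloc_size;
  return MappedRange{begin, end - begin};
}
```

On a memory type that reports both HOST_CACHED and HOST_COHERENT, `needs_invalidate` returns false and no extra call is required, which may be why the game keeps running correctly after the flag change on some hardware.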

Did the profiling markers help you? I have a small script for visualizing the output, but I never felt the need to commit it to the repository. It probably could have helped you. Running it on profiling output from before and after the change showed that memcpy is also faster on my hardware.

felixdoerre commented 3 years ago

I know it's too late now, as you already found out what to change :-), but here is the script I use to display the profiling data and see what is waiting on what, and where: https://github.com/felixdoerre/primus_vk/tree/master/profiling