felixdoerre / primus_vk

Vulkan GPU-offloading layer
BSD 2-Clause "Simplified" License
230 stars 18 forks source link

properly use cached mapped memory #79

Closed KoykL closed 3 years ago

KoykL commented 3 years ago

I realize that:

vkInvalidateMappedMemoryRanges guarantees that device writes to the memory ranges described by pMemoryRanges, which have been made available to the host memory domain using the VK_ACCESS_HOST_WRITE_BIT and VK_ACCESS_HOST_READ_BIT access types, are made visible to the host.

So It seems we indeed need to use InvalidateMappedMemoryRanges even for read only mapped region (render_host_mem).

On the other hand, we can also make display_src_mem also VK_MEMORY_PROPERTY_HOST_CACHED_BIT, we just need to call vkFlushMappedMemoryRanges after writing to that region.

Performance is similar after the change.

felixdoerre commented 3 years ago

Regarding the contents of the change: I am not sure what caching would make as difference for the display memory, however for my intel integrated graphics it cannot make a difference, as it only has exactly one memory type with these flags:

            MEMORY_PROPERTY_DEVICE_LOCAL_BIT
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
            MEMORY_PROPERTY_HOST_CACHED_BIT

So, changing the the requested flags didn't make a difference at all. So we would need a "local" graphics device with non-cached memory to understand a difference here.

Looking at the memory types for my Nvidia driver, I found these 3 host-visible:

    memoryTypes[8]:
        heapIndex     = 1
        propertyFlags = 0x0006: count = 2
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
        usable for:
            IMAGE_TILING_OPTIMAL:
                None
            IMAGE_TILING_LINEAR:
                color images
                (non-sparse, non-transient)
    memoryTypes[9]:
        heapIndex     = 1
        propertyFlags = 0x000e: count = 3
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
            MEMORY_PROPERTY_HOST_CACHED_BIT
        usable for:
            IMAGE_TILING_OPTIMAL:
                None
            IMAGE_TILING_LINEAR:
                color images
                (non-sparse, non-transient)
    memoryTypes[10]:
        heapIndex     = 2
        propertyFlags = 0x0007: count = 3
            MEMORY_PROPERTY_DEVICE_LOCAL_BIT
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
        usable for:
            IMAGE_TILING_OPTIMAL:
                None
            IMAGE_TILING_LINEAR:
                None

As all memoryTypes which are CACHED, are also COHERENT, there wasn't a need for InvalidateMappedMemoryRanges for my system.

The recommendations for choosing the memory types from the spec are:

As such, it is generally inadvisable to use device coherent or device uncached memory except when really needed.

So for the complete detection, we should probably (in order) choose:

  1. CACHED and not COHERENT (I didn't observe such a memory for any type)
  2. CACHED and COHERENT
  3. COHERENT

where we only consider memory types supporting the required image types. So in case of my NVIDIA card we would choose memory type 9 (as we do right now). If that wouldn't exist, fall back to type 8, but never even consider type 10.

I am thinking whether we could make logic to skip InvalidateMappedMemoryRanges/FlushMappedMemoryRanges, if the memory (luckily) has the MEMORY_PROPERTY_HOST_COHERENT_BIT set (cases 1 and 3), however I believe it will be very easy for a driver to skip whatever logic is inside that functions based on memory type flags. This is a cheap optimization that I believe every graphics driver would make, so the additional logic inside primus-vk would probably not be worth it.

What do you think of that plan? If you agree, and want to implement it, you can adujst this PR to match the described detection logic. Otherwise, I'd ask you to correct the "style"-remarks and merge it as-is, and implement the more elaborate detection myself later.

KoykL commented 3 years ago

(accidentally deleted all of my comment. Have to retype the whole thing.)

For display_mem, while my display GPU do have cached memory, VK_MEMORY_PROPERTY_HOST_CACHED_BIT actually makes write slower. (Maybe due to extra copy?) According to this article https://gpuopen.com/learn/vulkan-device-memory/, it seems either HOST_COHERENT or HOST_COHERENT | DEVICE_LOCAL is good for uploading data to GPU. So the original choice is good enough.

The article was saying HOST_COHERENT | DEVICE_LOCAL is good for frequently changing data. I tested that, and memcpy become about 30% slower. I guess the following GPU read will become faster, as the data is now on GPU. However, the whole thing probably doesn't matter for most users, as you mentioned there is only single memory type on integrated GPU.

For render_host_mem, I profiled with and without InvalidateMappedMemoryRanges. It does seem the whole memcpy process become about 30% slower with InvalidateMappedMemoryRanges, even though my render GPU's only host_cached memory type is also host_coherent.

The memcpy performance probably doesn't matter much though. On my hardware, the time between queuepresent and memcpy start is dominating by an order of magnitude.(specifically images[index].render_copy_fence.await(); images[index].render_copy_fence.reset();) 30% performance difference with memcpy has almost no impact on overall performance.

For complete memory type detection, I think your plan sounds great. However, I probably don't have time for that for now, so feel free to go ahead and take that.