KhronosGroup / MoltenVK

MoltenVK is a Vulkan Portability implementation. It layers a subset of the high-performance, industry-standard Vulkan graphics and compute API over Apple's Metal graphics framework, enabling Vulkan applications to run on macOS, iOS and tvOS.

Corrupt buffers and attachments on discrete graphics cards #960

Open cklosters opened 4 years ago

cklosters commented 4 years ago

We are porting our engine from OpenGL to Vulkan and have it working cross-platform on Windows, Linux, and macOS. On macOS, however, I am experiencing a strange problem when trying to upload buffers and textures using MoltenVK. I find it hard to pinpoint the problem exactly, mostly because it only occurs on this specific configuration and no validation warnings or errors are issued.

I first posted this issue on the LunarG website; they think it's a driver issue, hence opening the issue here. They also thought it might be related to issue #832. If that's the case, it might be related to the MTLHeap usage.

I have tested the same code on a Mac mini (late 2014, Intel Iris, Catalina) where everything renders fine. The same applies to Ubuntu 20.04 (Intel UHD / GeForce 1660 Ti) and Windows 10 (GeForce 1660 Ti / Intel UHD). The only configuration that fails is the one mentioned in the title (macOS Catalina 10.15.5, Radeon Pro 570, iMac Retina 2017). All use the same SDK (1.2.141.2).

I'd like to solve the issue myself, but I'm afraid it's a driver issue, though I could be mistaken. When I inspect the captured Metal frame in Xcode, my geometry input buffers are the right size but empty, containing only values of 0.0. The same applies to the texture attachments: they are there, with the right dimensions, but don't contain the texture data. So all the buffers are created, only empty.

Inspecting the same frame on the Mac mini shows the same buffers and attachments, only filled. We use the most recent release of the Vulkan Memory Allocator (VMA) and recently compiled versions of glslang and SPIRV-Cross to convert GLSL shaders into SPIR-V bytecode. I can't use the glslang version that shipped with the SDK because it is compiled without RTTI information, which our engine unfortunately requires.

Is this a driver issue or a MoltenVK issue, given that it works on all other platforms and configurations? Or are there things I should double-check or verify? Since no warnings are issued, it's hard to figure out what goes wrong where.

Textures (static and dynamic) and meshes (static) use staging buffers to upload to GPU_ONLY memory. Meshes that are updated frequently use CPU_GPU (shared) memory. All uploads complete, and no warnings or errors are issued, neither by the VK_LAYER_KHRONOS_validation layer nor by the allocator. We use a separate command buffer at the beginning of the frame to upload all data and submit it to the queue. We use only one queue.
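For illustration, a minimal sketch of the staged-upload path described above, using the VMA API (names such as `allocator`, `uploadCmdBuffer`, `vertexData`, and `dataSize` are placeholders, not the actual engine code; error handling omitted):

    // Staging buffer: host-visible and host-coherent, used as the copy source.
    VkBufferCreateInfo stagingInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
    stagingInfo.size = dataSize;
    stagingInfo.usage = VK_BUFFER_USAGE_TRANSFER_SRC_BIT;
    stagingInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

    VmaAllocationCreateInfo stagingAllocInfo = {};
    stagingAllocInfo.usage = VMA_MEMORY_USAGE_CPU_ONLY; // guarantees HOST_VISIBLE | HOST_COHERENT

    VkBuffer stagingBuffer;
    VmaAllocation stagingAllocation;
    vmaCreateBuffer(allocator, &stagingInfo, &stagingAllocInfo, &stagingBuffer, &stagingAllocation, nullptr);

    // Map, copy the CPU-side data in, unmap.
    void* mapped = nullptr;
    vmaMapMemory(allocator, stagingAllocation, &mapped);
    memcpy(mapped, vertexData, dataSize);
    vmaUnmapMemory(allocator, stagingAllocation);

    // Destination buffer in device-local (GPU_ONLY) memory.
    VkBufferCreateInfo gpuInfo = stagingInfo;
    gpuInfo.usage = VK_BUFFER_USAGE_TRANSFER_DST_BIT | VK_BUFFER_USAGE_VERTEX_BUFFER_BIT;

    VmaAllocationCreateInfo gpuAllocInfo = {};
    gpuAllocInfo.usage = VMA_MEMORY_USAGE_GPU_ONLY;

    VkBuffer gpuBuffer;
    VmaAllocation gpuAllocation;
    vmaCreateBuffer(allocator, &gpuInfo, &gpuAllocInfo, &gpuBuffer, &gpuAllocation, nullptr);

    // Record the copy into the per-frame upload command buffer.
    VkBufferCopy region = { 0, 0, dataSize };
    vkCmdCopyBuffer(uploadCmdBuffer, stagingBuffer, gpuBuffer, 1, &region);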

When I launch our engine, it finds the validation layers and correct vulkan instance version, output:


    [info] initializing service: nap::RenderService
    [info] Applying layer: VK_LAYER_KHRONOS_validation
    [info] Vulkan instance version: 1.2.141
    [info] Vulkan requested version: 1.0.0
    [info] Found 1 GPU(s):
    [info] 0: AMD Radeon Pro 570, type: Discrete, version: 1.0
    [info] 0: Compatible
    [info] Selected device: 0
    [info] Max number of rasterization samples: 8
    [info] Sample rate shading: Supported
    [info] Applying device extension: VK_KHR_swapchain

I am pretty sure it has something to do with memory allocation. Interestingly enough, there are no issues rendering the user interface (ImGui, a slightly modified imgui_impl_vulkan), which doesn't use the VMA allocator, since it's a relatively small set of resources that is updated and uploaded. I slightly modified the ImGui implementation to display our custom textures, but since the textures are empty, nothing is displayed. The size and dimensions of the textures, however, are correct.

Any help is appreciated.

cklosters commented 4 years ago

After some further investigation and testing, I am pretty certain the issue is related to using the Vulkan Memory Allocator in combination with discrete graphics cards and MoltenVK.

A colleague ran our simple 'helloworld' demo on a MacBook Pro (Retina, 15-inch, Intel Iris Pro + NVIDIA GeForce GT 750M) and reported the same issue: when using the integrated graphics card, everything works as expected, including Retina support. When using the dedicated card, all geometry rendered using our engine is corrupt; the buffers are there but empty. See screenshots:

Intel Iris Pro: [screenshot]

GeForce GT 750M: [screenshot]

I therefore changed the title to no longer reference a single discrete card. Note that the GUI still renders correctly on the discrete card when using the standard ImGui Vulkan implementation. To verify the problem is related to memory allocation, I replaced the existing GUI allocation code with code that uses the Vulkan Memory Allocator. After this change the GUI buffers, including the font texture, are corrupt, which causes the GUI not to be displayed anymore.

This is the actual vertex buffer code change (simplified), using the Vulkan Memory Allocator instead of manual memory allocation and mapping. I included the original code (commented out) to show the difference. I attached the actual file.

    /**************************************
     VkBufferCreateInfo buffer_info = {};
     buffer_info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
     buffer_info.size = vertex_buffer_size_aligned;
     buffer_info.usage = usage;
     buffer_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
     err = vkCreateBuffer(v->Device, &buffer_info, v->AllocationCallbacks, &buffer);
     check_vk_result(err);

     VkMemoryRequirements req;
     vkGetBufferMemoryRequirements(v->Device, buffer, &req);
     g_BufferMemoryAlignment = (g_BufferMemoryAlignment > req.alignment) ? g_BufferMemoryAlignment : req.alignment;
     VkMemoryAllocateInfo alloc_info = {};
     alloc_info.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
     alloc_info.allocationSize = req.size;
     alloc_info.memoryTypeIndex = ImGui_ImplVulkan_MemoryType(VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, req.memoryTypeBits);
     err = vkAllocateMemory(v->Device, &alloc_info, v->AllocationCallbacks, &buffer_memory);
     check_vk_result(err);

     err = vkBindBufferMemory(v->Device, buffer, buffer_memory, 0);
     check_vk_result(err);
     p_buffer_size = new_size;
    ***************************************/

    // Create buffer information 
    VkBufferCreateInfo bufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
    bufferInfo.size = vertex_buffer_size_aligned;
    bufferInfo.usage = usage;
    bufferInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

    // Create allocation information
    VmaAllocationCreateInfo allocInfo = {};
    allocInfo.usage = VMA_MEMORY_USAGE_CPU_TO_GPU;
    allocInfo.flags = 0;

    // Create buffer
    VkResult result = vmaCreateBuffer(v->mAllocator, &bufferInfo, &allocInfo, &buffer_data.mBuffer, &buffer_data.mAllocation, &buffer_data.mAllocationInfo);
    check_vk_result(result);

    /**************************************
        err = vkMapMemory(v->Device, rb->VertexBufferMemory, 0, vertex_size, 0, (void**)(&vtx_dst));
        check_vk_result(err);
    ***************************************/

    err = vmaMapMemory(v->mAllocator, rb->mVertexBuffer.mAllocation, (void**)(&vtx_dst));
    check_vk_result(err);

    for (int n = 0; n < draw_data->CmdListsCount; n++)
    {
        const ImDrawList* cmd_list = draw_data->CmdLists[n];
        memcpy(vtx_dst, cmd_list->VtxBuffer.Data, cmd_list->VtxBuffer.Size * sizeof(ImDrawVert));
        vtx_dst += cmd_list->VtxBuffer.Size;
    }

    vmaUnmapMemory(v->mAllocator, rb->mVertexBuffer.mAllocation);

    /**************************************
        VkMappedMemoryRange range[1] = {};
        range[0].sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
        range[0].memory = rb->VertexBufferMemory;
        range[0].size = VK_WHOLE_SIZE;
        err = vkFlushMappedMemoryRanges(v->Device, 1, range);
        check_vk_result(err);
        vkUnmapMemory(v->Device, rb->VertexBufferMemory);
    ***************************************/

I tried various flags but nothing seems to work. I am aware there is a persistent-memory-mapping issue in MoltenVK when using the Vulkan Memory Allocator, but we're not using persistent mapping here, only when dealing with uniform buffers in our engine. All the vertex / index / image buffers are mapped before an update and unmapped after it. We would like to continue using the Vulkan Memory Allocator instead of writing our own. I believe this issue is related to MoltenVK because I don't expect a difference in behaviour between integrated and dedicated graphics cards when using a different allocator, but I might be mistaken. If more information is required, let me know.

Also: thanks for the great library and effort. As a small company we don't have the resources to build three different render back-ends, so this really helps!

imgui_impl_vulkan_vma_allocator.zip

billhollings commented 4 years ago

The Vulkan Memory Allocator has sometimes caused issues with memory allocation under MoltenVK.

I'll look into this when I get a chance with the info you've given so far.

Do you have a small sample app that can replicate this issue, that you could post to the cloud somewhere? If you do, but don't want to make a public posting, you can send a link to it to support@brenwill.com.

Alternatively, you could try modifying the Cube demo in the MoltenVK package (or SDK) to use VMA to try to trigger it there.

billhollings commented 4 years ago

Looking at your code above, there are some things you can check. Basically, it would be helpful to understand the Vulkan usage that VMA is producing compared to your manual Vulkan code.

This kind of thing can be important because, for example, on macOS the underlying Metal texture memory is never actually memory-coherent. We work a fair bit of magic to keep it as coherent as possible on memory unmaps and pipeline barriers to match Vulkan expectations, but it's not airtight.
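For reference, the spec-level contract here: memory that is HOST_VISIBLE but not HOST_COHERENT requires an explicit vkFlushMappedMemoryRanges after CPU writes, and that flush (or the unmap) is what gives an implementation like MoltenVK its chance to propagate the data. A generic sketch of that pattern (not MoltenVK internals; `device`, `memory`, `src`, and `size` are placeholders):

    // Pattern for host-visible, NON-coherent memory: map, write, flush, unmap.
    void* data = nullptr;
    vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &data);
    memcpy(data, src, size);

    VkMappedMemoryRange range = { VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE };
    range.memory = memory;
    range.offset = 0;             // offset/size must respect nonCoherentAtomSize
    range.size   = VK_WHOLE_SIZE;
    vkFlushMappedMemoryRanges(device, 1, &range);

    vkUnmapMemory(device, memory);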

marcel303 commented 4 years ago

Hi Bill,

I'm the second set of eyes on our end, looking into this issue.

After your very helpful reply, we dug into VMA and did a thorough review of which bits were getting set and how MoltenVK translates memory internally.

The short: We were missing VK_MEMORY_PROPERTY_HOST_COHERENT_BIT on our buffers, causing MoltenVK not to update our buffer contents. It worked fine on other operating systems with discrete graphics, presumably because they always give you host-coherent memory.

The long: MoltenVK seems to:

* Give you memory using MTLStorageModeShared when the host-coherent bit is set (write-combined?, uncached?, memory shared with the GPU).

* Give you memory using MTLStorageModeManaged (non-coherent, cached memory) when the bit is not set.

MTLStorageModeManaged doesn't share memory with the GPU. Metal keeps a copy in main memory that's easily accessible from the CPU. I'm not sure why it does this; presumably to give you fast CPU reads (which are dreadfully slow when using uncached memory). Buffers using MTLStorageModeManaged require manually flagging regions of modified contents, hence the need for flushes, which trigger [MTLBuffer didModifyRange:] when using Metal. With the flushes absent, the memory was never copied to GPU-visible memory.
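As a sketch of the alternative: had we kept the non-coherent memory type instead of requiring HOST_COHERENT, the flushes on the VMA side would presumably look like this (vmaFlushAllocation wraps vkFlushMappedMemoryRanges; `allocator`, `allocation`, `src`, and `size` are placeholders):

    // Writing through a host-visible but NON-coherent VMA allocation:
    void* mapped = nullptr;
    vmaMapMemory(allocator, allocation, &mapped);
    memcpy(mapped, src, size);

    // Required without HOST_COHERENT; on MoltenVK this is what ends up
    // triggering [MTLBuffer didModifyRange:] for managed buffers.
    vmaFlushAllocation(allocator, allocation, 0, VK_WHOLE_SIZE);

    vmaUnmapMemory(allocator, allocation);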

The example code over at https://vulkan-tutorial.com/ correctly demonstrates allocating memory using the VK_MEMORY_PROPERTY_HOST_COHERENT_BIT and VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT bits. I think we somehow lost setting those bits when switching allocations over to VMA.

The reason it works on macOS with integrated graphics is that MoltenVK always uses memory of type MTLStorageModeShared when the system (CPU + GPU) has a unified memory model. I think this is what threw us off guard, suspecting a driver bug initially instead of having a close look at our own code.
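A quick way to see this difference between integrated and discrete GPUs is to dump the memory types the device exposes and check which ones advertise HOST_COHERENT. A small diagnostic sketch (`physicalDevice` is a placeholder):

    // List each memory type's property flags as exposed by the driver.
    VkPhysicalDeviceMemoryProperties memProps = {};
    vkGetPhysicalDeviceMemoryProperties(physicalDevice, &memProps);
    for (uint32_t i = 0; i < memProps.memoryTypeCount; i++)
    {
        VkMemoryPropertyFlags flags = memProps.memoryTypes[i].propertyFlags;
        printf("type %u: device-local=%d host-visible=%d host-coherent=%d\n", i,
            (flags & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) != 0,
            (flags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) != 0,
            (flags & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) != 0);
    }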

@billhollings do you happen to know why MoltenVK uses MTLStorageModeManaged when HOST_COHERENT is not set? Is it to accelerate CPU reads?

Cheers, Marcel

cklosters commented 4 years ago

In addition to the remarks posted by @marcel303, this is our current fix:

        // Create buffer information 
        VkBufferCreateInfo bufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
        bufferInfo.size = size;
        bufferInfo.usage = bufferUsage;
        bufferInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

        // Create allocation information
        VmaAllocationCreateInfo allocInfo = {};
        allocInfo.usage = memoryUsage;
        allocInfo.flags = allocationFlags;
        allocInfo.requiredFlags = memoryUsage == VMA_MEMORY_USAGE_CPU_TO_GPU ? VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT : 0;

        // Create buffer
        VkResult result = vmaCreateBuffer(allocator, &bufferInfo, &allocInfo, &outBuffer.mBuffer, &outBuffer.mAllocation, &outBuffer.mAllocationInfo);
        if (!error.check(result == VK_SUCCESS, "Unable to create buffer, allocation failed"))
            return false;
        return true;

When CPU-to-GPU memory is required, we ensure the allocator uses coherent and visible memory. I assumed VK_MEMORY_PROPERTY_HOST_COHERENT_BIT would be set by the allocator when using VMA_MEMORY_USAGE_CPU_TO_GPU, but apparently it is not; it only guarantees HOST_VISIBLE.
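An alternative to forcing requiredFlags is to accept whatever memory type VMA picks and flush manually only when it turns out to be non-coherent. A sketch of that check, using vmaGetMemoryTypeProperties and vmaFlushAllocation from the VMA API:

    // After writing to the mapped allocation, flush only if the memory type
    // VMA chose is not host-coherent.
    VkMemoryPropertyFlags memFlags = 0;
    vmaGetMemoryTypeProperties(allocator, outBuffer.mAllocationInfo.memoryType, &memFlags);
    if ((memFlags & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) == 0)
        vmaFlushAllocation(allocator, outBuffer.mAllocation, 0, VK_WHOLE_SIZE);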

FuzzyQuils commented 1 year ago

> The short: We were missing VK_MEMORY_PROPERTY_HOST_COHERENT_BIT on our buffers, causing MoltenVK not to update our buffer contents. [...] With the flushes absent, the memory was never copied to GPU-visible memory.
>
> @billhollings do you happen to know why MoltenVK uses MTLStorageModeManaged when HOST_COHERENT is not set? Is it to accelerate CPU reads?

3 years later, and this just saved me from going insane with a 2013 iMac (NVIDIA GT 750M) rendering nothing; making sure HOST_COHERENT was set on all of my allocations that are written from the CPU made it work. At this point, though, is this an issue with VMA or MoltenVK?

I'll open an issue in the VMA repo if it's the former, as I feel this could be worked around on VMA's side.