KhronosGroup / MoltenVK

MoltenVK is a Vulkan Portability implementation. It layers a subset of the high-performance, industry-standard Vulkan graphics and compute API over Apple's Metal graphics framework, enabling Vulkan applications to run on macOS, iOS and tvOS.
Apache License 2.0

Update buffer data and use buffer device address in shader #1846

Open maksym-pasichnyk opened 1 year ago

maksym-pasichnyk commented 1 year ago

I'm trying to update a buffer by mapping HostVisible + HostCoherent memory and writing to it with memcpy, but the data seen inside the shader is wrong.

auto vertices = std::array{
    Vec2<f32>{-1.0F, -1.0F},
    Vec2<f32>{+1.0F, -1.0F},
    Vec2<f32>{-1.0F, +1.0F},
    Vec2<f32>{+1.0F, +1.0F},
};

auto indices = std::array{
    0u, 1u, 2u,
    2u, 1u, 3u
};

GpuBufferInfo index_buffer_info{};
GpuBufferInfo vertex_buffer_info{};

vulkan->AllocateTemporary(&index_buffer_info, sizeof(indices), alignof(u32));
vulkan->AllocateTemporary(&vertex_buffer_info, sizeof(vertices), alignof(Vec2<f32>));

The code below is not working:

// This code is not working
auto memory = vulkan->context.frame_allocators[vulkan->frame_index].allocation.device_memory;
void* mapped = vulkan->context.logical_device.mapMemory(memory, 0, VK_WHOLE_SIZE, vk::MemoryMapFlags());
std::memcpy(reinterpret_cast<u8*>(mapped) + index_buffer_info.offset, indices.data(), sizeof(indices));
std::memcpy(reinterpret_cast<u8*>(mapped) + vertex_buffer_info.offset, vertices.data(), sizeof(vertices));
vulkan->context.logical_device.unmapMemory(memory);

If I add the line below before mapMemory or after unmapMemory, everything works fine:

u32 tmp = 0;
cmd.updateBuffer(vulkan->context.frame_allocators[vulkan->frame_index].buffer, 0, sizeof(u32), &tmp);

Also, when I update the buffers with cmd.updateBuffer, everything works fine:

auto buffer = vulkan->context.frame_allocators[vulkan->frame_index].buffer;
cmd.updateBuffer(buffer, index_buffer_info.offset, sizeof(indices), indices.data());
cmd.updateBuffer(buffer, vertex_buffer_info.offset, sizeof(vertices), vertices.data());
billhollings commented 1 year ago

Are you sure the device memory is host coherent? It seems to be behaving as if the device memory isn't host coherent.

Is this working on other Vulkan platforms?

maksym-pasichnyk commented 1 year ago

@billhollings Yes, the memory is host coherent. I haven't tested on other Vulkan platforms; I'll check and reply later.

auto gpu_find_memory_type_index(GpuContext* context, u32 memory_type_bits, vk::MemoryPropertyFlags memory_property_flags) -> u32 {
    auto memory_properties = context->physical_device.getMemoryProperties();
    for (u32 i = 0; i < memory_properties.memoryTypeCount; ++i) {
        if ((memory_type_bits & (1 << i)) && (memory_properties.memoryTypes[i].propertyFlags & memory_property_flags) == memory_property_flags) {
            return i;
        }
    }
    return std::numeric_limits<u32>::max();
}

void gpu_allocate_memory(GpuContext* context, GpuAllocation* allocation, vk::MemoryRequirements memory_requirements, vk::MemoryPropertyFlags memory_property_flags, vk::MemoryAllocateFlags memory_allocate_flags) {
    auto memory_type_index = gpu_find_memory_type_index(context, memory_requirements.memoryTypeBits, memory_property_flags);

    auto memory_allocate_flags_info = vk::MemoryAllocateFlagsInfo()
        .setFlags(memory_allocate_flags);

    auto memory_allocate_info = vk::MemoryAllocateInfo()
        .setPNext(&memory_allocate_flags_info)
        .setAllocationSize(memory_requirements.size)
        .setMemoryTypeIndex(memory_type_index);

    allocation->device_memory = context->logical_device.allocateMemory(memory_allocate_info);
    allocation->memory_requirements = memory_requirements;
    allocation->memory_property_flags = memory_property_flags;
}

void gpu_create_allocator(GpuContext* context, GpuLinearAllocator* allocator, vk::DeviceSize capacity) {
    vk::BufferUsageFlags buffer_usage_flags = {};
    buffer_usage_flags |= vk::BufferUsageFlagBits::eTransferSrc;
    buffer_usage_flags |= vk::BufferUsageFlagBits::eTransferDst;
    buffer_usage_flags |= vk::BufferUsageFlagBits::eUniformTexelBuffer;
    buffer_usage_flags |= vk::BufferUsageFlagBits::eStorageTexelBuffer;
    buffer_usage_flags |= vk::BufferUsageFlagBits::eUniformBuffer;
    buffer_usage_flags |= vk::BufferUsageFlagBits::eStorageBuffer;
    buffer_usage_flags |= vk::BufferUsageFlagBits::eIndexBuffer;
    buffer_usage_flags |= vk::BufferUsageFlagBits::eVertexBuffer;
    buffer_usage_flags |= vk::BufferUsageFlagBits::eIndirectBuffer;
    buffer_usage_flags |= vk::BufferUsageFlagBits::eShaderDeviceAddress;

    auto buffer_create_info = vk::BufferCreateInfo()
        .setSize(capacity)
        .setUsage(buffer_usage_flags)
        .setSharingMode(vk::SharingMode::eExclusive);

    allocator->buffer = context->logical_device.createBuffer(buffer_create_info);
    allocator->capacity = capacity;
    allocator->offset = 0u;

    vk::MemoryAllocateFlags memory_allocate_flags = {};
    memory_allocate_flags |= vk::MemoryAllocateFlagBits::eDeviceAddress;

    auto memory_requirements = context->logical_device.getBufferMemoryRequirements(allocator->buffer);

    vk::MemoryPropertyFlags memory_property_flags = {};
    memory_property_flags |= vk::MemoryPropertyFlagBits::eHostVisible;
    memory_property_flags |= vk::MemoryPropertyFlagBits::eHostCoherent;

    gpu_allocate_memory(context, &allocator->allocation, memory_requirements, memory_property_flags, memory_allocate_flags);
    context->logical_device.bindBufferMemory(allocator->buffer, allocator->allocation.device_memory, 0);

    auto buffer_device_address_info = vk::BufferDeviceAddressInfo()
        .setBuffer(allocator->buffer);

    allocator->device_address = context->logical_device.getBufferAddress(buffer_device_address_info);
}
maksym-pasichnyk commented 1 year ago

@billhollings This code works on Windows:

auto memory = vulkan->context.frame_allocators[vulkan->frame_index].allocation.device_memory;
void* mapped = vulkan->context.logical_device.mapMemory(memory, 0, VK_WHOLE_SIZE, vk::MemoryMapFlags());
std::memcpy(reinterpret_cast<u8*>(mapped) + index_buffer_info.offset, indices.data(), sizeof(indices));
std::memcpy(reinterpret_cast<u8*>(mapped) + vertex_buffer_info.offset, vertices.data(), sizeof(vertices));
vulkan->context.logical_device.unmapMemory(memory);
billhollings commented 1 year ago

Odd.

Do you have a small demo app you can post or link to so we can replicate the issue here?

maksym-pasichnyk commented 1 year ago

@billhollings https://github.com/maksym-pasichnyk/vulkan-gpu

billhollings commented 1 year ago

> @billhollings https://github.com/maksym-pasichnyk/vulkan-gpu

I downloaded your app and can open it. During opening, there are 3 calls to vkMapMemory() and one call to vkUnmapMemory(), none of which seems to originate in the code you posted above.

The graphical menu system that appears in this app is quite large. Can you provide instructions on how to replicate the problem you described above?

maksym-pasichnyk commented 1 year ago

This program is a compute-shader rasterizer. If you enable the "Use memcpy" toggle, memcpy is used instead of cmd.updateBuffer. The UI disappears after a few frames on macOS, but keeps working on Windows.

[Screenshot attached: 2023-03-09 at 09:50:40]
maksym-pasichnyk commented 1 year ago

https://user-images.githubusercontent.com/11374170/223957204-c2522b5d-ab6b-4f61-a1de-eed84502c281.mov

billhollings commented 1 year ago

Sorry for the delay. We've been busy getting a release ready for the upcoming SDK.

The problem was repeatable in the project you posted above, but the code in main.cpp in that project is quite different from the code you posted previously.

I'll comment on the code in the project, which includes:

            for (auto cmd_list : std::span(draw_data->CmdLists, draw_data->CmdListsCount)) {
                auto vtx_buffer_size = cmd_list->VtxBuffer.Size * sizeof(ImDrawVert);
                auto idx_buffer_size = cmd_list->IdxBuffer.Size * sizeof(ImDrawIdx);

                GpuBufferInfo vtx_buffer_info;
                GpuBufferInfo idx_buffer_info;
                if (!vulkan->AllocateTemporary(&vtx_buffer_info, vtx_buffer_size, alignof(ImDrawVert))) {
                    fprintf(stderr, "Failed to allocate vertex buffer for ImGui\n");
                    continue;
                }

                if (!vulkan->AllocateTemporary(&idx_buffer_info, idx_buffer_size, alignof(ImDrawIdx))) {
                    fprintf(stderr, "Failed to allocate index buffer for ImGui\n");
                    continue;
                }

                if (use_memcpy) {
                    std::memcpy(vtx_buffer_info.mapped, cmd_list->VtxBuffer.Data, vtx_buffer_size);
                    std::memcpy(idx_buffer_info.mapped, cmd_list->IdxBuffer.Data, idx_buffer_size);
                } else {
                    UpdateBuffer(cmd, &vtx_buffer_info, cmd_list->VtxBuffer.Data, vtx_buffer_size);
                    UpdateBuffer(cmd, &idx_buffer_info, cmd_list->IdxBuffer.Data, idx_buffer_size);
                }

When we ran this code, draw_data->CmdListsCount was 2, so two separate cmd_lists are drawn in each frame, but vtx_buffer_info.mapped and idx_buffer_info.mapped depend only on the frame_index. That means the memcpy path overwrites the same device memory range with the data for both cmd_lists before that memory is used for either of them. Because the GPU doesn't see the device memory contents until the command buffer is executed, it uses the data from the second cmd_list for both.

The UpdateBuffer path works because a separate UpdateBuffer command is recorded into the command buffer for each cmd_list.

If I add to this code an index that tracks which cmd_list is being submitted, allow the memcpy path only for the first cmd_list in the frame, and let the second follow the UpdateBuffer path, everything works as expected.


[Edit:] I've just realized that vtx_buffer_info.mapped and idx_buffer_info.mapped are moved forward on each cmd_list iteration within a frame, so the problem is not as simple as I originally described here. But it's interesting that limiting the memcpy to only one cmd_list per frame fixes this.
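
For illustration, here is a rough sketch of that experiment, reusing the variable names from the snippet above; the cmd_list_index counter is introduced here for clarity and is not in the project's actual code:

// Sketch only -- not the project's actual code. cmd_list_index is illustrative.
uint32_t cmd_list_index = 0;
for (auto cmd_list : std::span(draw_data->CmdLists, draw_data->CmdListsCount)) {
    // ... allocate vtx_buffer_info / idx_buffer_info exactly as in the snippet above ...

    if (use_memcpy && cmd_list_index == 0) {
        // memcpy path allowed only for the first cmd_list in the frame.
        std::memcpy(vtx_buffer_info.mapped, cmd_list->VtxBuffer.Data, vtx_buffer_size);
        std::memcpy(idx_buffer_info.mapped, cmd_list->IdxBuffer.Data, idx_buffer_size);
    } else {
        // Every other cmd_list takes the UpdateBuffer path.
        UpdateBuffer(cmd, &vtx_buffer_info, cmd_list->VtxBuffer.Data, vtx_buffer_size);
        UpdateBuffer(cmd, &idx_buffer_info, cmd_list->IdxBuffer.Data, idx_buffer_size);
    }
    ++cmd_list_index;
}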

billhollings commented 1 year ago

Okay. I've figured out the actual problem here now.

The problem is that, despite the memory being "shared" between the CPU and GPU, when memcpy() is used to update the contents of an MTLBuffer and the shader reaches that MTLBuffer only through its gpuAddress, the GPU has no knowledge that the MTLBuffer is being used after being updated, so it doesn't refresh whatever GPU-side caches need to be refreshed.

This is a similar situation to using Metal argument buffers, and the solution is the same: calling Metal's useResource: for any MTLBuffer we "think" might be accessed by a shader without the buffer actually being declared to the shader via setBuffer:offset:atIndex:.

We'll have to track any VkDeviceMemory objects that are host-mapped and bound to buffers that a shader can reach through a device address, and then invoke useResource: with the underlying MTLBuffer whenever a pipeline that includes a shader that uses PhysicalStorageBufferAddresses is bound.

The reason vkCmdUpdateBuffer() avoids this is that it invokes a buffer BLIT operation on the GPU, thereby signalling to the GPU that its caches for that MTLBuffer should be updated.
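
For illustration only, here is a minimal metal-cpp sketch of the kind of call involved; this is not MoltenVK's implementation, just the concept: a buffer that a shader reaches only through its gpuAddress() is never declared to the encoder via setBuffer:offset:atIndex:, so it has to be made resident explicitly with useResource:.

#include <Metal/Metal.hpp>
#include <vector>

// Sketch, not MoltenVK code: declare residency for buffers that a shader may
// reach only through their GPU address (PhysicalStorageBufferAddresses).
void declare_device_address_buffers(MTL::RenderCommandEncoder* encoder,
                                    const std::vector<MTL::Buffer*>& buffers) {
    for (MTL::Buffer* buffer : buffers) {
        // Equivalent to Objective-C [encoder useResource:buffer usage:MTLResourceUsageRead].
        encoder->useResource(buffer, MTL::ResourceUsageRead);
    }
}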

maksym-pasichnyk commented 1 year ago

Glad to hear it's not a problem in my code, but sad that it's actually a bug.

maksym-pasichnyk commented 1 year ago

I'm wondering why this issue exists. I thought host-coherent memory was actually MTLStorageModeShared, and there was no need for additional GPU notification.

billhollings commented 1 year ago

> I thought host-coherent memory was actually MTLStorageModeShared, and there was no need for additional GPU notification

It is indeed using MTLStorageModeShared.

But I believe this issue is about letting the GPU know that the resource is being used by a render or compute pass. In that sense, it is similar to using the resources in argument buffers. The encoder useResource: family of methods is used to tell the GPU to make a resource "resident" during a render or compute pass. I interpret "resident" to mean loaded into, or referenceable from, on-chip caches.

billhollings commented 1 year ago

PR #1868 fixes this. Please retest with the latest MoltenVK and close this issue if it is fixed.

maksym-pasichnyk commented 11 months ago

Sorry for the late response. This code now works with the latest MoltenVK, but I found a new case where the problem still exists. I'll create a small demo app to reproduce it.

johnpayne-dev commented 9 months ago

Hello, I'm experiencing an issue with MoltenVK that I believe is related to this bug. I am trying to render a point cloud using buffer references with this vertex shader:

#version 460
#extension GL_EXT_buffer_reference : require

layout (location = 0) out vec4 vertex_color;

layout (buffer_reference) readonly buffer vertex_buffer_t {
    float data[];
};

layout (push_constant) uniform push_constant_t {
    mat4 transform;
    vertex_buffer_t vertex_buffer;
} constants;

vec3 get_vertex_position(uint i)
{
    return vec3(constants.vertex_buffer.data[3 * i + 0],
                constants.vertex_buffer.data[3 * i + 1],
                constants.vertex_buffer.data[3 * i + 2]);
}

void main()
{
    vec3 position = get_vertex_position(gl_VertexIndex);
    gl_Position = constants.transform * vec4(position, 1.0);
    gl_PointSize = 10.0;
    vertex_color = vec4(1.0);
}

but it only works for a few frames before constants.vertex_buffer.data starts evaluating to 0, similar to the clip that @maksym-pasichnyk posted. The point cloud data is uploaded by memcpy'ing it into a mapped staging buffer and then calling vkCmdCopyBuffer to copy it to the GPU.

I am using Vulkan SDK v1.3.168.1, and this code works with no issues on Windows/NVIDIA drivers. I am also using VK_KHR_dynamic_rendering, if that influences anything.
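
For context, here is a minimal host-side sketch of the upload-and-draw path described above, written against vulkan.hpp; the struct and function names are mine rather than taken from the actual app, and synchronization is reduced to a single barrier:

#include <vulkan/vulkan.hpp>
#include <cstring>

// Matches the shader's push_constant block above: a mat4 followed by a
// 64-bit buffer device address (the vertex_buffer_t reference).
struct PushConstants {
    float             transform[16];
    vk::DeviceAddress vertex_buffer;
};

void upload_and_draw(vk::Device device, vk::CommandBuffer cmd, vk::PipelineLayout layout,
                     vk::Buffer staging, vk::DeviceMemory staging_memory,
                     vk::Buffer point_buffer, const float* points, vk::DeviceSize size,
                     const float transform[16], uint32_t point_count) {
    // 1. memcpy the point data into the mapped, host-visible staging buffer.
    void* mapped = device.mapMemory(staging_memory, 0, size);
    std::memcpy(mapped, points, size);
    device.unmapMemory(staging_memory);

    // 2. Copy staging -> device-local point buffer on the GPU.
    cmd.copyBuffer(staging, point_buffer, vk::BufferCopy(0, 0, size));

    // 3. Make the transfer visible before the vertex shader reads the buffer
    //    through its device address.
    auto barrier = vk::MemoryBarrier()
        .setSrcAccessMask(vk::AccessFlagBits::eTransferWrite)
        .setDstAccessMask(vk::AccessFlagBits::eShaderRead);
    cmd.pipelineBarrier(vk::PipelineStageFlagBits::eTransfer,
                        vk::PipelineStageFlagBits::eVertexShader,
                        {}, barrier, {}, {});

    // 4. Push the transform and the buffer's device address, then draw
    //    (graphics pipeline assumed already bound, dynamic rendering active).
    PushConstants pc{};
    std::memcpy(pc.transform, transform, sizeof(pc.transform));
    pc.vertex_buffer = device.getBufferAddress(vk::BufferDeviceAddressInfo().setBuffer(point_buffer));
    cmd.pushConstants(layout, vk::ShaderStageFlagBits::eVertex, 0,
                      static_cast<uint32_t>(sizeof(pc)), &pc);
    cmd.draw(point_count, 1, 0, 0);
}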

kanerogers commented 7 months ago

@billhollings is this issue now solved by https://github.com/KhronosGroup/MoltenVK/pull/2128 ?

billhollings commented 7 months ago

> @billhollings is this issue now solved by #2128 ?

Unlikely. PR #2128 fixed an issue connected to push constants. This issue was triggered by an open vkMapMemory(). It was mostly solved with PR #1868, and marked so above, but a couple of additional concerns have been raised since, so it remains partially open for further evaluation.