gfx-rs / wgpu-rs

Rust bindings to wgpu native library
https://wgpu.rs

Reading back result from compute shader is slow on some devices #802

Closed: larsjarlvik closed this issue 3 years ago

larsjarlvik commented 3 years ago

Hi, I'm using a compute shader to calculate the height and normals for my procedural terrain. It returns an array of 6,558,721 vertices (at 24 bytes per vertex, roughly 150 MB), looking like this:

pub struct Vertex {
    pub position: [f32; 3],
    pub normal: [f32; 3],
}

I've tried this on 4 separate machines/configurations and the time to run it varies wildly; here are the results I got:

The code I'm using to read back the data looks like this:

let buffer_slice = buffer.slice(..);
// Ask for a read mapping, then block until the GPU has finished.
let buffer_future = buffer_slice.map_async(wgpu::MapMode::Read);
device.poll(wgpu::Maintain::Wait);

if let Ok(()) = buffer_future.await {
    let data = buffer_slice.get_mapped_range();

    // Reinterpret the mapped bytes as vertices (relies on Vertex having a
    // predictable layout, e.g. #[repr(C)]).
    let vertices = unsafe { data.align_to::<Vertex>().1 };
    // This copy out of the mapped range is what dominates the time.
    let vec_vertices = vertices.to_vec();
    plane::from_data(vec_vertices, size)
} else {
    panic!("Failed to generate terrain!")
}

It's vertices.to_vec() that is taking the time on some machines; the compute shader itself only takes ~500 ms. I suspect it's something that happens under the hood when I call to_vec(). I also noticed that the time it takes is linear in the length of the array (mapping half of it takes about half the time).

This is my first project in Rust and wgpu-rs, so my skills at debugging this kind of issue are a bit limited. Does anyone have an idea of what's causing this, or can anyone give me some pointers on how to find the culprit?

You can find the full source code here: https://github.com/larsjarlvik/wgpu-rs/blob/performance/src/pipelines/terrain/compute.rs#L156

kvark commented 3 years ago

The only sensible explanation here is that we are picking the wrong memory type for the readback staging buffer. Our memory allocation goes through gpu-alloc, so getting the properties of the memory type and inspecting them would be a good first step. If you can try debugging it, starting with this spot and at least report the used flags, that would be very helpful!
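
Concretely, a minimal sketch of what that probe could look like (the exact spot and variable names inside wgpu-core are assumptions; the dumps later in this thread show the output of such a dbg! in wgpu-core's instance.rs):

// Patch a local checkout of wgpu-core where the gfx-hal adapter is enumerated
// (src/instance.rs) and dump the memory types/heaps gpu-alloc can choose from.
use gfx_hal::adapter::PhysicalDevice as _; // brings memory_properties() into scope

// `adapter` here stands for the gfx-hal adapter at that spot.
let mem_props = adapter.physical_device.memory_properties();
dbg!(&mem_props);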

larsjarlvik commented 3 years ago

These are the flags being set:

UsageFlags::FAST_DEVICE_ACCESS
UsageFlags::HOST_ACCESS
UsageFlags::DOWNLOAD

kvark commented 3 years ago

The next question would be, what memory type is picked by gpu-alloc on your machine based on these flags? We'd want to know the properties of this type.

larsjarlvik commented 3 years ago

Hi, sorry for the slow reply. I ran the application on my laptop, which has both an NVIDIA and an Intel GPU, and this is the debug output for each of them:

Nvidia GTX 1050 Ti (~3800 ms):

[wgpu-core-0.7.0\src\instance.rs:485] &mem_props = MemoryProperties {
    memory_types: [
        MemoryType {
            properties: (empty),
            heap_index: 1,
        },
        MemoryType {
            properties: (empty),
            heap_index: 1,
        },
        MemoryType {
            properties: (empty),
            heap_index: 1,
        },
        MemoryType {
            properties: (empty),
            heap_index: 1,
        },
        MemoryType {
            properties: (empty),
            heap_index: 1,
        },
        MemoryType {
            properties: (empty),
            heap_index: 1,
        },
        MemoryType {
            properties: (empty),
            heap_index: 1,
        },
        MemoryType {
            properties: DEVICE_LOCAL,
            heap_index: 0,
        },
        MemoryType {
            properties: CPU_VISIBLE | COHERENT,
            heap_index: 1,
        },
        MemoryType {
            properties: CPU_VISIBLE | COHERENT | CPU_CACHED,
            heap_index: 1,
        },
        MemoryType {
            properties: DEVICE_LOCAL | CPU_VISIBLE | COHERENT,
            heap_index: 2,
        },
    ],
    memory_heaps: [
        MemoryHeap {
            size: 4214226944,
            flags: DEVICE_LOCAL,
        },
        MemoryHeap {
            size: 16954212352,
            flags: (empty),
        },
        MemoryHeap {
            size: 257949696,
            flags: DEVICE_LOCAL,
        },
    ],
}

Intel iGPU (~600ms):

[wgpu-core-0.7.0\src\instance.rs:485] &mem_props = MemoryProperties {
    memory_types: [
        MemoryType {
            properties: DEVICE_LOCAL | CPU_VISIBLE | COHERENT,
            heap_index: 0,
        },
        MemoryType {
            properties: DEVICE_LOCAL | CPU_VISIBLE | COHERENT | CPU_CACHED,
            heap_index: 0,
        },
        MemoryType {
            properties: DEVICE_LOCAL,
            heap_index: 0,
        },
    ],
    memory_heaps: [
        MemoryHeap {
            size: 17088432128,
            flags: DEVICE_LOCAL,
        },
    ],
}

larsjarlvik commented 3 years ago

Adding the results for my desktop computer (RTX 2070). There, running the compute shader and reading back the result takes ~9000 ms; hope this is what you're looking for:

[wgpu-core-0.7.0\src\instance.rs:485] &mem_props = MemoryProperties {
    memory_types: [
        MemoryType {
            properties: (empty),
            heap_index: 1,
        },
        MemoryType {
            properties: DEVICE_LOCAL,
            heap_index: 0,
        },
        MemoryType {
            properties: CPU_VISIBLE | COHERENT,
            heap_index: 1,
        },
        MemoryType {
            properties: CPU_VISIBLE | COHERENT | CPU_CACHED,
            heap_index: 1,
        },
        MemoryType {
            properties: DEVICE_LOCAL | CPU_VISIBLE | COHERENT,
            heap_index: 2,
        },
    ],
    memory_heaps: [
        MemoryHeap {
            size: 8421113856,
            flags: DEVICE_LOCAL,
        },
        MemoryHeap {
            size: 17149771776,
            flags: (empty),
        },
        MemoryHeap {
            size: 257949696,
            flags: DEVICE_LOCAL,
        },
    ],
}

kvark commented 3 years ago

@larsjarlvik thank you, this is very helpful!

Here is what I found so far:

The reason we request FAST_DEVICE_ACCESS is here:

let map_copy_flags =
    desc.usage & (Bu::MAP_READ | Bu::MAP_WRITE | Bu::COPY_SRC | Bu::COPY_DST);
// map_flags holds the MAP_READ | MAP_WRITE bits of desc.usage, computed just
// above this snippet in wgpu-core.
if map_flags.is_empty() || !(desc.usage - map_copy_flags).is_empty() {
    flags |= Uf::FAST_DEVICE_ACCESS;
}

If I understand correctly, here is the buffer creation in your code:

wgpu::BufferUsage::VERTEX | wgpu::BufferUsage::STORAGE | wgpu::BufferUsage::MAP_READ | wgpu::BufferUsage::COPY_DST,

So you are enabling the MAPPABLE_PRIMARY_BUFFERS feature (which is non-portable to the web), and you are working with a buffer that needs to be GPU-visible and also readable. Some platforms simply don't have this kind of buffer memory, and that's what you see in the numbers: the allocator asks for FAST_DEVICE_ACCESS together with HOST_ACCESS | DOWNLOAD, and on a discrete card no memory type is both DEVICE_LOCAL and CPU_CACHED, so the mapping likely ends up in uncached (write-combined) memory, where CPU reads are very slow and scale linearly with the amount copied.

My strong recommendation is to stick with the minimal set of requested features, and to be especially careful with native-only features, since they aren't considered/discussed by the WebGPU group. For your case, you'd create a separate staging buffer for downloading the data, distinct from the buffer you use for STORAGE | VERTEX on the GPU.
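
Sketching that out (a minimal example against the wgpu 0.7-era API; the buffer names and buffer_size are illustrative, not from the original code):

// GPU-side buffer: no MAP_READ, so the allocator is free to place it in fast
// DEVICE_LOCAL memory.
let vertex_buffer = device.create_buffer(&wgpu::BufferDescriptor {
    label: Some("terrain-vertices"),
    size: buffer_size,
    usage: wgpu::BufferUsage::VERTEX
        | wgpu::BufferUsage::STORAGE
        | wgpu::BufferUsage::COPY_SRC,
    mapped_at_creation: false,
});

// CPU-side staging buffer: mappable only, so it can land in CPU_CACHED memory
// where reads are fast.
let staging_buffer = device.create_buffer(&wgpu::BufferDescriptor {
    label: Some("terrain-readback"),
    size: buffer_size,
    usage: wgpu::BufferUsage::MAP_READ | wgpu::BufferUsage::COPY_DST,
    mapped_at_creation: false,
});

// After the compute pass has written vertex_buffer, copy the result over
// explicitly, then map and read the staging buffer exactly as before.
let mut encoder = device.create_command_encoder(&wgpu::CommandEncoderDescriptor {
    label: Some("terrain-readback"),
});
// ... record the compute pass writing into vertex_buffer here ...
encoder.copy_buffer_to_buffer(&vertex_buffer, 0, &staging_buffer, 0, buffer_size);
queue.submit(Some(encoder.finish()));

let buffer_slice = staging_buffer.slice(..);
let buffer_future = buffer_slice.map_async(wgpu::MapMode::Read);
device.poll(wgpu::Maintain::Wait);
buffer_future.await.expect("failed to map staging buffer");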

If I try to think of what could be done on our side here, nothing particularly interesting comes to mind. Maybe @cwfitzgerald has other ideas.

larsjarlvik commented 3 years ago

Thanks for the very detailed response!

Yes, I'm using the buffer for both VERTEX and STORAGE since I need to know the elevation at points in my terrain while also using it for rendering. It shouldn't be too hard to refactor, and I think I'll have to no matter what. I'll update the thread with the results once it's done.

kvark commented 3 years ago

To clarify, VERTEX | STORAGE is totally fine. It's mixing them with MAP_XX that is not fine.

cwfitzgerald commented 3 years ago

From what I've read and experienced, there is basically no real reason to use a mapped primary buffer on a discrete card. There will need to be a copy either way, so you might as well make it explicit in a command. On a UMA system, where MAPPABLE_PRIMARY_BUFFERS is useful, there's only one physical kind of memory, so it doesn't matter which logical one the allocator chooses.

Along those lines, I recommend heeding the warning that prints in the logs when you enable MAPPABLE_PRIMARY_BUFFERS on a discrete card; it's quite a footgun.
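
If you do want the feature on UMA systems, one option (a sketch, not from the thread) is to gate it on the adapter type at device creation:

// Request the native-only feature only where it's actually free: on an
// integrated (UMA) GPU all memory is both device-local and host-visible.
let info = adapter.get_info();
let features = if info.device_type == wgpu::DeviceType::IntegratedGpu {
    wgpu::Features::MAPPABLE_PRIMARY_BUFFERS
} else {
    wgpu::Features::empty()
};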

kvark commented 3 years ago

That's a healthy idea! It may need some plumbing, but I agree it would be very useful.

larsjarlvik commented 3 years ago

Hi, the performance issue I had is now solved. Huge thanks for all the help @kvark, and sorry if I wasted a bit of your time. The work on wgpu-rs is awesome and I look forward to seeing what's coming :)

Here's the commit that fixed my issue (together with some small cleanups) for reference in case someone else is running into the same issue: https://github.com/larsjarlvik/wgpu-rs/commit/68671ce5ce3cc25c4e6fdea666aa50d6b7f9e7d6