Closed · larsjarlvik closed this issue 3 years ago
The only sensible explanation here is that we are picking the wrong memory type for the readback staging buffer. Our memory allocation goes through gpu-alloc, so getting the properties of the memory type and inspecting them would be a good first step. If you can try debugging it, starting at this spot, and at least report the flags used, that would be very helpful!
These are the flags being set:
UsageFlags::FAST_DEVICE_ACCESS
UsageFlags::HOST_ACCESS
UsageFlags::DOWNLOAD
The next question would be, what memory type is picked by gpu-alloc on your machine based on these flags? We'd want to know the properties of this type.
Hi, sorry for the slow reply. I ran the application on my laptop, which has both an Nvidia and an Intel GPU; here is the debug output for each of them:
Nvidia 1050 Ti (~3800ms):
[wgpu-core-0.7.0\src\instance.rs:485] &mem_props = MemoryProperties {
    memory_types: [
        MemoryType { properties: (empty), heap_index: 1 },
        MemoryType { properties: (empty), heap_index: 1 },
        MemoryType { properties: (empty), heap_index: 1 },
        MemoryType { properties: (empty), heap_index: 1 },
        MemoryType { properties: (empty), heap_index: 1 },
        MemoryType { properties: (empty), heap_index: 1 },
        MemoryType { properties: (empty), heap_index: 1 },
        MemoryType { properties: DEVICE_LOCAL, heap_index: 0 },
        MemoryType { properties: CPU_VISIBLE | COHERENT, heap_index: 1 },
        MemoryType { properties: CPU_VISIBLE | COHERENT | CPU_CACHED, heap_index: 1 },
        MemoryType { properties: DEVICE_LOCAL | CPU_VISIBLE | COHERENT, heap_index: 2 },
    ],
    memory_heaps: [
        MemoryHeap { size: 4214226944, flags: DEVICE_LOCAL },
        MemoryHeap { size: 16954212352, flags: (empty) },
        MemoryHeap { size: 257949696, flags: DEVICE_LOCAL },
    ],
}
Intel iGPU (~600ms):
[wgpu-core-0.7.0\src\instance.rs:485] &mem_props = MemoryProperties {
    memory_types: [
        MemoryType { properties: DEVICE_LOCAL | CPU_VISIBLE | COHERENT, heap_index: 0 },
        MemoryType { properties: DEVICE_LOCAL | CPU_VISIBLE | COHERENT | CPU_CACHED, heap_index: 0 },
        MemoryType { properties: DEVICE_LOCAL, heap_index: 0 },
    ],
    memory_heaps: [
        MemoryHeap { size: 17088432128, flags: DEVICE_LOCAL },
    ],
}
Adding the results for my desktop computer (RTX 2070). It takes ~9000ms to run the compute shader, hope this is what you're looking for:
[wgpu-core-0.7.0\src\instance.rs:485] &mem_props = MemoryProperties {
    memory_types: [
        MemoryType { properties: (empty), heap_index: 1 },
        MemoryType { properties: DEVICE_LOCAL, heap_index: 0 },
        MemoryType { properties: CPU_VISIBLE | COHERENT, heap_index: 1 },
        MemoryType { properties: CPU_VISIBLE | COHERENT | CPU_CACHED, heap_index: 1 },
        MemoryType { properties: DEVICE_LOCAL | CPU_VISIBLE | COHERENT, heap_index: 2 },
    ],
    memory_heaps: [
        MemoryHeap { size: 8421113856, flags: DEVICE_LOCAL },
        MemoryHeap { size: 17149771776, flags: (empty) },
        MemoryHeap { size: 257949696, flags: DEVICE_LOCAL },
    ],
}
@larsjarlvik thank you, this is very helpful!
Here is what I found so far:
gpu-alloc prefers a memory type matching FAST_DEVICE_ACCESS (requested alongside the other hints) over the others, and it doesn't pick the CPU_CACHED type because that wasn't a priority. The reason we request FAST_DEVICE_ACCESS is here:
let map_copy_flags =
    desc.usage & (Bu::MAP_READ | Bu::MAP_WRITE | Bu::COPY_SRC | Bu::COPY_DST);
if map_copy_flags.is_empty() || !(desc.usage - map_copy_flags).is_empty() {
    flags |= Uf::FAST_DEVICE_ACCESS;
}
If I understand correctly, here is the buffer creation in your code:
wgpu::BufferUsage::VERTEX | wgpu::BufferUsage::STORAGE | wgpu::BufferUsage::MAP_READ | wgpu::BufferUsage::COPY_DST,
So you are enabling the MAPPABLE_PRIMARY_BUFFERS feature (which is not portable to the web), and you are working with a buffer that needs to be GPU-visible and also host-readable. Some platforms simply don't have this kind of buffer memory, and that's what you see in the numbers.
My strong recommendation is to stick with the minimal set of requested features, and to be especially careful with native-only features, since they aren't considered or discussed by the WebGPU group. For your case, you'd create the staging buffer for downloading data as a separate buffer from the one you use as STORAGE | VERTEX on the GPU.
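That split could look roughly like the following sketch (wgpu 0.7-era API; names and details are illustrative, and it needs a live Device/Queue, so it is not runnable standalone). The GPU buffer stays device-local, and an explicit copy feeds a dedicated staging buffer:

```rust
// Sketch only: read back a STORAGE | VERTEX buffer via a staging buffer
// instead of mixing MAP_READ into the primary buffer's usage.
async fn read_back(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    storage_buffer: &wgpu::Buffer, // assumed created with VERTEX | STORAGE | COPY_SRC
    size: wgpu::BufferAddress,
) -> Vec<u8> {
    // Staging buffer: only MAP_READ | COPY_DST, so the allocator is free to
    // pick CPU_VISIBLE | CPU_CACHED memory instead of the small
    // DEVICE_LOCAL | CPU_VISIBLE heap.
    let staging = device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("readback-staging"),
        size,
        usage: wgpu::BufferUsage::MAP_READ | wgpu::BufferUsage::COPY_DST,
        mapped_at_creation: false,
    });

    // Make the device-to-host transfer an explicit copy command.
    let mut encoder =
        device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
    encoder.copy_buffer_to_buffer(storage_buffer, 0, &staging, 0, size);
    queue.submit(Some(encoder.finish()));

    // Map the staging buffer and copy its contents out.
    let slice = staging.slice(..);
    let mapping = slice.map_async(wgpu::MapMode::Read);
    device.poll(wgpu::Maintain::Wait);
    mapping.await.expect("failed to map staging buffer");
    let data = slice.get_mapped_range().to_vec();
    staging.unmap();
    data
}
```

With this shape, neither buffer carries the conflicting "fast device access plus host access" combination of hints.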
If I try to think about what else can be done here, nothing particularly interesting comes to mind. Maybe @cwfitzgerald has other ideas.
Thanks for the very detailed response!
Yes, I am using the buffer for both VERTEX and STORAGE, since I need to know the elevations at points in my terrain and I also use it for rendering. It shouldn't be too hard to refactor this, and I think I'll have to no matter what. I will update the thread with the results once it's done.
To clarify, VERTEX | STORAGE is totally fine. It's mixing them with MAP_XX that is not fine.
From what I've read and what I've experienced, there is basically no real reason to use a mappable primary buffer on a discrete GPU. There will need to be a copy either way, so you might as well make it an explicit command. On a UMA system, where MAPPABLE_PRIMARY_BUFFERS is useful, there's only one physical kind of memory, so it doesn't matter which logical one the allocator chooses.
Along those lines, I recommend heeding the warning that is printed in the logs when you enable MAPPABLE_PRIMARY_BUFFERS on a discrete card; it's quite a footgun.
That's a healthy idea! It may need some plumbing, but I agree it would be very useful.
Hi, the performance issue I had is now solved. Huge thanks for all the help @kvark, and sorry if I wasted a bit of your time. The work on wgpu-rs is awesome and I look forward to seeing what's coming :)
Here's the commit that fixed my issue (together with some small cleanups) for reference in case someone else is running into the same issue: https://github.com/larsjarlvik/wgpu-rs/commit/68671ce5ce3cc25c4e6fdea666aa50d6b7f9e7d6
Hi, I'm using a compute shader to calculate heights and normals for my procedural terrain. It returns an array of vertices with a length of 6558721 items, looking like this:
I've tried this on 4 separate machines/configurations, and the time it takes to run varies wildly; here are the results I got:
The code I'm using to read back the data looks like this:
It's the vertices.to_vec() call that is taking time on some machines; the compute shader itself only takes ~500ms. But I suspect it's something under the hood that happens when I call to_vec() that takes the time. I also noticed that the time it takes is linear in the length of the array (mapping half of it takes about half the time).
This is my first project in Rust and wgpu-rs, so my skills at debugging this kind of issue are a bit limited. Does anyone have an idea what's causing this, or can give me some pointers on how to find the culprit?
You can find the full source code here: https://github.com/larsjarlvik/wgpu-rs/blob/performance/src/pipelines/terrain/compute.rs#L156