gfx-rs / wgpu

A cross-platform, safe, pure-Rust graphics API.
https://wgpu.rs
Apache License 2.0

Performance of GPUBufferUsage.STORAGE buffers #945

Closed dbronnik closed 3 years ago

dbronnik commented 4 years ago

I've tried to follow the JS example at https://developers.google.com/web/updates/2019/08/get-started-with-gpu-compute-on-the-web to just copy data in and out without doing any compute. One input buffer and one output buffer. I've started increasing the size of the buffers to hundreds of MB and, from some point on, the time spent started to grow linearly with the size of the buffers at an effective rate of ~1GB/s. For example, with 1.28GB buffers (two of them, so 2.56GB worth of data) it took 2.2 seconds. Is there a better way to do this if all I want is to upload/download data, or is there a performance problem with mapped buffers?

The above numbers exclude the cost of createBuffer({mappedAtCreation: true, size: 1.28GB, usage: STORAGE}). That call alone takes 400ms.

I tried this on a MacBook Pro 13" with an Intel Iris, a MacBook Pro 16" with a Radeon, and a Windows desktop with an Nvidia GPU. Similar results on all of them.

Here are the CPU<->GPU transfer rates from an OpenCL benchmark on the MacBook Pro 13":

Transfer bandwidth (GBPS)
  enqueueWriteBuffer              : 8.56
  enqueueReadBuffer               : 7.89
  enqueueWriteBuffer non-blocking : 8.20
  enqueueReadBuffer non-blocking  : 8.47
  enqueueMapBuffer(for read)      : 53419.99
    memcpy from mapped ptr        : 8.13
  enqueueUnmap(after write)       : 11374.38
    memcpy to mapped ptr          : 8.24
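For comparison, the WebGPU numbers above can be turned into an effective transfer rate (a back-of-the-envelope calculation using only the figures already quoted; "GB" here means 10^9 bytes):

```javascript
// Effective end-to-end rate implied by the measurement above.
const bufferBytes = 1.28e9;          // one buffer
const totalBytes = 2 * bufferBytes;  // upload + download
const elapsedSeconds = 2.2;          // observed wall-clock time
const effectiveGBps = totalBytes / elapsedSeconds / 1e9;
console.log(effectiveGBps.toFixed(2) + " GB/s"); // ~1.16 GB/s, vs ~8 GB/s for an OpenCL memcpy
```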
kvark commented 4 years ago

Thank you for filing! Could you share a repro case, or record an API trace? The next step would be capturing with Xcode and seeing what's going on there.

dbronnik commented 4 years ago

// Here's the case. "n" is set to 256 million elements, so 1.28GB buffers.
// If the GLSL code isn't submitted/executed at all, the elapsed time is unchanged.
// Thank you for looking at the issue.

import glslangModule from "https://unpkg.com/@webgpu/glslang@0.0.15/dist/web-devel/glslang.js";

export async function square() {

let m = 256000;
let n = m * 1000;
let srcBuffer = new ArrayBuffer(n * 4);
let src = new Float32Array(srcBuffer);

for (let ii = 0; ii < n; ++ii) { src[ii] = ii; }

const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();

let t1 = Date.now();

const gpuSrcBuffer = device.createBuffer({
  mappedAtCreation: true,
  size: src.byteLength,
  usage: GPUBufferUsage.STORAGE
});

console.log("createBuffer(mapped) " + (Date.now() - t1));
t1 = Date.now();

const gpuSrcArrayBuffer = gpuSrcBuffer.getMappedRange();
new Float32Array(gpuSrcArrayBuffer).set(src);
gpuSrcBuffer.unmap();

const gpuTgtBuffer = device.createBuffer({
  size: src.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC
});

const bindGroupLayout = device.createBindGroupLayout({
  entries: [
    { binding: 0, visibility: GPUShaderStage.COMPUTE, type: "readonly-storage-buffer" },
    { binding: 1, visibility: GPUShaderStage.COMPUTE, type: "storage-buffer" }
  ]
});

const bindGroup = device.createBindGroup({
  layout: bindGroupLayout,
  entries: [
    { binding: 0, resource: { buffer: gpuSrcBuffer } },
    { binding: 1, resource: { buffer: gpuTgtBuffer } }
  ]
});

const computeShaderCode = `#version 450

layout(std430, set = 0, binding = 0) readonly buffer Buf0 {
    float data[];
} buf0;

layout(std430, set = 0, binding = 1) buffer Buf1 {
    float data[];
} buf1;

void main() {
  uint offset = gl_GlobalInvocationID.x * 1000;
  for (int ii = 0; ii < 1000; ii++) {
    buf1.data[offset + ii] =
      buf0.data[offset + ii] * buf0.data[offset + ii];
  }
}

`;

const glslang = await glslangModule();
const computePipeline = device.createComputePipeline({
  layout: device.createPipelineLayout({ bindGroupLayouts: [bindGroupLayout] }),
  computeStage: {
    module: device.createShaderModule({
      code: glslang.compileGLSL(computeShaderCode, "compute")
    }),
    entryPoint: "main"
  }
});

t1 = Date.now();

const commandEncoder = device.createCommandEncoder();

const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(computePipeline);
passEncoder.setBindGroup(0, bindGroup);
passEncoder.dispatch(m);
passEncoder.endPass();

const gpuReadBuffer = device.createBuffer({
  size: src.byteLength,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ
});

commandEncoder.copyBufferToBuffer(gpuTgtBuffer, 0, gpuReadBuffer, 0, src.byteLength);

const gpuCommands = commandEncoder.finish();
device.defaultQueue.submit([gpuCommands]);

await gpuReadBuffer.mapAsync(GPUMapMode.READ);
const arrayBuffer = gpuReadBuffer.getMappedRange();

let tgt = new Float32Array(arrayBuffer);

console.log("elapsed " + (Date.now() - t1));

console.log("m = " + m);
console.log("n = " + n);
console.log(tgt);
}

square();

kvark commented 4 years ago

Ah, interesting. So you are filing a bug about Firefox's performance specifically, not wgpu directly. The thing about browsers is that they need to take extra steps because of the process separation between content and GPU. So there is shared memory involved, and generally more copies of data taking place. At this moment, we aren't prioritizing optimizing this - there are more important things on the table.

For example, today in Firefox, when you are using mappedAtCreation, you'd have the following copies:

  1. you writing into the mapped region, which is shared memory
  2. shared memory copied into CPU-visible driver staging memory
  3. staging memory copied into GPU-local memory

A similar picture applies for reading the data back. So by uploading and then downloading the data, there are maybe ~6 copies today, which comes very close to explaining the 1 GB/s.
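As a rough sanity check (my arithmetic, not from the thread), dividing the ~8 GB/s single-copy memcpy rate from the OpenCL benchmark above by six sequential copies lands close to the observed throughput, ignoring any overlap between the copies:

```javascript
// Rough model: each extra copy of the data costs one full memcpy pass, so the
// effective end-to-end rate is the single-copy rate divided by the copy count.
const singleCopyGBps = 8; // memcpy rate from the OpenCL benchmark above
const copies = 6;         // ~3 on upload + ~3 on readback
const effectiveGBps = singleCopyGBps / copies;
console.log(effectiveGBps.toFixed(2) + " GB/s"); // ~1.33 GB/s, in the ballpark of the observed ~1 GB/s
```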

dbronnik commented 4 years ago

Thanks for looking into it. Is this going to be fixed at some point, and how many copies do you expect to remain? A single-threaded memcpy on 2667 MHz DDR4 seems to run slower than the theoretical limit of PCIe3 x16.
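For context (my arithmetic, not part of the thread), the theoretical peaks behind that comparison work out as follows. Note a memcpy both reads and writes, so an 8 GB/s copy rate actually touches ~16 GB/s of DRAM bandwidth, which is why one thread can fall short of saturating the PCIe link:

```javascript
// Theoretical peak bandwidths, in GB/s (10^9 bytes/s).
// DDR4-2667: 2667 MT/s * 8 bytes per transfer, per channel.
const ddr4GBps = 2667e6 * 8 / 1e9;                     // ~21.3 GB/s per channel
// PCIe 3.0 x16: 8 GT/s per lane * 16 lanes, 128b/130b encoding, 8 bits/byte.
const pcie3x16GBps = 8e9 * 16 * (128 / 130) / 8 / 1e9; // ~15.75 GB/s per direction
console.log(ddr4GBps.toFixed(1), pcie3x16GBps.toFixed(2));
```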

createBuffer(mappedAtCreation, size) doesn't receive any data at all, yet it takes 3x more time than the following Float32Array(buffer).set(), which is the actual memcpy. What does createBuffer copy?

I'm using Canary, but Firefox is just as important.

kvark commented 4 years ago

createBuffer is where the shared memory is allocated; it's supposed to take time. If you are using Canary, it's twice as strange, since you aren't interacting with wgpu in any way... so why file an issue here at all?
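A quick way to see that a large zeroed allocation alone is not free (a sketch under my own assumptions: plain Node.js, a 256 MiB size chosen to keep it fast, and new ArrayBuffer plus a first-touch fill standing in for whatever shared-memory allocation the browser actually performs):

```javascript
// Time a large zero-filled allocation plus a first-touch pass, loosely
// analogous to backing createBuffer({ mappedAtCreation: true }) with
// freshly committed shared memory.
const SIZE = 256 * 1024 * 1024; // 256 MiB
const t0 = Date.now();
const backing = new ArrayBuffer(SIZE); // zeroed per the ECMAScript spec
new Uint8Array(backing).fill(0);       // force the OS to commit the pages
console.log("alloc + touch: " + (Date.now() - t0) + " ms");
```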

grovesNL commented 3 years ago

Looks like we don't have anything to do here – closing