Closed dbronnik closed 3 years ago
Thank you for filing! Could you share a repro case, or record an API trace? The next step would be capturing with Xcode and seeing what's going on there.
// Here's the case. "n" is set to 256 million elements, so 1.28 GB buffers.
// If the GLSL code isn't submitted/executed at all, the elapsed time is unchanged.
// Thank you for looking at the issue.
import glslangModule from "https://unpkg.com/@webgpu/glslang@0.0.15/dist/web-devel/glslang.js";
export async function square() {
let m = 256000;
let n = m * 1000;
let srcBuffer = new ArrayBuffer(n * 4);
let src = new Float32Array(srcBuffer);
for (let ii = 0; ii < n; ++ii) { src[ii] = ii; }
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();
let t1 = Date.now();
const gpuSrcBuffer = device.createBuffer({ mappedAtCreation: true, size: src.byteLength, usage: GPUBufferUsage.STORAGE });
console.log("createBuffer(mapped) " + (Date.now() - t1));
t1 = Date.now();
const gpuSrcArrayBuffer = gpuSrcBuffer.getMappedRange();
new Float32Array(gpuSrcArrayBuffer).set(src);
gpuSrcBuffer.unmap();
const gpuTgtBuffer = device.createBuffer({ size: src.byteLength, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC });
const bindGroupLayout = device.createBindGroupLayout({
  entries: [
    { binding: 0, visibility: GPUShaderStage.COMPUTE, type: "readonly-storage-buffer" },
    { binding: 1, visibility: GPUShaderStage.COMPUTE, type: "storage-buffer" }
  ]
});
const bindGroup = device.createBindGroup({
  layout: bindGroupLayout,
  entries: [
    { binding: 0, resource: { buffer: gpuSrcBuffer } },
    { binding: 1, resource: { buffer: gpuTgtBuffer } }
  ]
});
const computeShaderCode = `#version 450
layout(std430, set = 0, binding = 0) readonly buffer Buf0 {
float data[];
} buf0;
layout(std430, set = 0, binding = 1) buffer Buf1 {
float data[];
} buf1;
void main() {
uint offset = gl_GlobalInvocationID.x * 1000;
for (int ii = 0; ii < 1000; ii++) {
buf1.data[offset + ii] =
buf0.data[offset + ii] * buf0.data[offset + ii];
}
}
`;
const glslang = await glslangModule();
const computePipeline = device.createComputePipeline({
layout: device.createPipelineLayout({
bindGroupLayouts: [bindGroupLayout]
}),
computeStage: {
module: device.createShaderModule({
code: glslang.compileGLSL(computeShaderCode, "compute")
}),
entryPoint: "main"
}
});
t1 = Date.now();
const commandEncoder = device.createCommandEncoder();
const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(computePipeline);
passEncoder.setBindGroup(0, bindGroup);
passEncoder.dispatch(m);
passEncoder.endPass();
const gpuReadBuffer = device.createBuffer({ size: src.byteLength, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ });
commandEncoder.copyBufferToBuffer(gpuTgtBuffer, 0, gpuReadBuffer, 0, src.byteLength);
const gpuCommands = commandEncoder.finish();
device.defaultQueue.submit([gpuCommands]);
await gpuReadBuffer.mapAsync(GPUMapMode.READ);
const arrayBuffer = gpuReadBuffer.getMappedRange();
let tgt = new Float32Array(arrayBuffer);
console.log("elapsed " + (Date.now() - t1));
console.log("m = " + m);
console.log("n = " + n);
console.log(tgt);
}
square();
Ah, interesting. So you are filing a bug about Firefox's performance specifically, not wgpu directly. The thing about browsers is that they need to take extra steps because of the process separation between the content and GPU processes. So there is shared memory involved, and generally more copies of the data taking place. At the moment we aren't prioritizing optimizing this - there are more important things on the table.
For example, today in Firefox, when you are using mappedAtCreation, you'd have the following copies:
It's a similar picture for reading the data back. So between uploading and then downloading the data, there are maybe ~6 copies today, which comes very close to explaining the ~1 GB/s.
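As a rough back-of-the-envelope model (an illustration, not a measurement of Firefox - the ~6 GB/s memcpy rate is an assumed figure): if each copy is serialized and runs at the raw memcpy rate, every extra copy divides the end-to-end throughput:

```javascript
// Toy model: N serialized copies, each running at the raw memcpy rate,
// give 1/N of that rate end-to-end.
function effectiveRateGBps(memcpyGBps, copies) {
  return memcpyGBps / copies;
}

// e.g. an assumed ~6 GB/s memcpy with ~6 copies lands at ~1 GB/s
console.log(effectiveRateGBps(6, 6) + " GB/s");
```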
Thanks for looking into it. Is this going to be fixed at some point, and how many copies do you expect to remain? A single-threaded memcpy on 2667 MHz DDR4 seems to run slower than the theoretical limit of PCIe3 x16.
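For reference, the theoretical ceilings being compared here (standard figures from the PCIe 3.0 and DDR4 specs, not measurements from this machine):

```javascript
// PCIe 3.0 x16: 8 GT/s per lane, 16 lanes, 128b/130b encoding, 8 bits/byte
const pcie3x16GBps = 8e9 * 16 * (128 / 130) / 8 / 1e9; // ~15.75 GB/s
// DDR4-2667, single channel: 2667 MT/s * 8 bytes per transfer
const ddr4GBps = 2667e6 * 8 / 1e9;                     // ~21.3 GB/s
// A memcpy both reads and writes through the same memory bus, so its
// one-directional rate is roughly half the channel bandwidth.
console.log(pcie3x16GBps.toFixed(2), ddr4GBps.toFixed(2));
```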
createBuffer(mappedAtCreation, size) doesn't receive any data at all, yet it takes 3x more time than the following Float32Array(buffer).set(), which is the actual memcpy. What does createBuffer copy?
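A plain-JS micro-benchmark (hypothetical, scaled down from the repro's 256M elements) can separate the allocation cost from the copy cost in the same way:

```javascript
// Hypothetical baseline: measure typed-array allocation and the raw
// memcpy (set) separately, mirroring the createBuffer-vs-set split above.
const n = 16_000_000; // 64 MB, smaller than the repro for quick runs
const src = new Float32Array(n);
for (let i = 0; i < n; ++i) src[i] = i;

let t = Date.now();
const dst = new Float32Array(n); // allocation only; pages may be committed lazily
const allocMs = Date.now() - t;

t = Date.now();
dst.set(src);                    // the actual memcpy
const copyMs = Date.now() - t;

console.log("alloc " + allocMs + "ms, copy " + copyMs + "ms");
```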
I'm using Canary, but Firefox is just as important.
createBuffer is where the shared memory is allocated, so it's expected to take time.
If you are using Canary, it's twice as strange, since you aren't interacting with wgpu in any way... so why file an issue here at all?
Looks like we don't have anything to do here – closing
I've tried to follow the JS example at https://developers.google.com/web/updates/2019/08/get-started-with-gpu-compute-on-the-web to just copy data in and out without doing any compute. One input buffer and one output buffer. I started increasing the size of the buffers to hundreds of MB and, from some point on, the time spent started to grow linearly with the size of the buffers at an effective rate of ~1 GB/s. For example, with 1.28 GB buffers (two of them, so 2.56 GB worth of data) it took 2.2 seconds. Is there a better way to do it if all I want is to upload/download data, or is there a performance problem with mapped buffers?
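One alternative worth trying is uploading via the queue's writeBuffer instead of mapping. This is a sketch assuming the same API revision as the snippet above (defaultQueue); whether the browser actually performs fewer copies this way is implementation-dependent, and the function name here is made up for illustration:

```javascript
// Sketch: upload through writeBuffer rather than mappedAtCreation.
// Fewer internal copies are not guaranteed - it just shifts the staging
// to the browser's queue implementation.
async function uploadViaWriteBuffer(device, src /* Float32Array */) {
  const buf = device.createBuffer({
    size: src.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  device.defaultQueue.writeBuffer(buf, 0, src);
  return buf;
}
```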
The numbers above exclude the cost of createBuffer({mappedAtCreation: true, size: 1.28GB, usage: STORAGE}); that call alone takes 400ms.
I tried this on a 13" MacBook Pro with an Intel Iris, a 16" MacBook Pro with a Radeon, and a Windows desktop with an Nvidia GPU. Similar results on all of them.
Here are the CPU<->GPU transfer rates from an OpenCL benchmark on the 13" MacBook: