Open h-a-n-n-e-s opened 10 months ago
There's a known issue with Mac backends. Actually we cannot use the algorithm here (single-pass prefix-sum). Instead, a multi-pass prefix-sum algorithm has to be used to get correct result, which is slower.
Thanks for the fast reply. Now I read about it, too bad that the Metal API doesn't support this! I currently try to implement a parallel hash grid generator using WebGPU...
Is there a WebGPU implementation of a multi-pass prefix-sum algorithm publicly available somewhere?
I check-in a "prefix_sum_m.js" which is multi-pass prefix-sum. Need some more work to use it for radix sort.
Thanks a lot!
On Tue, Nov 7, 2023 at 18:16 Fei Yang @.***> wrote:
I check-in a "prefix_sum_m.js" which is multi-pass prefix-sum. Need some more work to use it for radix sort.
— Reply to this email directly, view it on GitHub https://github.com/fynv/webgpu_math/issues/1#issuecomment-1798101255, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEY4VLJYSMV6GWUTKHLZHU3YDH37NAVCNFSM6AAAAAA67NGYKOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOJYGEYDCMRVGU . You are receiving this because you authored the thread.Message ID: @.***>
I've just uploaded radix_sort_m.js. This should work Mac now.
Thank you! I will check it out tonight.
I just did some performance tests using HpcAlgorithms.Sorting.RadixSortLsdUInt32()
from here for the cpu case. The result is:
count= 32**3, cpu=0.8ms, gpu=11ms
count= 64**3, cpu= 6ms, gpu=13ms
count=128**3, cpu= 47ms, gpu=36ms
Is this to be expected? Is the slow gpu for a low count due to the JS overhead of creating the pipelines and bind groups etc.? Or is it the inherent overhead of WebGPU talking to the gpu?
Thanks for sharing your result! I haven't done any performance study yet. It is quite frastrating we can only use the multipass prefix-sum, which requires quite a lot of compute-dispatches, and IMO big overhead is somehow expected.
Also be careful of your way of timing GPU..
Context initialization time can be quite significant:
const engine_ctx = new EngineContext();
await engine_ctx.initialize();
GPU -> CPU data transfer is known to have long latency:
await buf_download.mapAsync(GPUMapMode.READ);
let buf = buf_download.getMappedRange();
hOutput.set(new Int32Array(buf));
buf_download.unmap();
Also the allocations and CPU -> GPU data transfer. In real applications, we may only need to allocate once.
I just took a quick look.. It seems timing in WebGPU is tricky.
Haven't figured out what is the best practice.
There isn't a CPU API for waiting a GPUQueue to finish all tasks.
However, I believe things like "await buf_download.mapAsync(GPUMapMode.READ);" do have an implicit syncing effect.
Ah there is: GPUQueue: onSubmittedWorkDone() which I missed. This should be used for timing if you want to exclude the GPU->CPU transfer time
Thanks for sharing your insights! Context initialization was already excluded from timing. Now I also excluded GPU -> CPU data transfer. I attached screenshots from where I start and end the timing. The timings change to:
count= 32**3, cpu=0.8ms, gpu=5ms
count= 64**3, cpu= 6ms, gpu=5ms
count=128**3, cpu= 47ms, gpu=5ms
Wow! This shows the remaining CPU -> GPU overhead is essentially dominating! Your actual shader code is very performant!
Ah, I see, I'm stupid! :D
I will use
onSubmittedWorkDone()
Here are the timings:
count= 32**3, cpu=0.8ms, gpu= 7ms
count= 64**3, cpu= 6ms, gpu= 8ms
count=128**3, cpu= 47ms, gpu=15ms
So the remaining CPU -> GPU overhead is about 7 ms!
Thanks! I see.
I will try to understand a bit more of your implementation and will then incorporate it into my code. Will get back to you once I have new performance results. Thanks so much for your support!!!
FYI, I ported the CUDA sample "smokeParticles" to WebGPU:
https://github.com/fynv/websmoke https://fynv.github.io/websmoke/client/ Here radix-sort is used to sort the particles for correct lighting.
The actual sorting code is slightly modified to adapt to the applicaion: https://github.com/fynv/websmoke/blob/master/client/radix_sort.js
Thanks! I get this error:
Try disabling the browser extension "webgpu-devtools"? The error seems to happen at that side.
Looks really nice!
@fynv When I run this in Chrome 119 (MacOS 14.1), I get the output shown below. Can this be fixed?