Incorrect result for radix sort

h-a-n-n-e-s commented 10 months ago

@fynv When I run this in Chrome 119 (MacOS 14.1), I get the output shown below. Can this be fixed?

fynv commented 10 months ago

There's a known issue with Mac backends. Actually we cannot use the algorithm here (single-pass prefix-sum). Instead, a multi-pass prefix-sum algorithm has to be used to get correct result, which is slower.

h-a-n-n-e-s commented 10 months ago

Thanks for the fast reply. Now I read about it, too bad that the Metal API doesn't support this! I currently try to implement a parallel hash grid generator using WebGPU...

h-a-n-n-e-s commented 10 months ago

Is there a WebGPU implementation of a multi-pass prefix-sum algorithm publicly available somewhere?

fynv commented 10 months ago

I check-in a "prefix_sum_m.js" which is multi-pass prefix-sum. Need some more work to use it for radix sort.

h-a-n-n-e-s commented 10 months ago

Thanks a lot!

On Tue, Nov 7, 2023 at 18:16 Fei Yang @.***> wrote:

I check-in a "prefix_sum_m.js" which is multi-pass prefix-sum. Need some more work to use it for radix sort.

— Reply to this email directly, view it on GitHub https://github.com/fynv/webgpu_math/issues/1#issuecomment-1798101255, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEY4VLJYSMV6GWUTKHLZHU3YDH37NAVCNFSM6AAAAAA67NGYKOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOJYGEYDCMRVGU . You are receiving this because you authored the thread.Message ID: @.***>

fynv commented 10 months ago

I've just uploaded radix_sort_m.js. This should work Mac now.

h-a-n-n-e-s commented 10 months ago

Thank you! I will check it out tonight.

h-a-n-n-e-s commented 10 months ago

I just did some performance tests using HpcAlgorithms.Sorting.RadixSortLsdUInt32() from here for the cpu case. The result is:

count= 32**3, cpu=0.8ms, gpu=11ms
count= 64**3, cpu=  6ms, gpu=13ms
count=128**3, cpu= 47ms, gpu=36ms

Is this to be expected? Is the slow gpu for a low count due to the JS overhead of creating the pipelines and bind groups etc.? Or is it the inherent overhead of WebGPU talking to the gpu?

fynv commented 10 months ago

Thanks for sharing your result! I haven't done any performance study yet. It is quite frastrating we can only use the multipass prefix-sum, which requires quite a lot of compute-dispatches, and IMO big overhead is somehow expected.

Also be careful of your way of timing GPU..

Context initialization time can be quite significant:

const engine_ctx = new EngineContext();
await engine_ctx.initialize();

GPU -> CPU data transfer is known to have long latency:

await buf_download.mapAsync(GPUMapMode.READ);
let buf = buf_download.getMappedRange();
hOutput.set(new Int32Array(buf));
buf_download.unmap();

Also the allocations and CPU -> GPU data transfer. In real applications, we may only need to allocate once.

fynv commented 10 months ago

I just took a quick look.. It seems timing in WebGPU is tricky. Haven't figured out what is the best practice. ~~There isn't a CPU API for waiting a GPUQueue to finish all tasks~~.

However, I believe things like "await buf_download.mapAsync(GPUMapMode.READ);" do have an implicit syncing effect.

fynv commented 10 months ago

Ah there is: GPUQueue: onSubmittedWorkDone() which I missed. This should be used for timing if you want to exclude the GPU->CPU transfer time

h-a-n-n-e-s commented 10 months ago

Thanks for sharing your insights! Context initialization was already excluded from timing. Now I also excluded GPU -> CPU data transfer. I attached screenshots from where I start and end the timing. The timings change to:

count= 32**3, cpu=0.8ms, gpu=5ms
count= 64**3, cpu=  6ms, gpu=5ms
count=128**3, cpu= 47ms, gpu=5ms

Wow! This shows the remaining CPU -> GPU overhead is essentially dominating! Your actual shader code is very performant!

h-a-n-n-e-s commented 10 months ago

Ah, I see, I'm stupid! :D I will use onSubmittedWorkDone()

h-a-n-n-e-s commented 10 months ago

Here are the timings:

count= 32**3, cpu=0.8ms, gpu= 7ms
count= 64**3, cpu=  6ms, gpu= 8ms
count=128**3, cpu= 47ms, gpu=15ms

So the remaining CPU -> GPU overhead is about 7 ms!

fynv commented 10 months ago

Thanks! I see.

h-a-n-n-e-s commented 10 months ago

I will try to understand a bit more of your implementation and will then incorporate it into my code. Will get back to you once I have new performance results. Thanks so much for your support!!!

fynv commented 10 months ago

FYI, I ported the CUDA sample "smokeParticles" to WebGPU:

https://github.com/fynv/websmoke https://fynv.github.io/websmoke/client/ Here radix-sort is used to sort the particles for correct lighting.

The actual sorting code is slightly modified to adapt to the applicaion: https://github.com/fynv/websmoke/blob/master/client/radix_sort.js

h-a-n-n-e-s commented 10 months ago

Thanks! I get this error:

Screenshot 2023-11-14 at 0 39 03

fynv commented 10 months ago

Try disabling the browser extension "webgpu-devtools"? The error seems to happen at that side.

h-a-n-n-e-s commented 10 months ago

Looks really nice!

fynv / webgpu_math

Incorrect result for radix sort #1