MircoWerner / VkRadixSort

GPU Radix Sort implemented in Vulkan and GLSL.
MIT License

Compared to cub radix sort #2

Open LRLVEC opened 8 months ago

LRLVEC commented 8 months ago

According to my tests against CUB's device radix sort, this implementation is about 3 times slower for 16<<20 (about 16.8 million) uint32_t elements: roughly 4 ms vs. 1.3 ms on an RTX 4090.

As far as I know, CUB uses decoupled look-back to speed up the scan operation. Any interest in making this more efficient by switching to the state-of-the-art scan algorithm?
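
For readers unfamiliar with the technique, here is a minimal CPU sketch of a single-pass exclusive scan with decoupled look-back. It is not code from VkRadixSort or CUB: the function name `exclusive_scan_lookback`, the packed descriptor layout, and the use of `std::thread` workers in place of GPU workgroups are all assumptions made for illustration. Each tile publishes its local sum right away, then walks backwards over its predecessors' descriptors until it finds one that already holds a full inclusive prefix.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical descriptor layout: top 2 bits = status, low 62 bits = value.
constexpr uint64_t kInvalid = 0;    // tile has published nothing yet
constexpr uint64_t kAggregate = 1;  // value = sum of this tile only
constexpr uint64_t kPrefix = 2;     // value = inclusive prefix up to this tile
constexpr uint64_t kValueMask = (uint64_t{1} << 62) - 1;

inline uint64_t pack(uint64_t status, uint64_t value) {
    return (status << 62) | (value & kValueMask);
}

// Exclusive prefix sum over `in`. Assumes tile_size > 0 and num_workers > 0.
std::vector<uint32_t> exclusive_scan_lookback(const std::vector<uint32_t>& in,
                                              std::size_t tile_size,
                                              unsigned num_workers) {
    const std::size_t n = in.size();
    const std::size_t num_tiles = (n + tile_size - 1) / tile_size;
    std::vector<uint32_t> out(n);
    std::vector<std::atomic<uint64_t>> desc(num_tiles);
    for (auto& d : desc) d.store(pack(kInvalid, 0), std::memory_order_relaxed);
    std::atomic<std::size_t> next_tile{0};

    auto worker = [&] {
        for (;;) {
            // Tiles are handed out in order, so every predecessor of a claimed
            // tile is already finished or being processed; the spin below ends.
            const std::size_t tile = next_tile.fetch_add(1);
            if (tile >= num_tiles) return;
            const std::size_t begin = tile * tile_size;
            const std::size_t end = std::min(n, begin + tile_size);

            uint64_t aggregate = 0;
            for (std::size_t i = begin; i < end; ++i) aggregate += in[i];

            uint64_t exclusive = 0;
            if (tile > 0) {
                // Publish the local sum so successors need not wait for us.
                desc[tile].store(pack(kAggregate, aggregate),
                                 std::memory_order_release);
                // Look back: accumulate aggregates until a full prefix is found.
                for (std::size_t p = tile; p-- > 0;) {
                    uint64_t d;
                    while (((d = desc[p].load(std::memory_order_acquire)) >> 62) ==
                           kInvalid) {
                        std::this_thread::yield();
                    }
                    exclusive += d & kValueMask;
                    if ((d >> 62) == kPrefix) break;
                }
            }
            // Upgrade our descriptor to a full inclusive prefix.
            desc[tile].store(pack(kPrefix, exclusive + aggregate),
                             std::memory_order_release);

            // Local scan of the tile, offset by the exclusive prefix.
            uint32_t running = static_cast<uint32_t>(exclusive);
            for (std::size_t i = begin; i < end; ++i) {
                out[i] = running;
                running += in[i];
            }
        }
    };

    std::vector<std::thread> threads;
    for (unsigned w = 0; w < num_workers; ++w) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    return out;
}
```

Calling `exclusive_scan_lookback(histogram, 64, 4)` on a vector of digit counts would return the scatter offsets a radix sort pass needs. On a GPU the tiles would be workgroups and the descriptors a buffer in global memory; the in-order tile assignment via an atomic counter is what keeps the spin-wait from deadlocking there.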

MircoWerner commented 5 months ago

Hi, sorry for replying this late, I've been really busy the last few months. The scan algorithm with decoupled look-back sounds promising. I'll give this a try (hopefully in the next few weeks). Thanks for suggesting this!

ib00 commented 5 months ago

There's also the radix sort from the Fuchsia project: https://github.com/juliusikkala/fuchsia_radix_sort

Benchmarks are impressive.