Open LRLVEC opened 8 months ago
Hi, sorry for replying this late, I've been really busy the last few months. The scan algorithm with decoupled look-back sounds promising. I'll give this a try (hopefully in the next few weeks). Thanks for suggesting this!
There's also a radix sort from the Fuchsia project: https://github.com/juliusikkala/fuchsia_radix_sort
Benchmarks are impressive.
In my test against CUB's device radix sort, this implementation is about 3 times slower for 16<<20 uint32_t elements: about 4 ms vs. 1.3 ms on an RTX 4090.
As far as I know, CUB uses decoupled look-back to speed up the scan operation. Any interest in making this more efficient by switching to the state-of-the-art scan algorithm?
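For anyone following along, here is a rough sketch of what decoupled look-back does, written as a sequential Python simulation rather than GPU code. It is not taken from CUB or from this repository; the names `Status`, `decoupled_lookback_scan`, `tile_size`, and `order` are made up for illustration. Each tile publishes its local sum, then walks backwards over its predecessors, adding aggregates until it reaches a tile whose full inclusive prefix is already known. A real kernel would do this concurrently, spin while a predecessor is still invalid, and need memory fences or packed status/value words, none of which is modeled here.

```python
from enum import Enum
from itertools import accumulate


class Status(Enum):
    INVALID = 0    # tile has published nothing yet
    AGGREGATE = 1  # only the tile-local sum is available
    PREFIX = 2     # inclusive prefix through this tile is available


def decoupled_lookback_scan(data, tile_size, order=None):
    """Exclusive prefix sum of `data`, simulated tile by tile (hypothetical helper)."""
    n_tiles = (len(data) + tile_size - 1) // tile_size
    tiles = [data[t * tile_size:(t + 1) * tile_size] for t in range(n_tiles)]
    status = [Status.INVALID] * n_tiles
    aggregate = [0] * n_tiles
    inclusive_prefix = [0] * n_tiles
    out = [0] * len(data)

    # Step 1: every tile reduces its own elements and publishes the aggregate.
    # Tile 0 has no predecessors, so its aggregate already is its inclusive prefix.
    for t in range(n_tiles):
        aggregate[t] = sum(tiles[t])
        if t == 0:
            inclusive_prefix[0] = aggregate[0]
            status[0] = Status.PREFIX
        else:
            status[t] = Status.AGGREGATE

    # Step 2: tiles finish in an arbitrary order (assumption: `order` is a
    # permutation of the tile indices; default is ascending).
    for t in (order if order is not None else range(n_tiles)):
        exclusive = 0
        p = t - 1
        while p >= 0:
            if status[p] is Status.PREFIX:
                # Predecessor knows its full prefix: add it and stop looking back.
                exclusive += inclusive_prefix[p]
                break
            # On a GPU the tile would spin here while status[p] is INVALID.
            # In this simulation step 1 guarantees at least AGGREGATE.
            assert status[p] is Status.AGGREGATE
            exclusive += aggregate[p]
            p -= 1

        # Publish the inclusive prefix so later look-backs can stop at this tile.
        inclusive_prefix[t] = exclusive + aggregate[t]
        status[t] = Status.PREFIX

        # Step 3: local exclusive scan, seeded with the tile's exclusive prefix.
        running = exclusive
        base = t * tile_size
        for i, x in enumerate(tiles[t]):
            out[base + i] = running
            running += x
    return out


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
    reference = [0] + list(accumulate(data))[:-1]
    # Reverse order forces the last tile to look back across several aggregates.
    assert decoupled_lookback_scan(data, 3, order=[3, 2, 1, 0]) == reference
    assert decoupled_lookback_scan(data, 3) == reference
```

The point of the scheme is that the whole scan runs in a single pass over the data: a tile never waits for a separate global scan kernel, only (briefly) for predecessors that have not published anything yet. In a radix sort the same pattern would apply to the per-digit histogram offsets rather than to a single sum.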