NVIDIA / cccl

CUDA C++ Core Libraries
Other
939 stars 116 forks source link

Quickselect algorithm to select kth smallest element #931

Open mfbalin opened 1 year ago

mfbalin commented 1 year ago

I think it would be valuable for the users (at least for me) to have a quickselect algorithm (or any alternative algorithm) to find the kth smallest element in a given array (and a segmented version). We can currently achieve this by using sort and segmented sort but the sorting cost is likely higher than determining the kth smallest element using a more specific algorithm for the task.

gevtushenko commented 1 year ago

@mfbalin thank you for reaching out to us with your suggestion!

I agree that it'd be a valuable addition to CUB. At some point, I was going to give the following idea a try. We would be more than happy to accept this functionality as a contribution.

mfbalin commented 1 year ago

@senior-zero It looks like the algorithm in the paper you linked is analogous to std::partial_sort. What I had in mind is actually std::nth_element. Unless Dr. Topk is the current SOTA algorithm for the nth_element problem as well.

EDIT: I am actually not quite sure about whether Dr. Topk provides its output in a sorted or arbitrary manner. EDIT2: Looks like Dr. Top-k performs nth_element after which one can use a sort to get sorted output.

mfbalin commented 1 year ago

Some weighted sampling algorithms would benefit from this algorithm as they require picking the largest n elements from an array.

elstehle commented 1 year ago

As cross-reference, we have a related issue in thrust: https://github.com/NVIDIA/cccl/issues/685

Some related work:

Just to also throw my hat in the ring; I think we should also consider a partition-based radix sort, like the one outlined here. After each partitioning pass, the sort will only continue partitioning the bins that are actually relevant. This paper has shown that such an approach would often outperform the specialized top-k implementation for k larger than 256. The benefit is that we can implement many algorithms, like partial_sort, top-k, nth-element, and multi-GPU sorting using just one implementation with specialized code paths.

wence- commented 3 months ago

Just to add some more API requests into the mix. From the libcudf side of things, we would want to do this on non fixed-width data types (e.g. string columns) so would probably want _by_key variants of the algorithms to select the indices that, when gathered, produce the top-k entries. Segmented variants as well would be delightful.

mfbalin commented 2 months ago

@elstehle When do you estimate CCCL will have a working implementation of the segmented version? And how much speedup over segmented sorting do you expect if k <= 30 and segment sizes vary between [1-1000], possibly with a few even larger segments? (Given 10000-100000 segments)

mfbalin commented 2 months ago

@elstehle I have a workload that directly depends on this feature. I could help with this effort in my free time if it can help speed up the process.

elstehle commented 2 months ago

Thank you for your interest in the algorithm and for offering to contribute, @mfbalin. Top-k is definitely high up on our list of things to work on and work is slated to start in early May.

And how much speedup over segmented sorting do you expect if k <= 30 and segment sizes vary between [1-1000], possibly with a few even larger segments? (Given 10000-100000 segments)

This sounds like you are interested in a very specific workload, with segments small enough to fit into on-chip memory (registers / shared memory). The solution devised for such a workload will look quite different from one where you are running top-k on a single large input. I strongly assume that our initial top-k implementation will focus and optimize for a single large input, where the performance gains mostly come from savings in memory traffic. I assume that relative performance gains for the workload you outlined won't be as drastic compared to the relative performacne gains that we will see for large inputs. I think that we will get to optimizing for workloads with small segments that fit into on-chip memory only at a later point.

If you're keen on contributing while, at the same time, taking steps to enabling your use case, I suggest starting by experimenting with and evaluating different approaches for handling small segments. Sharing your findings with us would be a valuable first step.