NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/
Other
1.22k stars 156 forks source link

Add missing hard algorithms #685

Open jaredhoberock opened 12 years ago

jaredhoberock commented 12 years ago

We're missing these algorithms (which look hard);

partial_sort partial_sort_copy includes nth_element

Forwarded from http://code.google.com/p/thrust/issues/detail?id=423

dkolsen-pgi commented 3 years ago

NVC++ needs implementations of these algorithms.

bdice commented 2 years ago

Implementing a partial_sort algorithm would be one route for RAPIDS libcudf to support "top K" queries as described here: https://github.com/dask-contrib/dask-sql/issues/818. The current implementation of nsmallest / nlargest is to perform a full sort of indices, slice those indices to the desired subset, and gather the results. A partial sort could theoretically reduce the work of the full sort, but it is probably hard to do work-reduction when also computing in parallel.

Also cc-ing @mythrocks for interest in nth_element.

upsj commented 2 years ago

I implemented a version of nth_element a while ago (paper, code) whose primitives would also be relevant for partial_sort. Some notes on how I would approach this:

On the individual kernels:

For arbitrarily complex user-provided types or comparators, most of the optimizations in here will probably not work well, since they rely heavily on fast shared memory. Also the randomized nature of the algorithm is a bit of a challenge. For even more randomization, we can also pick only two splitters from the sample that enclose the sample quantile we are looking for, and copy only those elements inside via a tripartition. The resulting algorithm may be much faster, but also has a worse worst-case complexity. We could of course always fall back to a full sorting algorithm if the recursion doesn't sufficiently reduce the input size.