I would go for a data- and read-inefficient, static parallelization, though it might not be faster in the end:
Each block updates only one data point in the output.
The stencil is loaded into shared memory.
Sorting is performed cooperatively by the block's threads.
The data to sort should be twice as large as the number of threads used for sorting.
Thus, initially each thread loads two data values into shared memory.
Each thread selects two adjacent data points and compares them.
It swaps the data points if the larger one has the lower index.
If a swap occurred, the thread writes True to a shared bool flag (this can be atomic, but does not have to be).
If the flag is true, it is reset to false and the procedure is repeated with the threads' reads shifted by one, until the shared memory is sorted and the flag stays false.
Once the flag stays false, the thread with id 0 picks the median value and writes it to the global output.
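To make the procedure concrete, here is a hypothetical CPU model of it in plain Python (no Numba, no GPU): the loop over `t` stands in for the block's threads, a local `swapped` variable stands in for the shared bool flag, and the pairing offset is shifted by one each pass. One detail worth noting: a single pass without swaps is not enough to conclude the data is sorted, since the other pairing parity may still hold an out-of-order pair, so this sketch requires one quiet even pass and one quiet odd pass in a row before stopping.

```python
def block_sort_median(window, n_threads):
    """Model of the per-block sort: odd-even compare-swaps with a
    convergence flag, then the median of the sorted window."""
    data = list(window)                  # stand-in for shared memory
    assert len(data) == 2 * n_threads    # each thread loaded two values
    offset = 0                           # 0 = even pairing, 1 = shifted (odd) pairing
    quiet_passes = 0                     # consecutive passes with no swap
    while quiet_passes < 2:              # need both parities to be swap-free
        swapped = False                  # the "shared bool flag"
        for t in range(n_threads):       # one iteration models one thread
            i = 2 * t + offset
            if i + 1 < len(data) and data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True           # a swap occurred -> flag goes true
        quiet_passes = 0 if swapped else quiet_passes + 1
        offset ^= 1                      # shift the reads for the next pass
    # "thread 0" picks the median; for an even-length window this
    # convention takes the upper of the two middle elements
    return data[len(data) // 2]
```

On a GPU the pass loop would sit inside the kernel with a `syncthreads`-style barrier between passes, and the flag would live in shared memory.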
Hi,
We are trying to implement a parallel sorting algorithm to improve the median filter.
So each thread needs to sort an array, and if we want those sorts to run in parallel, each thread would have to launch a sub-kernel to sort.
neither quicksort: https://stackoverflow.com/questions/14068263/cuda-quicksort-in-kernel-recall
nor odd-even sort: https://devtalk.nvidia.com/default/topic/394362/faulty-sort-algorithm-please-help-33-odd-even-sort/B
Both need to launch sub-kernels; this capability is called "Dynamic Parallelism".
But unfortunately, Numba doesn't support this feature: https://numba.pydata.org/numba-doc/dev/cuda/kernels.html#kernel-declaration
So is there any other idea for a parallel sorting algorithm?