Open dexter2206 opened 10 months ago
By the way, I just checked with clang++
and libomp
instead of g++
and libgomp
, so the problem seems to not be tied to a specific compiler/openmp implementation.
Hello @dexter2206 and thank you for reporting the issue!
I can reproduce it on a Threadripper:
std took: 34s
thrust took: 80s
Looking at the implementation, the issue is related to the reduction of the number of threads that perform merging. Only 3 seconds out of 80 are spent in the sorting of thread partitions. At the final step, merging is done serially by a single thread.
I think we should experiment with merge path approach, so that all threads are utilized during the merge phases. Apart from that, we should give radix sort a try.
Is this a duplicate?
Type of Bug
Performance
Component
Thrust
Describe the bug
Performance of
thrust::sort
with OpenMP backend seems to be far worse than even standardstd::sort
. The code provided in "How to reproduce" section compares timings of sorting withstd
andthrust
. I tested it on two machines, one with AMD Ryzen 5950X and 32GB DDR4 memory, and the other with Intel i7-1365U and 32GB DDR5 memory, both on Fedora 38 operating system. In the first case,std::sort
manages to sort100 000 000
floats in 32s and thrust performs the same task in ~87s. In the second case, the timings are ~31s and ~117s respectively. Looking at htop it seems that thrust indeed uses some parallelization as all cores being used.How to Reproduce
sorting_comparison.cpp
:include <thrust/sort.h>
include <thrust/device_vector.h>
using namespace std; using namespace std::chrono;
const int VEC_SIZE = 100'000'000;
vector random_vector(int size) {
vector data(size);
random_device rnd_device;
mt19937 mersenne_engine{rnd_device()};
uniform_real_distribution dist{-10.0f, 10.0f};
}
int main() { vector data = random_vector(VEC_SIZE);
thrust::device_vector data_for_thrust(data);
}
./sorting-comparison
, it should print timings forstd::sort
andthrust::sort
to stdout.Expected behavior
I would expect
thrust::sort
to be at least as fast asstd::sort
.Reproduction link
No response
Operating System
Fedora 38
nvidia-smi output
N/A
NVCC version
N/A