NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/
Other
1.25k stars 161 forks source link

[BUG]: Low performance of sorting with OMP backend #1244

Open dexter2206 opened 10 months ago

dexter2206 commented 10 months ago

Is this a duplicate?

Type of Bug

Performance

Component

Thrust

Describe the bug

Performance of thrust::sort with OpenMP backend seems to be far worse than even standard std::sort. The code provided in "How to reproduce" section compares timings of sorting with std and thrust. I tested it on two machines, one with AMD Ryzen 5950X and 32GB DDR4 memory, and the other with Intel i7-1365U and 32GB DDR5 memory, both on Fedora 38 operating system. In the first case, std::sort manages to sort 100 000 000 floats in 32s and thrust performs the same task in ~87s. In the second case, the timings are ~31s and ~117s respectively. Looking at htop it seems that thrust indeed uses some parallelization as all cores being used.

How to Reproduce

  1. Save the following file as sorting_comparison.cpp:
    
    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <random>
    #include <vector>

include <thrust/sort.h>

include <thrust/device_vector.h>

using namespace std; using namespace std::chrono;

const int VEC_SIZE = 100'000'000;

vector random_vector(int size) { vector data(size); random_device rnd_device; mt19937 mersenne_engine{rnd_device()}; uniform_real_distribution dist{-10.0f, 10.0f};

generate(begin(data), end(data),
         [&dist, &mersenne_engine]() { return dist(mersenne_engine); });

return data;

}

int main() { vector data = random_vector(VEC_SIZE); thrust::device_vector data_for_thrust(data);

auto start = high_resolution_clock::now();
sort(begin(data), end(data));
auto stop = high_resolution_clock::now();
cout << "std took: " << duration_cast<seconds>(stop-start).count() << "s" << endl;

start = high_resolution_clock::now();
thrust::sort(begin(data_for_thrust), end(data_for_thrust));
stop = high_resolution_clock::now();
cout << "thrust took: " << duration_cast<seconds>(stop-start).count() << "s" << endl;

return 0;

}

2. Make sure thrust is available in include path and compile the file with:
   ```bash
   g++ -fopenmp -lgomp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP sorting_comparison.cpp -o sorting-comparison
  1. Run ./sorting-comparison, it should print timings for std::sort and thrust::sort to stdout.

Expected behavior

I would expect thrust::sort to be at least as fast as std::sort.

Reproduction link

No response

Operating System

Fedora 38

nvidia-smi output

N/A

NVCC version

N/A

dexter2206 commented 10 months ago

By the way, I just checked with clang++ and libomp instead of g++ and libgomp, so the problem seems to not be tied to a specific compiler/openmp implementation.

gevtushenko commented 10 months ago

Hello @dexter2206 and thank you for reporting the issue!

I can reproduce it on a Threadripper:

std took: 34s
thrust took: 80s

Looking at the implementation, the issue is related to the reduction of the number of threads that perform merging. Only 3 seconds out of 80 are spent in the sorting of thread partitions. At the final step, merging is done serially by a single thread.

I think we should experiment with merge path approach, so that all threads are utilized during the merge phases. Apart from that, we should give radix sort a try.