`thrust::all_of` is slower than a naive reduction

jrhemstad commented 5 years ago

In a thrust::all_of, when the first element that violates the predicate is discovered, the computation can be aborted, i.e., an "early exit".

For example, imagine you are given a thrust::device_vector<int64_t> and want to check if any of the values are negative. You could do this with a thrust::all_of or with a thrust::count_if:

thrust::device_vector<int64_t> values(...);
bool all_positive = thrust::all_of(values.begin(), values.end(), [](auto v){return v > 0;});
bool all_positive = values.size() == thrust::count_if(values.begin(), values.end(), [](auto v){return v > 0;});

count_if must read everything in values, whereas all_of can shortcut if an early exit exists. Therefore, I would expect all_of to out perform count_if when one or more negative values exist. If no negative values are present, then both all_of and count_if must read everything in values and I would expect their performance to be roughly equivalent.

However, this is not the case. I have found that the performance of thrust::all_of is extremely erratic with a 10x difference between the best and worst performance. Furthermore, an all_of is always slower than a naive reduction as in count_if.

Here are the results of performing 100 trials of the example I described above on an input size of 100,000,000 million int64_t elements on a GV100.

No Early Exit

	mean (us)	min (us)	max (us)
`all_of`	75269	56403	104922
`count_if`	3124	1686	4413

Single Early Exit

	mean (us)	min (us)	max (us)
`all_of`	51620	9346	370845
`count_if`	3100	1703	5158

As you can see, whether or not an early exit exists, all_of is always significantly slower than a count_if.

Looking at the profile of all_of (attached), it appears that the reason it is so slow is because a single invocation of all_of results in ~50 invocations of DeviceReduceKernel. I suspect this is because the implementation of all_of does a set of batched reductions in attempt to avoid reading the entire input when an early exit exists. However, launching all of these small kernels (each with their own allocation/free) results in a significant amount of overhead. This overhead is exacerbated by the fact that each batch is executed on the same stream, meaning there is no overlap or concurrency between batches.

I suspect a better implementation would launch a single kernel, where threads occasionally poll an atomic flag to check if an early exit exists, at which point they exit the computation. Or, forgo an attempt at an early exit and just do the naive reduction like in count_if.

profile nsys_profile.zip

Reproducer code:

// compile with `nvcc --std=c++14 -O3 --expt-extended-lambda thrust_logical.cu -o thrust_logical -lnvToolsExt`
#include <cxxabi.h>
#include <nvToolsExt.h>
#include <thrust/device_vector.h>
#include <thrust/logical.h>
#include <thrust/random.h>
#include <chrono>
#include <limits>

template <typename T>
struct time_result {
  T min{std::numeric_limits<T>::max()};
  T max{std::numeric_limits<T>::lowest()};
  T mean{0};
  T sum{0};
  std::size_t count{0};

  void add_measurement(T new_duration) {
    ++count;
    sum += new_duration;
    mean = sum / count;
    min = std::min(min, new_duration);
    max = std::max(max, new_duration);
  }

  std::string to_string() {
    return std::string{
        "count: " + std::to_string(count) + " mean: " + std::to_string(mean) +
        " min: " + std::to_string(min) + " max: " + std::to_string(max)};
  }
};

template <typename Duration = std::chrono::microseconds, typename F,
          typename... Args>
typename Duration::rep time_it(std::string const& name, F&& fun,
                               Args&&... args) {
  const auto begin = std::chrono::high_resolution_clock::now();
  nvtxRangePushA(name.c_str());
  std::forward<F>(fun)(std::forward<Args>(args)...);
  nvtxRangePop();
  const auto end = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<Duration>(end - begin).count();
}

template <typename Duration = std::chrono::microseconds,
          typename InputGenerator, typename F, typename... Args>
auto time_trial(std::string const& name, std::size_t num_trials, InputGenerator&& generator, F&& f,
                Args&&... args) {
  time_result<typename Duration::rep> result{};
  for (auto i = 0; i < num_trials; ++i) {
    auto input = generator();
    result.add_measurement(time_it<Duration>(name, std::forward<F>(f), input,
                                             std::forward<Args>(args)...));
  }
  return result;
}

struct is_positive {
  template <typename T>
  bool __device__ operator()(T v) {
    return v > 0;
  }
};

int main(void) {
  constexpr std::size_t input_size{100'000'000};
  constexpr std::size_t num_trials{100};

  auto all_of = [](auto const& values) {
    return thrust::all_of(thrust::device, values.begin(), values.end(),
                          is_positive{});
  };

  auto count_if = [](auto const& values) {
    return thrust::count_if(thrust::device, values.begin(), values.end(),
                            is_positive{});
  };

  auto no_early_out = []() {
    nvtxRangePushA("no early out input");
    thrust::device_vector<int64_t> values(input_size, 1);
    cudaDeviceSynchronize();
    nvtxRangePop();
    return values;
  };

  auto early_out = []() {
    nvtxRangePushA("early out input");
    thrust::device_vector<int64_t> values(input_size, 1);
    thrust::default_random_engine engine(
        std::chrono::high_resolution_clock::now().time_since_epoch().count());
    thrust::uniform_int_distribution<std::size_t> distribution{0, input_size};
    auto random_location = distribution(engine);
    values[random_location] = -1;
    cudaDeviceSynchronize();
    nvtxRangePop();
    return values;
  };

  std::cout << "No early out(us):\n";
  std::cout << "all of: "
            << time_trial("all_of", num_trials, no_early_out, all_of).to_string()
            << std::endl;
  std::cout << "count if: "
            << time_trial("count_if", num_trials, no_early_out, count_if).to_string()
            << std::endl;

  std::cout << std::endl << "With early out(us):\n";
  std::cout << "all of: "
            << time_trial("all_of", num_trials, early_out, all_of).to_string()
            << std::endl;
  std::cout << "count if: "
            << time_trial("count_if", num_trials, early_out, count_if).to_string()
            << std::endl;

  return 0;
}

Tasks

[ ] https://github.com/NVIDIA/cccl/issues/2113

jrhemstad commented 5 years ago

I updated the original issue to reflect a correction I made in my benchmark code where the location of the "early out" element changes for each trial. This exposed significantly more variance in the all_of results and shows that it is always slower than a count_if.

karthikeyann commented 5 years ago

thrust::all_of and thrust::any_of are implemented using thrust::find_if, /usr/local/cuda-10.0/targets/x86_64-linux/include/thrust/system/detail/generic/find.inl

 91   // this implementation breaks up the sequence into separate intervals
 92   // in an attempt to early-out as soon as a value is found
 93
 94   // TODO incorporate sizeof(InputType) into interval_threshold and round to multiple of 32
 95   const difference_type interval_threshold = 1 << 20;
 96   const difference_type interval_size = (thrust::min)(interval_threshold, n);

could implementing this // TODO solve the performance issue? (if this issue does not arise or is less severe for smaller datatype, this might solve the issue)

karthikeyann commented 5 years ago

@jrhemstad Additional details: thrust::count_if uses _transformreduce, which uses thrust::plus thrust::find_if uses reduce on a tuple(pred, index) with thrust::min operator on index.

index is not necessary for all_of or any_of. This tuple<bool, size_t> will consume more registers too.

Alternative implementation of thrust::any_of could be using _transformreduce with thrust::maximum or thrust::logical_or operator on pred result. Similarly for thrust::all_of using _transformreduce with thrust::minimum or thrust::logical_and. Benchmarked using following extra code.

  auto reduce_and = [](auto const& values) {
    return thrust::transform_reduce(thrust::device, values.begin(), values.end(),
                            is_positive{}, true, thrust::logical_and<bool>{} );
  };

  auto reduce_min = [](auto const& values) {
    return thrust::transform_reduce(thrust::device, values.begin(), values.end(),
                            is_positive{}, false, thrust::minimum<bool>{} );
  };

Runtime are similar for count, logical_and, minimum.

I created new thrust::any_of using _transformreduce with _logicalor, and used it for thrust::all_of (along with early exit). This is faster.

No early out(us): all of: count: 100 mean: 62023 min: 57710 max: 65030 count if: count: 100 mean: 2478 min: 1577 max: 3471 reduce and: count: 100 mean: 2469 min: 1588 max: 3476 reduce min: count: 100 mean: 2494 min: 1566 max: 3477 new all_of: count: 100 mean: 933 min: 352 max: 1942

With early out(us): all of: count: 100 mean: 33574 min: 4148 max: 114289 count if: count: 100 mean: 2466 min: 1615 max: 2926 reduce and: count: 100 mean: 2536 min: 1577 max: 3792 reduce min: count: 100 mean: 2518 min: 1585 max: 3528 new all_of: count: 100 mean: 918 min: 381 max: 1949

karthikeyann commented 5 years ago

new_any_of and new_all_of implementation attached with

Reproducer code:

thrust_logical.cu.zip

jrhemstad commented 5 years ago

I don't know what new all_of is, but there must be something wrong with the implementation because these numbers are impossible:

No early out(us): new all_of: count: 100 mean: 933 min: 352 max: 1942

With early out(us): new all_of: count: 100 mean: 918 min: 381 max: 1949

If no early out exists, then you need to read all 100,000,000 int64_t elements in the input.

(100,000,000 * 8B) / 352us -> 2.2 TB/s

That's well over the 900GB/s theoretical peak of a V100 GPU.

jrhemstad commented 5 years ago

I would expect any reduction based implementation to perform the same (as your results show). Since reduction is bandwidth bound, it doesn't really matter what your binary operator is (sum, or, and, etc.) in the reduction.

Furthermore, your results are fishy because if an early out does not exist, then the new all_of implementation should not be any faster than any of the other reduction based implementations. Since your new all_of is just doing a batched transform_reduce, how could it be faster than just doing a single transform_reduce?

karthikeyann commented 5 years ago

You are right. My implementation has a bug. I fixed it and have the updated benchmarks.

No early out(us):
all of: count: 100 mean: 63981 min: 59478 max: 67889
count if: count: 100 mean: 2515 min: 1588 max: 3482
reduce and: count: 100 mean: 2454 min: 1582 max: 2946
reduce min: count: 100 mean: 2462 min: 1573 max: 3475
new all_of: count: 100 mean: 38071 min: 34610 max: 102533

With early out(us):
all of: count: 100 mean: 36221 min: 6262 max: 64545
count if: count: 100 mean: 2434 min: 1565 max: 2843
reduce and: count: 100 mean: 2448 min: 1624 max: 3183
reduce min: count: 100 mean: 2462 min: 1562 max: 3470
new all_of: count: 100 mean: 24840 min: 934 max: 94396

new_all_of is slower. In fact, max time is worst among all. (early out min is only faster!)

jrhemstad commented 5 years ago

@karthikeyann Those results look much more like what I would expect.

While the new all_of can be faster, the fact that on average it is 10x slower confirms in my mind that the extra complexity of trying to take advantage of an early out actually harms performance in the general case.

NVIDIA / cccl