Open jrhemstad opened 5 years ago
I updated the original issue to reflect a correction I made in my benchmark code where the location of the "early out" element changes for each trial. This exposed significantly more variance in the all_of
results and shows that it is always slower than a count_if
.
thrust::all_of
and thrust::any_of
are implemented using thrust::find_if
,
/usr/local/cuda-10.0/targets/x86_64-linux/include/thrust/system/detail/generic/find.inl
91 // this implementation breaks up the sequence into separate intervals
92 // in an attempt to early-out as soon as a value is found
93
94 // TODO incorporate sizeof(InputType) into interval_threshold and round to multiple of 32
95 const difference_type interval_threshold = 1 << 20;
96 const difference_type interval_size = (thrust::min)(interval_threshold, n);
could implementing this // TODO
solve the performance issue?
(if this issue does not arise or is less severe for smaller datatype, this might solve the issue)
@jrhemstad
Additional details:
thrust::count_if
uses _transformreduce, which uses thrust::plus
thrust::find_if
uses reduce on a tuple(pred, index)
with thrust::min
operator on index.
index is not necessary for all_of or any_of. This tuple<bool, size_t> will consume more registers too.
Alternative implementation of thrust::any_of
could be using _transformreduce with thrust::maximum
or thrust::logical_or
operator on pred result.
Similarly for thrust::all_of
using _transformreduce with thrust::minimum
or thrust::logical_and
.
Benchmarked using following extra code.
auto reduce_and = [](auto const& values) {
return thrust::transform_reduce(thrust::device, values.begin(), values.end(),
is_positive{}, true, thrust::logical_and<bool>{} );
};
auto reduce_min = [](auto const& values) {
return thrust::transform_reduce(thrust::device, values.begin(), values.end(),
is_positive{}, false, thrust::minimum<bool>{} );
};
Runtime are similar for count, logical_and, minimum.
I created new thrust::any_of
using _transformreduce with _logicalor, and used it for thrust::all_of
(along with early exit). This is faster.
No early out(us): all of: count: 100 mean: 62023 min: 57710 max: 65030 count if: count: 100 mean: 2478 min: 1577 max: 3471 reduce and: count: 100 mean: 2469 min: 1588 max: 3476 reduce min: count: 100 mean: 2494 min: 1566 max: 3477 new all_of: count: 100 mean: 933 min: 352 max: 1942
With early out(us): all of: count: 100 mean: 33574 min: 4148 max: 114289 count if: count: 100 mean: 2466 min: 1615 max: 2926 reduce and: count: 100 mean: 2536 min: 1577 max: 3792 reduce min: count: 100 mean: 2518 min: 1585 max: 3528 new all_of: count: 100 mean: 918 min: 381 max: 1949
I don't know what new all_of
is, but there must be something wrong with the implementation because these numbers are impossible:
No early out(us): new all_of: count: 100 mean: 933 min: 352 max: 1942
With early out(us): new all_of: count: 100 mean: 918 min: 381 max: 1949
If no early out exists, then you need to read all 100,000,000 int64_t
elements in the input.
(100,000,000 * 8B) / 352us -> 2.2 TB/s
That's well over the 900GB/s theoretical peak of a V100 GPU.
I would expect any reduction based implementation to perform the same (as your results show). Since reduction is bandwidth bound, it doesn't really matter what your binary operator is (sum, or, and, etc.) in the reduction.
Furthermore, your results are fishy because if an early out does not exist, then the new all_of
implementation should not be any faster than any of the other reduction based implementations. Since your new all_of
is just doing a batched transform_reduce
, how could it be faster than just doing a single transform_reduce
?
You are right. My implementation has a bug. I fixed it and have the updated benchmarks.
No early out(us):
all of: count: 100 mean: 63981 min: 59478 max: 67889
count if: count: 100 mean: 2515 min: 1588 max: 3482
reduce and: count: 100 mean: 2454 min: 1582 max: 2946
reduce min: count: 100 mean: 2462 min: 1573 max: 3475
new all_of: count: 100 mean: 38071 min: 34610 max: 102533
With early out(us):
all of: count: 100 mean: 36221 min: 6262 max: 64545
count if: count: 100 mean: 2434 min: 1565 max: 2843
reduce and: count: 100 mean: 2448 min: 1624 max: 3183
reduce min: count: 100 mean: 2462 min: 1562 max: 3470
new all_of: count: 100 mean: 24840 min: 934 max: 94396
new_all_of
is slower. In fact, max time is worst among all. (early out min is only faster!)
@karthikeyann Those results look much more like what I would expect.
While the new all_of
can be faster, the fact that on average it is 10x slower confirms in my mind that the extra complexity of trying to take advantage of an early out actually harms performance in the general case.
In a
thrust::all_of
, when the first element that violates the predicate is discovered, the computation can be aborted, i.e., an "early exit".For example, imagine you are given a
thrust::device_vector<int64_t>
and want to check if any of the values are negative. You could do this with athrust::all_of
or with athrust::count_if
:count_if
must read everything invalues
, whereasall_of
can shortcut if an early exit exists. Therefore, I would expectall_of
to out performcount_if
when one or more negative values exist. If no negative values are present, then bothall_of
andcount_if
must read everything invalues
and I would expect their performance to be roughly equivalent.However, this is not the case. I have found that the performance of
thrust::all_of
is extremely erratic with a 10x difference between the best and worst performance. Furthermore, anall_of
is always slower than a naive reduction as incount_if
.Here are the results of performing 100 trials of the example I described above on an input size of 100,000,000 million
int64_t
elements on a GV100.No Early Exit
all_of
count_if
Single Early Exit
all_of
count_if
As you can see, whether or not an early exit exists,
all_of
is always significantly slower than acount_if
.Looking at the profile of
all_of
(attached), it appears that the reason it is so slow is because a single invocation ofall_of
results in ~50 invocations ofDeviceReduceKernel
. I suspect this is because the implementation ofall_of
does a set of batched reductions in attempt to avoid reading the entire input when an early exit exists. However, launching all of these small kernels (each with their own allocation/free) results in a significant amount of overhead. This overhead is exacerbated by the fact that each batch is executed on the same stream, meaning there is no overlap or concurrency between batches.I suspect a better implementation would launch a single kernel, where threads occasionally poll an atomic flag to check if an early exit exists, at which point they exit the computation. Or, forgo an attempt at an early exit and just do the naive reduction like in
count_if
.nsys_profile.zip
Reproducer code:
Tasks