pitrou opened 1 year ago
@westonpace @felipecrv @wgtmac Thoughts?
(also @js8544 FYI)
I think it would be interesting to see just how much faster working on K-sized batches would be, and whether it's worthwhile. This would also potentially be a worthwhile enhancement for the Go side, which works the same way.
Did a quick test; it looks like the gap should be smaller: https://quick-bench.com/q/ywSl6NYEUc2KYq4h6MkP7t3yWqo
The checked version uses the gcc `__builtin_*_overflow` builtins, same as safe-math. From the disassembly, the unchecked version is vectorized.
By the way, the benchmarks posted don't involve any nulls, so the real-world differences might be more nuanced.
Does Python introduce some overhead on this? It seems that the gap on the pure C++ side (from @cyb70289) is smaller.
> By the way, the benchmarks posted don't involve any nulls, so the real-world differences might be more nuanced.
Null checks can be layered on top of the batch arithmetic-safe functions, provided the values do not overflow. This can help avoid some unnecessary conditionals.
> Does Python introduce some overhead on this? It seems that the gap on the pure C++ side (from @cyb70289) is smaller.
Python introduces overhead, but that's why I chose a large array size.
@cyb70289's implementation is different from our current implementation. `Add` doesn't check nulls and simply computes all pairs, null or not, while `AddChecked` only computes the valid pairs. (Code is here: it uses `MakeArithmeticFunctionNotNull` while `Add` uses `MakeArithmeticFunction`.) I guess it was assumed that `__builtin_*_overflow` is an expensive operation and should be avoided as much as possible. However, I ran some benchmarks and the results don't agree with this assumption.

With null checking as it currently is, `AddChecked` is 15~40x slower, and null checking only pays off when more than 95% of the values are null, which I'd say is a rare case.

Without null checking, i.e. also using `MakeArithmeticFunction` for `AddChecked`, it is only 2.5x slower, no matter the null proportion of the input:
This explains the difference between @pitrou's and @cyb70289's results. This was run on my Mac M1, so it may be different on x86 machines, but the difference is clear enough.
I wrote a simple kernel exec that checks the nullity only when overflow occurs:
```cpp
Status AddCheckedExec(KernelContext* ctx, const ExecSpan& span, ExecResult* result) {
  int64_t length = span.length;
  auto left_arr = span[0].array;
  auto right_arr = span[1].array;
  auto* out_span = result->array_span_mutable();
  auto* left_it = left_arr.GetValues<int32_t>(1);
  auto* right_it = right_arr.GetValues<int32_t>(1);
  auto* out_it = out_span->GetValues<int32_t>(1);
  for (int64_t i = 0; i < length; ++i) {
    if (ARROW_PREDICT_FALSE(AddWithOverflow(*left_it++, *right_it++, out_it++))) {
      auto left_valid = left_arr.IsValid(i);
      auto right_valid = right_arr.IsValid(i);
      if (left_valid && right_valid) {
        return Status::Invalid("overflow");
      }
    }
  }
  return Status::OK();
}
```
This achieves a similar ~2.5x slowdown as the one above.
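For readers outside the Arrow codebase, the strategy can be sketched in a self-contained form (a hypothetical simplified version: names are invented, the GCC/Clang `__builtin_add_overflow` builtin stands in for Arrow's `AddWithOverflow`, and a plain boolean vector stands in for the validity bitmap):

```cpp
#include <cstdint>
#include <vector>

// Sketch of "check validity only on the rare overflow path".
// Returns false only when a *valid* pair overflowed; wrapped results for
// null slots are harmless because their values are never read.
bool AddCheckedSketch(const std::vector<int32_t>& left,
                      const std::vector<int32_t>& right,
                      const std::vector<bool>& left_valid,
                      const std::vector<bool>& right_valid,
                      std::vector<int32_t>* out) {
  out->resize(left.size());
  for (std::size_t i = 0; i < left.size(); ++i) {
    // Unconditionally add; only consult validity when an overflow occurs.
    if (__builtin_add_overflow(left[i], right[i], &(*out)[i])) {
      if (left_valid[i] && right_valid[i]) return false;  // real overflow
    }
  }
  return true;
}
```

The hot loop carries a single, well-predicted branch, which is why it stays close to the unchecked version in practice.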
> - detect potential overflow over all K pairs of values
> - if no overflow was detected, add all K pairs of values
I'm not sure how to detect overflow without performing the actual addition, so I implemented the following algorithm:
```cpp
Status AddCheckedExec(KernelContext* ctx, const ExecSpan& span, ExecResult* result) {
  int64_t length = span.length;
  auto left_arr = span[0].array;
  auto right_arr = span[1].array;
  auto* out_span = result->array_span_mutable();
  auto* left_it = left_arr.GetValues<int32_t>(1);
  auto* right_it = right_arr.GetValues<int32_t>(1);
  auto* out_it = out_span->GetValues<int32_t>(1);
  std::vector<int> overflow(length, false);
  for (int64_t i = 0; i < length; ++i) {
    overflow[i] = AddWithOverflow(*left_it++, *right_it++, out_it++);
  }
  for (int64_t i = 0; i < length; ++i) {
    if (ARROW_PREDICT_FALSE(overflow[i])) {
      auto left_valid = left_arr.IsValid(i);
      auto right_valid = right_arr.IsValid(i);
      if (left_valid && right_valid) {
        return Status::Invalid("overflow");
      }
    }
  }
  return Status::OK();
}
```
This is however 5x slower.
Yes, add-with-overflow is basically a single fast CPU instruction, there's probably no point in trying to avoid it.
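One way to keep that instruction while still getting batch-level behavior (a sketch only, not benchmarked against Arrow's kernels; `FirstOverflowBatched` is an invented name) is to OR-reduce the overflow flags over a K-sized batch, so the common path stays branch-free and only an offending batch is re-scanned:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Branch-free inner loop per 256-element batch; only a batch whose
// OR-reduced overflow flag is set gets re-scanned to find the culprit.
// Returns the index of the first overflowing pair, or -1 if none
// (out may hold wrapped values for positions at or past an overflow).
long long FirstOverflowBatched(const int32_t* left, const int32_t* right,
                               int32_t* out, std::size_t n) {
  constexpr std::size_t kBatch = 256;
  for (std::size_t base = 0; base < n; base += kBatch) {
    const std::size_t end = base + std::min(kBatch, n - base);
    bool any_overflow = false;
    for (std::size_t i = base; i < end; ++i) {
      any_overflow |= __builtin_add_overflow(left[i], right[i], &out[i]);
    }
    if (any_overflow) {  // rare path: locate the offending element
      for (std::size_t i = base; i < end; ++i) {
        int32_t tmp;
        if (__builtin_add_overflow(left[i], right[i], &tmp)) {
          return static_cast<long long>(i);
        }
      }
    }
  }
  return -1;
}
```

The null check from the earlier comments could then be applied only to the reported index, keeping validity lookups off the hot path.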
Since `MakeArithmeticFunctionNotNull` (or, actually, `ScalarBinaryNotNull`) appears to generate very suboptimal kernels, perhaps we should build a different, more performant abstraction? This would help more cases than just checked arithmetic.

Basically, instead of passing `ScalarBinaryNotNull` a scalar operation, we should be able to pass it a batched operation. In other words, instead of:
```cpp
struct AddChecked {
  template <typename T, typename Arg0, typename Arg1>
  static T Call(KernelContext*, Arg0 left, Arg1 right, Status* st) {
    T result = 0;
    if (ARROW_PREDICT_FALSE(AddWithOverflow(left, right, &result))) {
      *st = Status::Invalid("overflow");
    }
    return result;
  }
};
```
do something like:
```cpp
struct AddChecked {
  // Instructs ScalarBinaryNotNull to use batching
  static constexpr bool kBatched = true;

  template <typename T, typename Arg0, typename Arg1>
  static Status Call(KernelContext*, span<const Arg0> left, span<const Arg1> right,
                     span<T> out) {
    const size_t len = out.size();
    bool ok = true;
    for (size_t i = 0; i < len; ++i) {
      // AddWithOverflow returns true on overflow, hence the negation
      ok &= !AddWithOverflow(left[i], right[i], &out[i]);
    }
    return ok ? Status::OK() : Status::Invalid("overflow");
  }
};
```
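The contract of such a batched op can be exercised outside Arrow (a sketch using raw pointers instead of `span`, with a local stand-in for Arrow's `AddWithOverflow`; `BatchedAddChecked` is an invented name):

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for Arrow's AddWithOverflow: returns true when the add overflows.
inline bool AddWithOverflow(int32_t a, int32_t b, int32_t* out) {
  return __builtin_add_overflow(a, b, out);
}

// Batched checked add: returns true only if every pair added without overflow.
// The loop body is branch-free, so it is a good auto-vectorization candidate.
bool BatchedAddChecked(const int32_t* left, const int32_t* right, int32_t* out,
                       std::size_t len) {
  bool ok = true;
  for (std::size_t i = 0; i < len; ++i) {
    ok &= !AddWithOverflow(left[i], right[i], &out[i]);
  }
  return ok;
}
```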
> I wrote a simple kernel exec that checks the nullity only when overflow occurs:
This is an interesting approach, but it may fall flat if many of the values underlying null entries trigger overflow.
I definitely agree that passing batches is better than passing single values. A question is: how do we handle nulls now? If we still use `VisitTwoBitBlocksVoid` like `ScalarBinaryNotNull` currently does, the number of `if`s will be a real headache.
> This is an interesting approach, but it may fall flat if many of the values underlying null entries trigger overflow.
IMHO this approach has the fewest checks, so other algorithms will probably "fall flatter" in such cases. The result is OK if `no_overflow || left_is_null || right_is_null`, and overflows are less common than nulls, so I believe short-circuiting on overflow should be optimal.
> > I wrote a simple kernel exec that checks the nullity only when overflow occurs:
>
> This is an interesting approach, but it may fall flat if many of the values underlying null entries trigger overflow.
Maybe we can mask the value with the validity bit, so that no overflow can happen if either value is invalid? Something like:

```cpp
ok &= !AddWithOverflow((*left_it++) & (-left_arr.IsValid(i)),
                       (*right_it++) & (-right_arr.IsValid(i)), out_it++);
```

or

```cpp
ok &= (!AddWithOverflow(*left_it++, *right_it++, out_it++) |
       (!left_arr.IsValid(i)) | (!right_arr.IsValid(i)));
```

This should have stable performance, but I guess it may introduce too many instructions.
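For what it's worth, the masking trick can be checked in isolation (a sketch with invented names; `valid` is 1 for a valid slot and 0 for a null, standing in for `IsValid`):

```cpp
#include <cstdint>

// -valid is all-ones (-1) when valid == 1 and all-zeros when valid == 0,
// so ANDing zeroes out the value of a null slot and it can never overflow.
inline int32_t MaskByValidity(int32_t value, int32_t valid) {
  return value & -valid;
}

// With both inputs masked, overflow can only be reported when both are valid.
inline bool MaskedAddOverflow(int32_t left, int32_t left_valid, int32_t right,
                              int32_t right_valid, int32_t* out) {
  return __builtin_add_overflow(MaskByValidity(left, left_valid),
                                MaskByValidity(right, right_valid), out);
}
```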
Describe the enhancement requested
Users should generally prefer the checked versions of arithmetic functions such as "add", because unchecked arithmetic is inherently unsafe and can silently corrupt data (MySQL is commonly hated for this). However, our checked arithmetic functions are quite a bit slower than the unchecked ones.
For example, "add" vs. "add_checked":
... that's 7.8 Gitems/sec vs. 0.5 Gitems/sec, or a 15x difference.
"sum" vs. "sum_checked" shows a similar problem in PR #37536.
Implementation-wise, "add_checked" is currently simple and naïve: it adds and detects overflow on each pair of values, one by one. This adds many conditional branches, which probably stifles instruction-level parallelism on the CPU as well as auto-vectorization by the compiler.
A potentially better implementation would be to operate on K-sized batches (with a K such as 256), in two steps:
- detect potential overflow over all K pairs of values
- if no overflow was detected, add all K pairs of values
Each step is then unconditional and a reasonable candidate for auto-vectorization.
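The two steps could be sketched like this for int32 (an illustration with invented names, not the proposed implementation; it widens to int64 to detect overflow without a checked add, and only then runs the plain addition):

```cpp
#include <cstddef>
#include <cstdint>

// Step 1: branch-free overflow detection over a K-sized batch.
// Step 2: plain addition, run only when the batch is clean.
// Returns false when some pair would overflow (out is then left untouched).
bool AddBatchTwoStep(const int32_t* left, const int32_t* right, int32_t* out,
                     std::size_t k) {
  bool any_overflow = false;
  for (std::size_t i = 0; i < k; ++i) {
    const int64_t wide = int64_t{left[i]} + int64_t{right[i]};
    any_overflow |= (wide < INT32_MIN) | (wide > INT32_MAX);
  }
  if (any_overflow) return false;
  for (std::size_t i = 0; i < k; ++i) {
    out[i] = left[i] + right[i];
  }
  return true;
}
```

Both loops are unconditional over the batch, which is what makes them candidates for auto-vectorization.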
Of course, we can also consider that 0.5 Gitems/sec is good enough and unlikely to be a bottleneck for non-trivial workloads.
Component(s)
C++