apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Compute] Checked arithmetic functions are slow-ish #37678

Open pitrou opened 1 year ago

pitrou commented 1 year ago

Describe the enhancement requested

Users should generally prefer the checked versions of arithmetic functions such as "add", because unchecked arithmetic is inherently unsafe and can silently corrupt data (MySQL is commonly hated for this). However, our checked arithmetic functions are considerably slower than the unchecked ones:

For example, "add" vs. "add_checked":

$ python -m timeit -s "import pyarrow as pa, pyarrow.compute as pc; a = pa.array([42]*1_000_000, type='int32')" "pc.add(a, a)"
2000 loops, best of 5: 128 usec per loop
$ python -m timeit -s "import pyarrow as pa, pyarrow.compute as pc; a = pa.array([42]*1_000_000, type='int32')" "pc.add_checked(a, a)"
200 loops, best of 5: 1.92 msec per loop

... that's 7.8 Gitems/sec vs. 0.5 Gitems/sec, or a 15x difference.

"sum" vs. "sum_checked" shows a similar problem in PR #37536.

Implementation-wise, "add_checked" is currently simple and naïve: it adds and detects overflow on each pair of values, one by one. This adds many conditional branches and probably stifles instruction-level parallelism on the CPU, as well as auto-vectorization by the compiler.

A potentially better implementation would be to operate on K-sized batches (with a K such as 256), in two steps:

  1. detect potential overflow over all K pairs of values
  2. if no overflow was detected, add all K pairs of values

Each step is then unconditional and a reasonable candidate for auto-vectorization.
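The two-step idea above can be sketched as follows (a minimal standalone illustration, not Arrow code; the function name and `kBatchSize` are mine, and the overflow probe still performs the add into a scratch value, using the GCC/Clang builtin):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative batch size; the issue suggests something like K = 256.
constexpr std::size_t kBatchSize = 256;

// Step 1 scans a K-sized batch for overflow with no data-dependent branches
// inside the loop; step 2 runs the plain adds only when the whole batch is
// clean.  Both inner loops are unconditional, so the compiler has a fair
// shot at auto-vectorizing them.  Returns false if any pair overflows.
bool BatchedCheckedAdd(const std::int32_t* left, const std::int32_t* right,
                       std::int32_t* out, std::size_t n) {
  for (std::size_t start = 0; start < n; start += kBatchSize) {
    const std::size_t len = std::min(kBatchSize, n - start);
    // Step 1: detect overflow over all pairs in the batch.
    bool overflow = false;
    for (std::size_t i = 0; i < len; ++i) {
      std::int32_t scratch;
      overflow |= __builtin_add_overflow(left[start + i], right[start + i], &scratch);
    }
    if (overflow) return false;  // the caller would raise Status::Invalid
    // Step 2: the batch is known clean; plain, trivially vectorizable adds.
    for (std::size_t i = 0; i < len; ++i) {
      out[start + i] = left[start + i] + right[start + i];
    }
  }
  return true;
}
```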

Of course, we can also consider that 0.5 Gitems/sec is good enough and unlikely to be a bottleneck for non-trivial workloads.

Component(s)

C++

pitrou commented 1 year ago

@westonpace @felipecrv @wgtmac Thoughts?

pitrou commented 1 year ago

(also @js8544 FYI)

zeroshade commented 1 year ago

I think it would be interesting to see just how much faster working on K-sized batches would be, and whether it's worthwhile. This would also potentially be a worthwhile enhancement for the Go side, which works the same way.

cyb70289 commented 1 year ago

Did a quick test; it looks like the gap should be smaller. https://quick-bench.com/q/ywSl6NYEUc2KYq4h6MkP7t3yWqo

The checked version uses GCC's __builtin_*_overflow, same as safe-math. From the disassembly, the unchecked version is vectorized.
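For reference, the builtin performs the addition and reports overflow in one step; it typically compiles to an add plus a flags check, which is why the checked loop can stay cheap. A minimal illustration (the wrapper name is mine):

```cpp
#include <cassert>
#include <cstdint>

// Thin wrapper around the GCC/Clang builtin: stores the (possibly wrapped)
// sum into *out and returns true if and only if the addition overflowed.
bool CheckedAddI32(std::int32_t a, std::int32_t b, std::int32_t* out) {
  return __builtin_add_overflow(a, b, out);
}
```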

pitrou commented 1 year ago

By the way, the benchmarks posted don't involve any nulls, so the real-world differences might be more nuanced.

wgtmac commented 1 year ago

Does Python introduce some overhead on this? It seems that the gap on the pure C++ side (from @cyb70289) is smaller.

By the way, the benchmarks posted don't involve any nulls, so the real-world differences might be more nuanced.

Null checks can be on top of the batch arithmetic-safe functions provided the values do not overflow. This can help avoid some unnecessary conditions.

pitrou commented 1 year ago

Does Python introduce some overhead on this? It seems that the gap on the pure C++ side (from @cyb70289) is smaller.

Python introduces overhead, but that's why I chose a large array size.

@cyb70289 's implementation is different from our current implementation.

js8544 commented 1 year ago

Add doesn't check nulls and simply computes all pairs, null or not, while AddChecked only computes the valid pairs. (Code is here; it uses MakeArithmeticFunctionNotNull while Add uses MakeArithmeticFunction.) I guess it was assumed that __builtin_*_overflow is an expensive operation and should be avoided as much as possible. However, I ran some benchmarks and the results don't agree with this assumption.

With null checking as it currently is, AddChecked is 15~40x slower, and the benefit of null checks only appears when >95% of values are null, which I'd say is a rare case:

[benchmark chart]

Without null checking, i.e. also using MakeArithmeticFunction for AddChecked, it is only 2.5x slower, no matter the null proportion of the input:

[benchmark chart]

This explains the difference between @pitrou's and @cyb70289's results. This was run on my Mac M1, so it may be different on x86 machines, but the difference is clear enough.

js8544 commented 1 year ago

I wrote a simple kernel exec that checks the nullity only when overflow occurs:

Status AddCheckedExec(KernelContext* ctx, const ExecSpan& span, ExecResult* result) {
  int64_t length = span.length;
  auto left_arr = span[0].array;
  auto right_arr = span[1].array;
  auto* out_span = result->array_span_mutable();

  auto* left_it = left_arr.GetValues<int32_t>(1);
  auto* right_it = right_arr.GetValues<int32_t>(1);
  auto* out_it = out_span->GetValues<int32_t>(1);
  for (int64_t i = 0; i < length; ++i) {
    // Add unconditionally; consult the validity bitmaps only on the
    // (presumably rare) overflow path.
    if (ARROW_PREDICT_FALSE(AddWithOverflow(*left_it++, *right_it++, out_it++))) {
      auto left_valid = left_arr.IsValid(i);
      auto right_valid = right_arr.IsValid(i);
      if (left_valid && right_valid) {
        return Status::Invalid("overflow");
      }
    }
  }

  return Status::OK();
}

This achieves a similar 2.5x time as the one above.

  1. detect potential overflow over all K pairs of values
  2. if no overflow was detected, add all K pairs of values

I'm not sure how to detect overflow without performing the actual addition. So I implemented the following algo:

  1. Compute all additions and save the overflow status to a vector.
  2. Loop the overflow vector and check for nullity.

Status AddCheckedExec(KernelContext* ctx, const ExecSpan& span, ExecResult* result) {
  int64_t length = span.length;
  auto left_arr = span[0].array;
  auto right_arr = span[1].array;
  auto* out_span = result->array_span_mutable();

  auto* left_it = left_arr.GetValues<int32_t>(1);
  auto* right_it = right_arr.GetValues<int32_t>(1);
  auto* out_it = out_span->GetValues<int32_t>(1);
  std::vector<int> overflow(length, false);

  for (int64_t i = 0; i < length; ++i) {
    overflow[i] = AddWithOverflow(*left_it++, *right_it++, out_it++);
  }

  for (int64_t i = 0; i < length; ++i) {
    if (ARROW_PREDICT_FALSE(overflow[i])) {
      auto left_valid = left_arr.IsValid(i);
      auto right_valid = right_arr.IsValid(i);
      if (left_valid && right_valid) {
        return Status::Invalid("overflow");
      }
    }
  }

  return Status::OK();
}

This is, however, 5x slower.

pitrou commented 1 year ago

Yes, add-with-overflow is basically a single fast CPU instruction; there's probably no point in trying to avoid it.

pitrou commented 1 year ago

Since MakeArithmeticFunctionNotNull (or, more precisely, the underlying ScalarBinaryNotNull) appears to generate very suboptimal code, perhaps we should build a different, more performant abstraction? This would help more cases than just checked arithmetic.

Basically, instead of passing ScalarBinaryNotNull a scalar operation, we should be able to pass it a batched operation. In other words, instead of:

struct AddChecked {
  template <typename T, typename Arg0, typename Arg1>
  static T Call(KernelContext*, Arg0 left, Arg1 right, Status* st) {
    T result = 0;
    if (ARROW_PREDICT_FALSE(AddWithOverflow(left, right, &result))) {
      *st = Status::Invalid("overflow");
    }
    return result;
  }
};

do something like:

struct AddChecked {
  // Instructs ScalarBinaryNotNull to use batching
  static constexpr bool kBatched = true;

  template <typename T, typename Arg0, typename Arg1>
  static Status Call(KernelContext*, span<const Arg0> left, span<const Arg1> right, span<T> out) {
    const size_t len = out.size();
    bool ok = true;
    for (size_t i = 0; i < len; ++i) {
      // AddWithOverflow returns true on overflow, so negate before AND-ing
      ok &= !AddWithOverflow(left[i], right[i], &out[i]);
    }
    return ok ? Status::OK() : Status::Invalid("overflow");
  }
};
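A standalone sketch of how such a framework could dispatch on the batched flag at compile time (all names here are hypothetical, not the real Arrow kernel machinery):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// An op that opts into batching: it receives whole arrays and keeps its
// inner loop branch-free so the compiler can vectorize it.
struct AddCheckedBatched {
  static constexpr bool kBatched = true;
  // Returns true on success, false if any pair overflowed.
  static bool Call(const std::int32_t* left, const std::int32_t* right,
                   std::int32_t* out, std::size_t len) {
    bool ok = true;
    for (std::size_t i = 0; i < len; ++i) {
      ok &= !__builtin_add_overflow(left[i], right[i], &out[i]);
    }
    return ok;
  }
};

// The framework hands batched ops whole spans; ops that don't opt in fall
// back to the per-value loop (the discarded branch is never instantiated).
template <typename Op>
bool RunBinaryKernel(const std::int32_t* left, const std::int32_t* right,
                     std::int32_t* out, std::size_t len) {
  if constexpr (Op::kBatched) {
    return Op::Call(left, right, out, len);
  } else {
    bool ok = true;
    for (std::size_t i = 0; i < len; ++i) {
      ok &= Op::Call(left[i], right[i], &out[i]);
    }
    return ok;
  }
}
```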

pitrou commented 1 year ago

I wrote a simple kernel exec that checks the nullity only when overflow occurs:

This is an interesting approach, but it may fall flat if many of the values underlying null entries trigger overflow.

js8544 commented 1 year ago

I definitely agree that passing batches is better than passing single values. A question is how we handle nulls now. If we still use VisitTwoBitBlocksVoid like ScalarBinaryNotNull currently does, the number of ifs will be a real headache.

js8544 commented 1 year ago

This is an interesting approach, but it may fall flat if many of the values underlying null entries trigger overflow.

IMHO this approach has the fewest checks, so other algorithms will probably "fall flatter" in such cases. The result is OK if no_overflow || left_is_null || right_is_null, and overflows are less common than nulls, so I believe short-circuiting on overflow should be optimal.

cyb70289 commented 1 year ago

I wrote a simple kernel exec that checks the nullity only when overflow occurs:

This is an interesting approach, but it may fall flat if many of the values underlying null entries trigger overflow.

Maybe we can mask the value with the valid bit to make sure no overflow can happen if either value is invalid? Something like:

  ok &= !AddWithOverflow((*left_it++) & (-left_arr.IsValid(i)), (*right_it++) & (-right_arr.IsValid(i)), out_it++);

or:

  ok &= (!AddWithOverflow(*left_it++, *right_it++, out_it++) | (!left_arr.IsValid(i)) | (!right_arr.IsValid(i)));

This should have stable performance, but I guess it may introduce too many instructions.
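A standalone sketch of the masking variant (the function name and the byte-per-value validity layout are mine, used instead of Arrow's packed bitmap purely to keep the sketch self-contained):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// AND each value with -(valid bit): the mask is all-ones (-1) for a valid
// slot and 0 for a null, so values under a null become 0 and can never
// participate in an overflow.  Returns false only when two valid values
// actually overflowed.
bool MaskedCheckedAdd(const std::int32_t* left, const std::uint8_t* left_valid,
                      const std::int32_t* right, const std::uint8_t* right_valid,
                      std::int32_t* out, std::size_t len) {
  bool ok = true;
  for (std::size_t i = 0; i < len; ++i) {
    const std::int32_t lmask = -static_cast<std::int32_t>(left_valid[i]);
    const std::int32_t rmask = -static_cast<std::int32_t>(right_valid[i]);
    ok &= !__builtin_add_overflow(left[i] & lmask, right[i] & rmask, &out[i]);
  }
  return ok;
}
```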