apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] Aggregate kernel should not mandate alignment #33010

Open asfimport opened 2 years ago

asfimport commented 2 years ago

When using Arrow's aggregate kernel with a table transferred via Arrow Flight (DoGet), it may crash at arrow::util::CheckAlignment(). With the original data it works fine; likewise, if I first serialize the transferred table into bytes and then recreate an Arrow table from those bytes, it works fine.

"flight-alignment-test" attached is the minimal test that can produce the issue, which basically does "sum(total_revenue) group by l_suppkey" using the table from "DoGet()". ("DummyNode" is just used to be the producer of the aggregate node as the producer is required to create the aggregate node)

Reporter: Yifei Yang

Note: This issue was originally created as ARROW-17783. Please see the migration documentation for further details.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: @westonpace Could you look at this?

asfimport commented 2 years ago

Weston Pace / @westonpace: Yes, I'll try to find some time to look at this one. The CheckAlignment method expects, at most, 8-byte alignment, which I think we enforce on serialization, so I'm surprised we get into this situation. However, maybe we are getting here via some kind of slicing or related operation.

asfimport commented 2 years ago

David Li / @lidavidm: I think this pretty much follows from the linked Flight issue, then. Flight doesn't actually manage to enforce 8-byte alignment because of the Protobuf container. (I think we can figure out how to do this, at least for C++ clients/servers, but you'll still have the issue of other clients/servers.)

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Just because Arrow C++ enforces some alignment on serialization doesn't mean other producers do the same.

Alignment is a recommendation in Arrow, not a requirement. So you simply cannot rely on alignment for data that may come from other systems.

asfimport commented 2 years ago

Weston Pace / @westonpace: Hm, the format page suggests it is not only a C++ thing:

Implementations are recommended to allocate memory on aligned addresses (multiple of 8- or 64-bytes) and pad (overallocate) to a length that is a multiple of 8 or 64 bytes. When serializing Arrow data for interprocess communication, these alignment and padding requirements are enforced. If possible, we suggest that you prefer using 64-byte alignment and padding. Unless otherwise noted, padded bytes do not need to have a specific value.

That being said, I guess this is just another feature request. It should be solvable with special head/tail handling. And perhaps, since the C data API is not technically "interprocess communication", we can't rely on this anyway.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Hmm, I had forgotten about these requirements. That said, the spec doesn't say precisely which requirement is enforced (is it 8 bytes or 64 bytes?).

Instead of special head/tail handling, you can probably use unaligned load instructions. They probably don't have any downsides (i.e. they are probably just as fast when the data happens to be aligned).

asfimport commented 2 years ago

Weston Pace / @westonpace: My concern is less performance and more complexity and testing. I've made a proposal at ARROW-18115 which would address this (albeit this particular case would take a performance hit).

asfimport commented 2 years ago

David Li / @lidavidm: It should also not be too bad to fix this in Flight (given gRPC generally forces a copy on us anyways); we would only lose the zero-copy in the case that the batch fits in a single gRPC slice (which is presumably relatively small, but I'd have to check what a typical size is).

asfimport commented 2 years ago

David Li / @lidavidm: I guess if you're going for 512-byte alignment then you'd have to deal with this anyways though.

asfimport commented 2 years ago

Weston Pace / @westonpace: We would only be requiring 64-byte alignment for incoming buffers. The 512-byte alignment is for buffers that Acero allocates itself.

asfimport commented 2 years ago

Weston Pace / @westonpace: FWIW, the particular check failing here isn't even looking for 64-byte alignment. It is simply looking for type alignment (which malloc guarantees). The algorithm does not use any SIMD intrinsics. It compares values between two byte buffers (of fixed-length arrays) at random indices, by casting the uint8_t* to uint64_t* (or whatever is appropriate). Roughly speaking...


*match = *(reinterpret_cast<const uint64_t*>(left.data()) + row_num) == *(reinterpret_cast<const uint64_t*>(right.data()) + row_num);

A safe alternative, that does not rely on alignment, would be:


*match = memcmp(left.data() + (row_num * 8), right.data() + (row_num * 8), 8) == 0;

I created a hasty benchmark:


#include <cstdint>
#include <cstring>
#include <numeric>
#include <random>
#include <vector>

#include <benchmark/benchmark.h>

void CompareWithMemcmp(const uint8_t* left, const uint8_t* right, uint8_t* out,
                       const int* indices, int length) {
  const int* idx_it = indices;
  const int* idx_end = indices + length;
  uint8_t* out_it = out;
  while (idx_it != idx_end) {
    *out_it = memcmp(left + (*idx_it * 8), right + (*idx_it * 8), 8) == 0;
    out_it++;
    idx_it++;
  }
}

void CompareWithCast(const uint8_t* left, const uint8_t* right, uint8_t* out,
                     const int* indices, int length) {
  const int* idx_it = indices;
  const int* idx_end = indices + length;
  uint8_t* out_it = out;
  while (idx_it != idx_end) {
    *out_it = *(reinterpret_cast<const uint64_t*>(left) + *idx_it) ==
              *(reinterpret_cast<const uint64_t*>(right) + *idx_it);
    out_it++;
    idx_it++;
  }
}

template <bool kUseCast, bool kAligned>
static void RandomCompare(benchmark::State& state) {  // NOLINT non-const reference
  constexpr int kNumElements = 10000;
  const std::vector<int64_t> left = MakeIntegers<int64_t>(kNumElements + 1);
  const std::vector<int64_t> right = MakeIntegers<int64_t>(kNumElements + 1);
  std::vector<int> indices(kNumElements);
  std::vector<uint8_t> matches(kNumElements);
  std::iota(indices.begin(), indices.end(), 0);
  std::default_random_engine gen(42);
  std::shuffle(indices.begin(), indices.end(), gen);
  const uint8_t* left_start = reinterpret_cast<const uint8_t*>(left.data());
  const uint8_t* right_start = reinterpret_cast<const uint8_t*>(right.data());
  if (!kAligned) {
    left_start += 4;
    right_start += 4;
  }
  for (auto _ : state) {
    if (kUseCast) {
      CompareWithCast(left_start, right_start, matches.data(), indices.data(),
                      kNumElements);
    } else {
      CompareWithMemcmp(left_start, right_start, matches.data(), indices.data(),
                        kNumElements);
    }
  }
}

static void RandomCompareMemcmpAligned(benchmark::State& state) {
  RandomCompare<false, true>(state);
}
static void RandomCompareMemcmpUnaligned(benchmark::State& state) {
  RandomCompare<false, false>(state);
}
static void RandomCompareCast(benchmark::State& state) {
  RandomCompare<true, true>(state);
}

BENCHMARK(RandomCompareMemcmpAligned);
BENCHMARK(RandomCompareMemcmpUnaligned);
BENCHMARK(RandomCompareCast);

Results:


-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
RandomCompareMemcmpAligned         9593 ns         9592 ns        67730
RandomCompareMemcmpUnaligned      10189 ns        10187 ns        65334
RandomCompareCast                  9103 ns         9102 ns        81895

Now, I am sure some alternative can be devised that is both safe and performant. However, this seems like a burden. It's a burden to develop safe algorithms, and it's a burden to ensure we unit test thoroughly enough (e.g. fuzz with random unaligned buffers) to avoid segmentation faults. If someone is willing to do this work then I'm happy to consider an alternative proposal. However, I'd rather see more development, done safely, by manually aligning buffers, than spend time supporting a use case without a clear, solid need (e.g. it seems we can fix this particular case and, even if we can't, does the performance hit of forced alignment matter here?).

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: The idiomatic way to work around the alignment issue is to use memcpy. We have dedicated functions for that: https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/ubsan.h

Note that modern CPUs don't mind unaligned loads of scalar types. It's just that it's UB in C/C++.

asfimport commented 2 years ago

Weston Pace / @westonpace: Do the compute kernels all have tests verifying that they work correctly on unaligned buffers? I'm not trying to be stubborn here, but I haven't been very involved in the development of the compute kernels and honestly don't know.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Unfortunately not. It would actually be a good idea to add generic checks for this (at least for those kernels whose tests use the generic check functions).

asfimport commented 2 years ago

Weston Pace / @westonpace: Let's leave this open then. I'm going to unassign it from myself. At the moment, supporting this use case is not a high priority for me personally, mainly because I'm not aware of a real need. If someone wants to take this work up and help maintain it then I'm happy to review their work. Or if someone has a case where they absolutely need to work with unaligned buffers and can't afford the cost to force alignment then they are welcome to mention it here.