[C++] Acero buffer alignment

asfimport commented 2 years ago

This is a general JIRA to centralize some discussion / proposal on a buffer alignment strategy for Acero based on discussions that have happened in a few different contexts. Any actual work will probably span multiple JIRAs, some of which are already in progress.

Motivations:

Incoming data may not be aligned at all. Some kernel functions and parts of Acero (e.g. aggregation, join) use explicit SIMD instructions (e.g. intrinsics) and will fail / cause corruption if care is not taken to use unaligned loads (e.g. ARROW-17783 / https://github.com/apache/arrow/issues/33010). There are parts of the code that assume fixed arrays with size T are at least T aligned. This is generally a safe assumption for data generated by arrow-c++ (which allocates all buffers with 64 byte alignment) but then leads to errors when processing data from flight.
Dataset writes and mid-plan spilling both can benefit form direct I/O, less for performance reasons and more for memory management reasons. However, in order to use direct I/O a buffer needs to be aligned, typically to 512 bytes. This is larger than the current minimum alignment requirements.

Proposal:

Allow the minimum alignment of a memory pool to be configurable. This is similar to the proposal of ARROW-17836 / https://github.com/apache/arrow/issues/33056 but does not require much of an API change (at the expense of being slightly less flexible).
Add a capability to realign a buffer/array/batch/table to a target alignment. This would only modify buffers that are not already aligned. Basically, given an arrow object, and a target memory pool, migrate a buffer to the target memory pool if its alignment is insufficient.
Acero, in the source node, forces all buffers to be 64 byte aligned. This way the internals of Acero don't have to worry about this case. This introduces a performance penalty when buffers are not aligned but is much simpler to maintain and test than trying to support any random buffer. To avoid this penalty it would be simpler to avoid the unaligned buffers in the first place.
Acero requires a memory pool that has 512-byte alignment so that Acero-allocated buffers can use direct I/O. If the default memory pool does not have 512-byte alignment then Acero can use a per-plan pool. This covers the common case for spilling and dataset writing which is that we are partitioning prior to the write and so we are writing Acero-allocated buffers anyways.

Reporter: Weston Pace / @westonpace

Subtasks:

[x] https://github.com/apache/arrow/issues/33317

_{Note: This issue was originally created as ARROW-18115. Please see the migration documentation for further details.}

asfimport commented 2 years ago

Weston Pace / @westonpace: CC @pitrou @save-buffer @michalursa @marsupialtail @bkietz and maybe @rtpsw @icexelloss would be interested.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: How about always using unaligned loads?

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Also, it would have been nice to use xsimd as decided for Arrow rather than hand-code intrinsics.

asfimport commented 2 years ago

Sasha Krassovsky / @save-buffer: I agree, always using unaligned loads is probably even easier to implement and I've never seen a performance difference for aligned vs unaligned.

Configure alignment of a memory pool: that does seem like a much smaller code change than my previous PR.

Realigning buffers: I think it could get confusing if an input batch gets some of its buffers in the plan-owned memory pool and part of it is in the global memory pool. I'd be in favor of always reallocating and memcpying batches from external sources so that we can modify buffers in-place eventually. This probably wouldn't be a problem in the common case because the dataset reader would allocate these buffers from within the plan.

Always realign to 64 byte: At this point we should just always realign to 512 byte and have only one memory pool.

asfimport commented 2 years ago

Weston Pace / @westonpace: @save-buffer FYI, there was some further discussion on this topic in https://issues.apache.org/jira/browse/ARROW-17783 and it turned out that the problem wasn't really related to unaligned loads / SIMD and more just the fact that some algorithms rely on buffers being at least value-aligned to support things like reinterpret casting from uint8_t* to uint64_t* safely.

Always realign to 64 byte: At this point we should just always realign to 512 byte and have only one memory pool.

We could relax the constraint to "value-aligned" or maybe just 8-byte aligned. In that case I would expect this operation to be, in most cases, a no-op. One has to go out of their way to obtain a buffer that does not meet that alignment. Even with 64-byte alignment we are still limiting the performance hit to cases where the memory was allocated outside of arrow-c++.

Configure alignment of a memory pool: that does seem like a much smaller code change than my previous PR.

Would this still meet your needs? Would you be interested in updating your PR?

asfimport commented 2 years ago

Antoine Pitrou / @pitrou:

Would this still meet your needs? Would you be interested in updating your PR?

I'm not sure which PR it is about, but I think there's value in being able to pass alignment explicitly on allocation calls. "Configuring" suddenly means you have global state that might be changed in conflicting ways by various users of the memory pool.

asfimport commented 2 years ago

Weston Pace / @westonpace:

I'm not sure which PR it is about, but I think there's value in being able to pass alignment explicitly on allocation calls. "Configuring" suddenly means you have global state that might be changed in conflicting ways by various users of the memory pool.

Passing alignment on allocation calls is preferred for me. So if we want to do that then we should do that. I just thought a smaller change would be easier.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: IIRC there isn't much contention on the PR (at least the fundamentals). It's just that review bandwidth is scarce, and the release was higher priority.

asfimport commented 2 years ago

Sasha Krassovsky / @save-buffer: It would mostly meet my needs, but I'd still want the builders to be able to take a memory pool then so that I can pass one of those "aligned memory pools", so the whole system would be a bit less flexible. It would still work, but I agree that alignment on allocation calls is better (especially since the tedious code is already written and it would avoid having me rebase on a completely different PR :) )

apache / arrow

[C++] Acero buffer alignment #33313

Subtasks: