apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Copy with compaction to a different device #43055

Open jorisvandenbossche opened 5 days ago

jorisvandenbossche commented 5 days ago

We have several issues about adding a "copy" utility that compacts arrays, i.e. truncates all sliced arrays, child arrays and buffers to just the part that is needed to represent the data (https://github.com/apache/arrow/issues/37878, https://github.com/apache/arrow/issues/30503, https://github.com/apache/arrow/issues/38806).

We also now have a utility to copy data to a different device (Array::CopyTo, and the same for RecordBatch). As with the issues above, one might want to copy only the required part rather than the full buffers: right now, copying an Array to a different device copies the underlying buffers as-is, so copying a sliced array copies the full array.
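To make "copy only the required part" concrete, here is a minimal sketch for a sliced string-like array. It deliberately does not use Arrow: `SlicedStrings` and `Compact` are hypothetical names invented for illustration, the buffers are plain standard containers, and validity bitmaps and nested child arrays are left out. The idea is that the compacted result (rebased offsets plus a truncated data buffer) is what one would then hand to the device copy, instead of the full parent buffers.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a sliced variable-length binary array (NOT an
// Arrow type). `offsets` and `data` are the full parent buffers, while
// `offset` and `length` select the logical slice.
struct SlicedStrings {
  std::vector<int32_t> offsets;  // parent offsets, size = parent length + 1
  std::string data;              // parent value bytes
  int64_t offset = 0;            // first logical element of the slice
  int64_t length = 0;            // number of logical elements in the slice
};

// Returns a copy that holds only what the slice references: the offsets are
// rebased to start at 0 and the data buffer is truncated to the used range.
SlicedStrings Compact(const SlicedStrings& in) {
  SlicedStrings out;
  out.offset = 0;
  out.length = in.length;

  // Byte range of the parent data buffer that the slice actually uses.
  const int32_t first = in.offsets[in.offset];
  const int32_t last = in.offsets[in.offset + in.length];

  // length + 1 offsets, each shifted so the first one becomes 0.
  out.offsets.reserve(static_cast<size_t>(in.length) + 1);
  for (int64_t i = 0; i <= in.length; ++i) {
    out.offsets.push_back(in.offsets[in.offset + i] - first);
  }

  out.data = in.data.substr(static_cast<size_t>(first),
                            static_cast<size_t>(last - first));
  return out;
}
```

For a parent holding "a", "bb", "ccc", "dddd" and a slice with `offset = 1`, `length = 2`, the compacted copy carries 3 offsets and 5 bytes of data rather than 5 offsets and 10 bytes.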

Somewhat related (and could potentially share code): we also want this functionality for writing non-CPU data to IPC: https://github.com/apache/arrow/issues/43029

felipecrv commented 5 days ago

This opens up another question: should Concatenate() handle non-CPU arrays? I think that would reduce the amount of duplication. Concatenating and slicing are not very different [1] and the logic for some types is quite involved (e.g. list-views).

[1] When concatenating, we try to take the least amount of data from each operand. Handling non-CPU arrays would force us to rethink utilities like list_util::RangeOfValuesUsed, as we would need code that can do that calculation without copying entire offsets buffers more than once.
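For readers unfamiliar with the calculation in question, the following is a rough sketch of what "range of values used" means for a list-view slice. It is not Arrow's list_util::RangeOfValuesUsed: `RangeOfValuesUsedSketch` is a hypothetical name, the offsets and sizes are plain CPU vectors, and the validity bitmap is ignored. It reads each offsets/sizes entry once, which is the property one would want to keep if those buffers first had to be fetched from another device.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical helper (NOT Arrow's implementation). For the list-views in
// [slice_offset, slice_offset + slice_length), returns {start, length} of the
// smallest contiguous range of the child values array that they reference.
// Empty views do not contribute; validity is ignored in this sketch.
std::pair<int32_t, int32_t> RangeOfValuesUsedSketch(
    const std::vector<int32_t>& offsets, const std::vector<int32_t>& sizes,
    int64_t slice_offset, int64_t slice_length) {
  int32_t min_start = std::numeric_limits<int32_t>::max();
  int32_t max_end = 0;
  bool any_non_empty = false;

  // Single pass: list-view offsets are unordered, so both the minimum start
  // and the maximum end have to be tracked explicitly.
  for (int64_t i = slice_offset; i < slice_offset + slice_length; ++i) {
    if (sizes[i] == 0) continue;
    any_non_empty = true;
    min_start = std::min(min_start, offsets[i]);
    max_end = std::max(max_end, offsets[i] + sizes[i]);
  }

  if (!any_non_empty) return {0, 0};
  return {min_start, max_end - min_start};
}
```

With that range in hand, a compacting copy or a concatenation would only need to transfer `length` child values starting at `start`, and would rebase each view's offset by subtracting `start`.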