apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Concatenating a single array is a compaction utility #37878

Open bkietz opened 9 months ago

bkietz commented 9 months ago

Describe the enhancement requested

Concatenating a single array can be used as a memory compaction utility (is this recommended anywhere?), since it produces a deep copy of the array. This should be better documented and tested, because the shortcut of passing a single array through unchanged is a tempting optimization. Alternatively, Array::DeepCopy might be provided.

See also discussion in PRs:

Component(s)

C++

lidavidm commented 9 months ago

IIRC, it's been recommended in mailing list posts whenever the question of copying an array comes up.

It's also used in Spark: https://github.com/apache/spark/blob/60d02b444e2225b3afbe4955dabbea505e9f769c/python/pyspark/sql/connect/client/core.py#L1287

jorisvandenbossche commented 9 months ago

Some previous discussion on this topic (this issue could be considered as a duplicate of that one if we think we want to add an actual utility, instead of only documenting the concat trick):

The concat_arrays workaround is also mentioned for the bug where pickling serializes the full buffers instead of only the sliced buffers: https://github.com/apache/arrow/issues/26685

bkietz commented 9 months ago

Well, if we haven't already documented concatenation for this purpose, I'd prefer to provide an explicit deep copy utility.

lidavidm commented 9 months ago

Even if it's not formally documented, it's used by other libraries and has been recommended to users before. I don't think we can change it at this point.

bkietz commented 9 months ago

Sure, but to me it still seems better to alias concatenation of a single array to deep_copy and document that.