apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Format] Physical representation of columnar format not well documented #39569

Open rongcuid opened 7 months ago

rongcuid commented 7 months ago

Describe the enhancement requested

Currently, the columnar format is documented only on this page: https://arrow.apache.org/docs/format/Columnar.html. However, when I tried to actually implement the format, I found the physical representation underdocumented.

In particular, the encoding of primitive types is unclear. The only information given is a single example int32 layout; no layouts are shown for the other types, so their representation is left unclear. How are booleans represented, for example? Do implementations choose which representation they use? I suppose that's not the case, as it would defeat Arrow's goal.
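For reference, here is a rough sketch of the int32 example as I understand it from that page: a validity bitmap using least-significant-bit numbering, plus a buffer of little-endian 4-byte values. The helper names `build_int32_layout` and `pad` are hypothetical, the zero written into null slots is my own choice (the page leaves those bytes unspecified), and the 8-byte padding is the minimum I read from the page, not something taken from an actual implementation.

```python
import struct

def build_int32_layout(values):
    """Return (validity_bitmap, data_buffer) for a list of ints or None.

    Hypothetical helper: a sketch of the documented int32 example,
    not code from any Arrow implementation.
    """
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            # Least-significant-bit numbering: slot i lives in byte i // 8, bit i % 8.
            bitmap[i // 8] |= 1 << (i % 8)
    # Null slots still occupy 4 bytes; zero is an arbitrary filler chosen here.
    data = b"".join(struct.pack("<i", 0 if v is None else v) for v in values)
    return bytes(bitmap), data

def pad(buf, alignment=8):
    """Pad a buffer with zero bytes up to a multiple of `alignment` (assumed 8)."""
    return buf + b"\x00" * (-len(buf) % alignment)

validity, data = build_int32_layout([1, None, 2, 4, 8])
print(format(validity[0], "08b"))    # bit pattern of the validity byte
print(struct.unpack("<5i", data))    # the five 4-byte slots
print(len(pad(validity)), len(pad(data)))
```

Even this much took some guesswork on my part, which is exactly why I would like the same kind of worked example for every other type.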

I was pointed to https://github.com/apache/arrow/blob/main/format/Schema.fbs for reference. However, as far as I understand, that specification covers only the IPC schema. It specifies type information, but when it comes to physical representation, there is only struct Buffer with an offset and a length.
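To illustrate how little that struct says on its own, here is a minimal sketch, assuming each Buffer is just an (offset, length) pair pointing into a message body. The `Buffer` class, the `slice_buffers` helper, and the example body bytes are all hypothetical and only mirror the two fields from Schema.fbs.

```python
from typing import List, NamedTuple

class Buffer(NamedTuple):
    """Mirrors the two fields of struct Buffer in Schema.fbs (illustrative only)."""
    offset: int  # start of the buffer, relative to the start of the body
    length: int  # number of bytes in the buffer, excluding trailing padding

def slice_buffers(body: bytes, buffers: List[Buffer]) -> List[bytes]:
    """Cut the raw bytes of each buffer out of the body (hypothetical helper)."""
    return [body[b.offset:b.offset + b.length] for b in buffers]

# Made-up body: 1 validity byte padded to 8 bytes, then five 4-byte slots.
body = bytes([0b00011101]) + b"\x00" * 7 + bytes(range(20))
validity_bytes, value_bytes = slice_buffers(body, [Buffer(0, 1), Buffer(8, 20)])
print(len(validity_bytes), len(value_bytes))
```

The offsets and lengths locate the bytes, but nothing in the struct says how to interpret them; that is the part I could not find documented per type.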

I would like clear documentation of the memory layout of every type supported by Arrow. An example of such a specification is CTF, which provides not only the layouts of all types, but also side-by-side examples of schema, layout, and values. Similar documentation would be immensely helpful for Arrow, especially if it showed the layouts of the various array types.

Component(s)

Format

AlenkaF commented 4 months ago

Thank you for creating the issue @rongcuid!

I am attempting to add a general introductory page to the documentation that would list all the physical layouts with diagrams and basic explanations here: https://github.com/apache/arrow/pull/41593. Reviews welcome!