Allow constructing an Arrow stream/file from columnar data with no column names

We have a data source (Relations from our database engine at RelationalAI) that has columnar data, but without column names. (We represent a Relation as a Set of Tuples, e.g. movie_title relates movie IDs to titles, so the positions are meaningful but they do not have names.)

We would like to encode this in Arrow as essentially a Vector of columns. In JSON, we would encode this as:

[
    [1001, 2232, 3582, 4030],
    ["The Matrix", "50 First Dates", "I Am Legend", "The Notebook"]
]
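
To make the correspondence between the two views concrete, here is a minimal Python sketch (standard library only) that turns a relation given as a set of tuples into the positional, column-wise layout shown above. The helper name `relation_to_columns` and the sample relation are illustrative assumptions, not part of Arrow.jl or of our engine. Rows are sorted only so that the result is deterministic.

```python
import json


def relation_to_columns(relation):
    """Transpose a set of equal-length tuples into a list of columns.

    Columns are identified purely by position; no names are attached.
    This is a hypothetical helper written for illustration only.
    """
    rows = sorted(relation)  # sets are unordered; sort for a stable result
    if not rows:
        return []
    arity = len(rows[0])
    if any(len(row) != arity for row in rows):
        raise ValueError("all tuples in a relation must have the same arity")
    # zip(*rows) groups the i-th element of every tuple, i.e. column i
    return [list(column) for column in zip(*rows)]


# Sample relation: movie_title relates movie IDs to titles.
movie_title = {
    (1001, "The Matrix"),
    (2232, "50 First Dates"),
    (3582, "I Am Legend"),
    (4030, "The Notebook"),
}

columns = relation_to_columns(movie_title)
print(json.dumps(columns))
```

Nothing in this layout carries a column name. Only the order of the inner lists is meaningful, and that is the property we would like to preserve in the Arrow encoding.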

As far as I can tell, the Arrow spec supports this, but the Arrow.jl package currently does not. Is that correct?

This is the understanding my colleague and I have come to of the current situation:

Looking at the Arrow spec, the RecordBatch messages, which contain the actual data, are preceded by a Schema message that defines their logical schema. The Schema contains an array of Field entries that define the columns of the RecordBatch in order. The name property of a Field appears to be optional, which would mean we could serialize columns without a name.
- https://github.com/apache/arrow/blob/56d060ca197352f575edced64e6a1fbc9331b336/format/Schema.fbs#L463
The fields in the Schema message are flattened, see: https://arrow.apache.org/docs/format/Columnar.html#recordbatch-message
Arrow.jl does support writing unnamed columns, but only if we supply the data row-wise. When the result is loaded, the Arrow schema contains column names such as Symbol("1"), which are a bit cumbersome to work with in Julia.

Could Arrow.jl expose this ability as well, in the code path that constructs an Arrow stream from a column-wise data source?

Thanks!

CC: @bachdavi

apache/arrow-julia#282