apache / arrow-julia

Official Julia implementation of Apache Arrow
https://arrow.apache.org/julia/

Allow constructing an Arrow stream/file from columnar data with no column names #282

Open NHDaly opened 2 years ago

NHDaly commented 2 years ago

We have a data source (Relations from our database engine at RelationalAI) that has columnar data, but without column names. (We represent a Relation as a Set of Tuples, e.g. movie_title relates movie IDs to Titles, so the positions are meaningful but they do not have names.)

We would like to encode this in Arrow as essentially a Vector of columns. In JSON, we would encode this as:

[
    [1001, 2232, 3582, 4030],
    ["The Matrix", "50 First Dates", "I Am Legend", "The Notebook"]
]
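
For illustration, the same positional layout can be written in Julia as a plain vector of column vectors. This is only a sketch of the data shape; the variable names `columns` and `rows` are hypothetical and not part of Arrow.jl.

```julia
# Unnamed, positional columns: the i-th element is the i-th column.
columns = AbstractVector[
    [1001, 2232, 3582, 4030],
    ["The Matrix", "50 First Dates", "I Am Legend", "The Notebook"],
]

# Zipping the columns together recovers the tuples of the relation.
rows = collect(zip(columns...))
```

Each element of `rows` is one tuple of the relation, identified only by position.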

From what I can tell, this is supported by the Arrow spec, but isn't currently supported by the Arrow.jl package?

This is the understanding my colleague and I have come to of the current situation:

Can we work to expose this ability through the Arrow.jl package as well, in the code to construct an Arrow stream from a column-wise data source?

Thanks!

CC: @bachdavi

quinnj commented 2 years ago

Hmmm, yeah, this shouldn't be too bad to support. I think the easiest approach would be to hook into the Tables.jl interface for this. We could create a pseudo-table type like:

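using Tables

# Pseudo-table wrapping a positional collection of columns (no names).
struct ArrayOfArraysTable{T}
    source::T
end
Tables.columns(x::ArrayOfArraysTable) = x
# Positional access: the i-th column is the i-th element of `source`.
Tables.getcolumn(x::ArrayOfArraysTable, i::Int) = x.source[i]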

So that should mostly work on the Tables.jl side of things in terms of the data. For writing the schema message, Tables.schema(x::ArrayOfArraysTable) will return nothing via the fallback, so I think we then just need another overload, makeschema(b, sch::Nothing, columns), that creates the schema message with no column names.
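
A minimal sketch of how the pseudo-table could be exercised against the example data from the issue, assuming Tables.jl is installed. The `Tables.istable` and `Tables.columnaccess` declarations are additions not mentioned above, and the variable names are hypothetical. Column names are deliberately left undefined, so this only demonstrates positional access and the schema fallback; it does not show Arrow.jl writing such a table, which would still need the makeschema overload.

```julia
using Tables

# Pseudo-table wrapping a positional collection of columns (no names).
struct ArrayOfArraysTable{T}
    source::T
end

# Assumption: declare the type as a column-accessible table.
Tables.istable(::Type{<:ArrayOfArraysTable}) = true
Tables.columnaccess(::Type{<:ArrayOfArraysTable}) = true
Tables.columns(x::ArrayOfArraysTable) = x

# Positional access: the i-th column is the i-th element of `source`.
Tables.getcolumn(x::ArrayOfArraysTable, i::Int) = x.source[i]

table = ArrayOfArraysTable(AbstractVector[
    [1001, 2232, 3582, 4030],
    ["The Matrix", "50 First Dates", "I Am Legend", "The Notebook"],
])

cols = Tables.columns(table)
ids = Tables.getcolumn(cols, 1)     # first column, by position
titles = Tables.getcolumn(cols, 2)  # second column, by position

# No schema method is defined, so the Tables.jl fallback applies.
sch = Tables.schema(table)
```

Since `sch` comes from the fallback, a writer that dispatches on the schema would hit the `Nothing` case, which is where the overload proposed above would apply.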