apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

feat(python): Add visitor pattern + builders for column sequences #454

Closed paleolimbot closed 4 months ago

paleolimbot commented 5 months ago

Assembling columns from chunked data is rather difficult, and it is a valid thing that somebody might want to do with Arrow data. This PR adds a "visitor" pattern that can be extended to build "columns", which are currently just list()s. Before trimming this PR down to a manageable set of changes, I also implemented a visitor that concatenates data buffers for types with a single data buffer ( https://gist.github.com/paleolimbot/17263e38b5d97c770e44d33b11181eaf ), which will be needed before to_columns() can be used in any kind of serious way.
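As a rough illustration of the buffer-concatenating idea (this is not the code from the gist or from this PR), the sketch below appends the raw bytes of each chunk's data buffer to one growing buffer and reinterprets the result at the end. The class name `BufferConcatenator` and its methods are hypothetical, and it assumes fixed-width values with no nulls and no offsets, using only the standard library.

```python
from array import array


class BufferConcatenator:
    """Hypothetical sketch: concatenate fixed-width data buffers from chunks."""

    def __init__(self, typecode):
        # typecode is a struct-style format character such as "i" or "d"
        self._typecode = typecode
        self._out = bytearray()

    def visit_chunk(self, buffer):
        # View the chunk as raw bytes and append it without creating
        # one Python object per element
        self._out += memoryview(buffer).cast("B")

    def finish(self):
        # Reinterpret the concatenated bytes as the original element type
        return memoryview(bytes(self._out)).cast(self._typecode)


# Two "chunks" of 32-bit-style integers standing in for Arrow data buffers
chunks = [array("i", [1, 2, 3]), array("i", [4, 5])]

builder = BufferConcatenator("i")
for chunk in chunks:
    builder.visit_chunk(chunk)

result = builder.finish()
print(result.tolist())
```

The point of working on buffers rather than lists is that the per-element cost stays out of Python, which is presumably why something like this would matter for larger inputs.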

To support the "visitor" pattern, I moved some of the PyIterator-specific pieces into the PyIterator so that the visitor can re-use the relevant pieces of ArrayViewBaseIterator. This pattern also solves one of the problems I had when attempting a "repr" iterator, which is that I was trying to build something rather than iterate over it.
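To make the "build something rather than iterate over it" distinction concrete, here is a minimal pure-Python sketch of the general shape, assuming chunks are plain dicts of lists. The names `ChunkVisitor`, `ListColumnsBuilder`, `begin()`, `visit_chunk()`, and `finish()` are hypothetical and are not nanoarrow's classes or methods.

```python
class ChunkVisitor:
    """Hypothetical base class: drives the visit; subclasses build a result."""

    def visit(self, chunks):
        self.begin()
        for chunk in chunks:
            self.visit_chunk(chunk)
        return self.finish()

    def begin(self):
        pass

    def visit_chunk(self, chunk):
        raise NotImplementedError

    def finish(self):
        raise NotImplementedError


class ListColumnsBuilder(ChunkVisitor):
    """Builds one Python list per column from chunks of {name: values}."""

    def __init__(self, names):
        self._names = list(names)

    def begin(self):
        # One empty list per column; state is reset on every visit
        self._columns = [[] for _ in self._names]

    def visit_chunk(self, chunk):
        # Append this chunk's values to the matching column
        for name, column in zip(self._names, self._columns):
            column.extend(chunk[name])

    def finish(self):
        return self._names, self._columns


# Plain dicts stand in for chunks of a struct array
chunks = [
    {"id": [1, 2], "label": ["a", "b"]},
    {"id": [3], "label": ["c"]},
]
names, columns = ListColumnsBuilder(["id", "label"]).visit(chunks)
print(names, columns)
```

The caller only ever sees the finished result, so a different builder (for example, one that concatenates buffers) can be swapped in without changing the loop that walks the chunks.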

import nanoarrow as na
import pandas as pd
from nanoarrow import visitor

url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
array = na.ArrayStream.from_url(url).read_all()

# to_columns() doesn't (and won't) produce anything numpy or pandas-related
names, columns = visitor.to_columns(array)

# ...but it lets data frames be built rather compactly
pd.DataFrame({k: v for k, v in zip(names, columns)})