Assembling columns from chunked Arrow data is rather difficult to do by hand, and it is a valid thing that somebody might want to do. This PR adds a "visitor" pattern that can be extended to build "column"s, which are currently just list()s. Before trimming this PR down to a manageable set of changes, I also implemented a visitor that concatenates data buffers for single-data-buffer types ( https://gist.github.com/paleolimbot/17263e38b5d97c770e44d33b11181eaf ), which will be needed for to_columns() to be used in any kind of serious way.
To support the "visitor" pattern, I moved some of the PyIterator-specific pieces into the PyIterator so that the visitor can re-use the relevant pieces of ArrayViewBaseIterator. This pattern also solves one of the problems I had when attempting a "repr" iterator, which is that I was trying to build something rather than iterate over it.
import nanoarrow as na
import pandas as pd
from nanoarrow import visitor
url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
array = na.ArrayStream.from_url(url).read_all()
# to_columns() doesn't (and won't) produce anything numpy or pandas-related
names, columns = visitor.to_columns(array)
# ...but it lets data frames be built rather compactly
pd.DataFrame(dict(zip(names, columns)))
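
For readers unfamiliar with the idea, here is a minimal pure-Python sketch of what "visiting" chunks to build a column can look like. It is not the implementation in this PR: the names ListBuilderVisitor, visit_chunk(), finish(), visit_chunks(), and chunks_to_columns() are hypothetical and chosen only for illustration, and plain Python lists stand in for Arrow chunks.

```python
from typing import Any, Iterable, List, Sequence, Tuple


class ListBuilderVisitor:
    """Hypothetical visitor that accumulates values from successive chunks."""

    def __init__(self) -> None:
        self._values: List[Any] = []

    def visit_chunk(self, chunk: Sequence[Any]) -> None:
        # Called once per chunk; the visitor decides how to store the values.
        self._values.extend(chunk)

    def finish(self) -> List[Any]:
        # Called after the last chunk; returns the assembled "column".
        return self._values


def visit_chunks(chunks: Iterable[Sequence[Any]], visitor: ListBuilderVisitor) -> List[Any]:
    # The driver owns the iteration; the visitor owns the result being built.
    for chunk in chunks:
        visitor.visit_chunk(chunk)
    return visitor.finish()


def chunks_to_columns(
    names: Sequence[str], chunked_columns: Sequence[Iterable[Sequence[Any]]]
) -> Tuple[List[str], List[List[Any]]]:
    # One fresh visitor per column, so each column is assembled independently.
    columns = [visit_chunks(chunks, ListBuilderVisitor()) for chunks in chunked_columns]
    return list(names), columns


# Two columns, each split across two chunks (made-up data).
names, columns = chunks_to_columns(
    ["sha", "n_files"],
    [[["a1", "b2"], ["c3"]], [[1, 2], [3]]],
)
```

The point of the split is that the driver only iterates while the visitor only builds, so a different visitor (for example, one that concatenates buffers instead of extending a list) could be swapped in without touching the iteration logic.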