Is your feature request related to a problem? Please describe.
The ability to stream a sequence of in-memory arrow tables into csp is very powerful, but it is currently a bit hidden within the ParquetReader implementation (by setting read_from_memory_tables=True), making it hard for users to find. It can also be a bit tricky to use properly, as many of the arguments to the parquet reader are ignored in this mode.
Describe the solution you'd like
A dedicated pull adapter that consumes a sequence of arrow record batches (or tables) as efficiently as possible, i.e. ArrowRecordBatchAdapter.
An initial implementation could simply delegate to the existing ParquetReader.
Also, the solution should ideally be zero-copy on the underlying arrow tables (I am not positive this is the case at the moment).
Describe alternatives you've considered
Continuing to use the parquet reader is confusing and non-intuitive, especially for cases where the arrow tables come from other sources.
Additional context
Other features to consider
Being able to convert arrow lists to numpy arrays (like the current parquet reader)
Being able to convert arrow structs to csp structs
An option such that, at each timestamp, it ticks an arrow record batch containing all the records with that timestamp (for cases where we want to minimize copy overhead and operate on arrow types directly). This is essentially a "group-by" on the timestamp column.
Note that since there are a number of other tools that can efficiently and flexibly produce sequences of arrow tables from different sources (polars, duckdb, arrow-odbc, ray), having a generic ArrowRecordBatchAdapter will allow an even greater number of historical data connections with very little additional effort.
The need has also arisen for this on the output side, i.e. an output adapter that yields in-memory arrow tables (i.e. according to some trigger function).