Point72 / csp

csp is a high-performance reactive stream processing library, written in C++ and Python
https://github.com/Point72/csp/wiki
Apache License 2.0

Proposed ArrowRecordBatchAdapter to replace ParquetReader with `read_from_memory_tables=True` #297


ptomecek commented 3 months ago

**Is your feature request related to a problem? Please describe.** The ability to stream a sequence of in-memory arrow tables into csp is very powerful, but it is currently somewhat hidden inside the ParquetReader implementation (enabled by setting `read_from_memory_tables=True`), which makes it hard for users to discover. It can also be tricky to use correctly, since many of the ParquetReader arguments are ignored in this mode.
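
For reference, a minimal sketch of the current route. It assumes that when `read_from_memory_tables=True`, the first argument of `ParquetReader` can be a generator of pyarrow tables instead of a filename, and that the usual `time_column`/`subscribe_all` pattern still applies; those details are not spelled out in this issue, so treat the argument usage below as an assumption rather than a verified recipe:

```python
from datetime import datetime, timedelta

import pyarrow as pa

import csp
from csp.adapters.parquet import ParquetReader


class Trade(csp.Struct):
    price: float
    size: int


def table_source():
    # Assumption: a generator yielding in-memory pyarrow tables is accepted
    # in place of a filename when read_from_memory_tables=True.
    start = datetime(2024, 1, 1, 9, 30)
    yield pa.table(
        {
            "TIMESTAMP": [start + timedelta(seconds=i) for i in range(3)],
            "price": [100.0, 100.5, 101.0],
            "size": [10, 20, 30],
        }
    )


@csp.graph
def trades_graph():
    reader = ParquetReader(
        table_source,
        time_column="TIMESTAMP",
        read_from_memory_tables=True,  # the hidden flag this issue wants to surface
    )
    trades = reader.subscribe_all(Trade)
    csp.print("trades", trades)


csp.run(trades_graph, starttime=datetime(2024, 1, 1), endtime=datetime(2024, 1, 2))
```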

**Describe the solution you'd like** A dedicated pull adapter that consumes a sequence of arrow record batches (or tables) as efficiently as possible, i.e. an `ArrowRecordBatchAdapter`. An initial implementation could simply delegate to the existing ParquetReader.

Ideally, the solution should also be zero-copy with respect to the underlying arrow tables (I am not certain that is the case at the moment).
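
As a rough illustration of intent only: the module path, class name, and constructor arguments below are invented for this discussion and do not exist in csp today, so the adapter calls are left commented out; only the batch-producing pyarrow code is real.

```python
import pyarrow as pa

# Hypothetical import; no such module or class exists in csp yet.
# from csp.adapters.arrow import ArrowRecordBatchAdapter

# Any source of record batches would do; here a trivial in-memory table is split up.
table = pa.table(
    {
        "TIMESTAMP": pa.array([0, 1, 2], type=pa.timestamp("us")),
        "price": [100.0, 100.5, 101.0],
    }
)
batches = table.to_batches(max_chunksize=2)  # list of pyarrow.RecordBatch

# Hypothetical usage, mirroring the existing ParquetReader subscribe pattern:
# adapter = ArrowRecordBatchAdapter(batches, time_column="TIMESTAMP")
# prices = adapter.subscribe_all(SomeStruct)
```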

**Describe alternatives you've considered** Continuing to use the parquet reader, which is confusing and non-intuitive, especially when the arrow tables come from sources other than parquet files.

**Additional context** Other features to consider:

Note that since a number of other tools can efficiently and flexibly produce sequences of arrow tables from different sources (polars, duckdb, arrow-odbc, ray), a generic `ArrowRecordBatchAdapter` would enable a much wider range of historical data connections with very little additional effort.
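
For instance, sketches of producing such sequences with duckdb and polars (these calls are standard public APIs of those libraries; how they would be wired into the proposed adapter is left out, since that API does not exist yet):

```python
import duckdb
import polars as pl

# DuckDB: stream a query result as pyarrow record batches.
con = duckdb.connect()
con.execute("SELECT range AS i, range * 2 AS double_i FROM range(10)")
reader = con.fetch_record_batch(4)  # pyarrow.RecordBatchReader, 4 rows per batch
duckdb_batches = list(reader)

# Polars: convert a DataFrame to a pyarrow Table and split it into batches.
df = pl.DataFrame({"i": list(range(10)), "double_i": [2 * i for i in range(10)]})
polars_batches = df.to_arrow().to_batches()
```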

ptomecek commented 3 weeks ago

The need for this has also arisen on the output side, i.e. an output adapter that yields in-memory arrow tables according to some trigger function.
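
Until such an adapter exists, here is a user-level sketch of the idea using only standard csp node primitives: buffer incoming ticks and emit an in-memory pyarrow table whenever a trigger ticks. This is illustrative only (the `object` output type, column names, and trigger cadence are arbitrary choices), not a proposal for the adapter's API:

```python
from datetime import datetime, timedelta

import pyarrow as pa

import csp


@csp.node
def arrow_table_on_trigger(x: csp.ts[float], trigger: csp.ts[bool]) -> csp.ts[object]:
    # Buffer incoming ticks; emit a pyarrow Table whenever the trigger ticks.
    with csp.state():
        s_times = []
        s_values = []

    if csp.ticked(x):
        s_times.append(csp.now())
        s_values.append(x)

    if csp.ticked(trigger) and s_values:
        table = pa.table({"time": s_times, "value": s_values})
        s_times = []
        s_values = []
        return table


@csp.graph
def g():
    data = csp.curve(float, [(timedelta(seconds=i), float(i)) for i in range(5)])
    trigger = csp.timer(timedelta(seconds=2), True)
    csp.print("table", arrow_table_on_trigger(data, trigger))


csp.run(g, starttime=datetime(2024, 1, 1), endtime=datetime(2024, 1, 1, 0, 1))
```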