ArroyoSystems / arroyo

Distributed stream processing engine in Rust
https://arroyo.dev
Apache License 2.0
3.82k stars 223 forks source link

Overhaul DataFrame plannning to use DataFusion Extensions. #542

Closed jacksonrnewhouse closed 9 months ago

jacksonrnewhouse commented 9 months ago

This gets rid of the intermediate LogicalPlanExtension graph in favor of rewriting the logical plan, making more use of DataFusion's Extensions. In particular, I've added extensions for aggregations, joins, key calculations, and sources. There are also more synthetic extensions that are there to clarify the physical boundaries and simulate record batch changes that will be executed by the operators. For instance, the tumbling aggregating operator manually inserts the timestamp column after aggregations are calculated, and this is represented by the TimestampAppendExtension.