ArroyoSystems / arroyo

Distributed stream processing engine in Rust
https://arroyo.dev
Apache License 2.0

Add support for Delta Lake table format as both source and sink #710

Open andrei-ionescu opened 3 months ago

andrei-ionescu commented 3 months ago

This can be implemented using the Delta-RS library.

I've also seen that the documentation lists a Delta Lake Sink connector (https://doc.arroyo.dev/connectors/delta), but I couldn't find it in this repository. Where can I find the Delta Lake Sink connector? If it's under another connector, should we make it a first-class citizen?

mwylde commented 3 months ago

Adding support for Delta as a source would be great! The Delta connector is implemented on top of the filesystem connector, since most of the complexity is in consistently writing the data to S3 (see https://www.arroyo.dev/blog/streaming-to-s3-is-hard), not in handling the Delta metadata.

Most of the delta code is here: https://github.com/ArroyoSystems/arroyo/blob/master/crates/arroyo-connectors/src/filesystem/sink/delta.rs. It's integrated into the filesystem connector's two-phase commit handler in https://github.com/ArroyoSystems/arroyo/blob/master/crates/arroyo-connectors/src/filesystem/sink/mod.rs.
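
For readers unfamiliar with the pattern, here is a minimal sketch of how a two-phase commit can gate updates to a Delta-style log. It is not the code in the files linked above: `TwoPhaseSink`, `pre_commit`, and `commit` are hypothetical names, and in-memory collections stand in for S3 and the table's transaction log.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for an object store plus a Delta-style log.
/// This is an illustration only, not Arroyo's implementation.
#[derive(Default)]
struct TwoPhaseSink {
    /// Checkpoint epoch -> data files that are written but not yet visible to readers.
    staged: BTreeMap<u64, Vec<String>>,
    /// Committed log: one entry per table version, listing the files it adds.
    log: Vec<Vec<String>>,
}

impl TwoPhaseSink {
    /// Phase 1: record the files written for this epoch.
    /// Nothing is published to the log yet.
    fn pre_commit(&mut self, epoch: u64, files: Vec<String>) {
        self.staged.entry(epoch).or_default().extend(files);
    }

    /// Phase 2: publish all staged files for the epoch as a single log entry.
    /// Returns the new version number, or None if there is nothing staged,
    /// which makes a replayed commit for the same epoch a no-op.
    fn commit(&mut self, epoch: u64) -> Option<usize> {
        let files = self.staged.remove(&epoch)?;
        if files.is_empty() {
            return None;
        }
        self.log.push(files);
        Some(self.log.len() - 1)
    }
}
```

In this sketch, readers only ever see files that appear in `log`, so a failure between the two phases leaves staged files invisible instead of producing a partially written table version.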