delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
1.97k stars 365 forks source link

Use async writer + multipart + explore Datafusion sink #1984

Open ion-elgreco opened 6 months ago

ion-elgreco commented 6 months ago

Description

The rust writer in it current state keeps a buffer instead of steaming to disk which causes the writer use quite some extra memory.

We need to address this performance issue.

@wjones127 mentioned what needs to happen here: https://delta-users.slack.com/archives/C013LCAEB98/p1700330750311529?thread_ts=1700325888.484149&cid=C013LCAEB98

" What we want to do instead is stream out to disk. Right now the writer is ArrowWriter : https://github.com/delta-io/delta-rs/blob/daa700eadaa2a6cc968d0b63cf4c5e7cfd65fc55/crates/deltalake-core/src/writer/record_batch.rs#L252 We should change it so it uses put_multipart and AsyncArrowWriter. That would make the type AsyncArrowWriter<Box<dyn AsyncWrite + Unpin + Send>> What we want to do instead is combine "

Another thing is to explore the Datafusion sink functionality as per suggestion of @roeap

wjones127 commented 6 months ago

Originally discussed in: https://github.com/delta-io/delta-rs/issues/1225

roeap commented 6 months ago

btw. had a quick look into the datafusion sinks, and I believe they may not be the best fit for us, considering the work delta needs to do on write. More specifically I had look if it would make sense to implement a FileFormat for Delta...

The TableProvider does however have more methods available now, that integrate into the framkework - they also did some really great work integrating with @wjones127's multi-part writer....

tustvold commented 3 months ago

FYI I plan to add some first-party writer support to the parquet crate as part of https://github.com/apache/arrow-rs/issues/5524

ion-elgreco commented 3 months ago

@tustvold awesome!

@aersam this may be interesting to look out for