Open ion-elgreco opened 6 months ago
Originally discussed in: https://github.com/delta-io/delta-rs/issues/1225
btw. had a quick look into the datafusion sinks, and I believe they may not be the best fit for us, considering the work delta needs to do on write. More specifically I had look if it would make sense to implement a FileFormat
for Delta...
The TableProvider
does however have more methods available now, that integrate into the framkework - they also did some really great work integrating with @wjones127's multi-part writer....
FYI I plan to add some first-party writer support to the parquet crate as part of https://github.com/apache/arrow-rs/issues/5524
@tustvold awesome!
@aersam this may be interesting to look out for
Description
The rust writer in it current state keeps a buffer instead of steaming to disk which causes the writer use quite some extra memory.
We need to address this performance issue.
@wjones127 mentioned what needs to happen here: https://delta-users.slack.com/archives/C013LCAEB98/p1700330750311529?thread_ts=1700325888.484149&cid=C013LCAEB98
" What we want to do instead is stream out to disk. Right now the writer is ArrowWriter : https://github.com/delta-io/delta-rs/blob/daa700eadaa2a6cc968d0b63cf4c5e7cfd65fc55/crates/deltalake-core/src/writer/record_batch.rs#L252
We should change it so it uses put_multipart and AsyncArrowWriter. That would make the type AsyncArrowWriter<Box<dyn AsyncWrite + Unpin + Send>>
What we want to do instead is combine
"
Another thing is to explore the Datafusion sink functionality as per suggestion of @roeap