delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
2.35k stars 414 forks source link

Do not load full source into RAM on write_to_deltalake #2255

Open aersam opened 8 months ago

aersam commented 8 months ago

Description

In python/lib.rs, the first thing that happens on write_to_deltalake is to collect to batches to a Vec. This loads all RecordBatches into RAM, no? This seems like not a good thing to me. I think the main reason is that write.rs tries to get the schema from the batches, but the schema would have been known in python anyway, so why not pass it directly?

Use Case I don't want to waste resources ;)

Related Issue(s)

ion-elgreco commented 8 months ago

@aersam correct, it's not the efficient way to do that :) Will already mentioned an improvement over that, which I've logged here, no one is working on that yet, so if you want to pick it up feel free :D https://github.com/delta-io/delta-rs/issues/1984

aersam commented 8 months ago

I can pick it up, but I'd rather do it on the write.rs operation

ion-elgreco commented 8 months ago

@aersam that's fine!

aersam commented 8 months ago

Ok, I see partioning makes this quite complicated 🙂 And MemoryExec of DataFusion is not helpful, so might take some time

aersam commented 8 months ago

I'll just implement it using chunks. This is not perfect, but should work and is not as invasive as rewriting the whole partitioning