Open aersam opened 8 months ago
@aersam correct, it's not the efficient way to do that :) Will already mentioned an improvement over that, which I've logged here: https://github.com/delta-io/delta-rs/issues/1984. No one is working on it yet, so feel free to pick it up if you want :D
I can pick it up, but I'd rather do it on the write.rs operation
@aersam that's fine!
Ok, I see partitioning makes this quite complicated 🙂 And DataFusion's MemoryExec is not helpful, so this might take some time
I'll just implement it using chunks. This is not perfect, but should work and is not as invasive as rewriting the whole partitioning
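For readers following along, here is a minimal sketch of what "using chunks" could look like in general. It is not the delta-rs implementation: `Batch` is a hypothetical stand-in for an Arrow RecordBatch, and `write_chunk` is a hypothetical callback standing in for whatever actually writes the data. The only point is that at most `chunk_size` batches are held in memory at a time.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List

# Hypothetical stand-in for an Arrow RecordBatch: a list of row dicts.
Batch = List[dict]


def chunked(batches: Iterable[Batch], chunk_size: int) -> Iterator[List[Batch]]:
    """Yield lists of at most chunk_size batches without materializing the whole input."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    it = iter(batches)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def write_in_chunks(
    batches: Iterable[Batch],
    chunk_size: int,
    write_chunk: Callable[[List[Batch]], None],
) -> int:
    """Pass each chunk to write_chunk (hypothetical writer) and return the rows written."""
    total_rows = 0
    for chunk in chunked(batches, chunk_size):
        write_chunk(chunk)
        total_rows += sum(len(batch) for batch in chunk)
    return total_rows
```

As noted above, this is not perfect: each chunk is still fully in memory, and it says nothing about how partitioning would interact with chunk boundaries.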
Description
In python/lib.rs, the first thing that happens in write_to_deltalake is to collect the batches into a Vec. This loads all RecordBatches into RAM, no? That doesn't seem like a good thing to me. I think the main reason is that write.rs tries to get the schema from the batches, but the schema would already have been known in Python anyway, so why not pass it directly?
Use Case
I don't want to waste resources ;)
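To illustrate the difference described above, here is a rough sketch contrasting the two patterns. It does not use the delta-rs or Arrow APIs: `Schema` and `Batch` are hypothetical stand-ins, and both functions are made up for illustration. The first collects everything just to read the schema; the second takes the schema from the caller and forwards batches lazily.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical stand-ins: a schema maps column names to Python types,
# and a batch is a list of row dicts.
Schema = Dict[str, type]
Batch = List[dict]


def infer_schema_eagerly(batches: Iterable[Batch]) -> Tuple[Schema, List[Batch]]:
    """Collect every batch, then derive the schema from the first row."""
    collected = list(batches)  # the whole input is now in memory
    if not collected or not collected[0]:
        raise ValueError("cannot infer a schema from empty input")
    first_row = collected[0][0]
    schema = {name: type(value) for name, value in first_row.items()}
    return schema, collected


def stream_with_schema(schema: Schema, batches: Iterable[Batch]) -> Iterator[Batch]:
    """Take the schema from the caller and forward batches one at a time."""
    expected = set(schema)
    for batch in batches:
        for row in batch:
            if set(row) != expected:
                raise ValueError(f"row columns {sorted(row)} do not match the schema")
        yield batch
```

With the second variant nothing is pulled from the source until the consumer asks for the next batch, so memory use is bounded by one batch rather than by the whole input.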
Related Issue(s)