Currently, delta-kernel loads the entire table into memory, which becomes a problem once the data is large enough.
As of delta-kernel 0.4.0, scan.execute is lazy and only loads data as the iterator is consumed.
I have a fork of your Python library in which I don't load the entire dataset into a pyarrow.Table, but instead work on each RecordBatch separately for memory efficiency. That currently brings no benefit, because the supported delta-kernel 0.2.x still loads everything eagerly.
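To make the difference concrete, here is a minimal pure-Python sketch of eager versus lazy batch production. Everything in it is hypothetical: FakeScan, execute_eager, execute_lazy and first_match are illustration names, not part of delta-kernel or of this library, and a plain list of ints stands in for a pyarrow.RecordBatch.

```python
from typing import Callable, Iterable, Iterator, List, Optional

# Stand-in for a pyarrow.RecordBatch: here just a list of row values.
Batch = List[int]


class FakeScan:
    """Hypothetical scan that counts how many batches it had to read."""

    def __init__(self, num_batches: int, rows_per_batch: int) -> None:
        self.num_batches = num_batches
        self.rows_per_batch = rows_per_batch
        self.batches_produced = 0

    def _read_batch(self, index: int) -> Batch:
        # Pretend this is the expensive step (reading and decoding a file).
        self.batches_produced += 1
        start = index * self.rows_per_batch
        return list(range(start, start + self.rows_per_batch))

    def execute_eager(self) -> List[Batch]:
        # Eager behaviour: every batch is read and held in memory
        # before the caller sees the first one.
        return [self._read_batch(i) for i in range(self.num_batches)]

    def execute_lazy(self) -> Iterator[Batch]:
        # Lazy behaviour: a batch is read only when the caller asks for it.
        for i in range(self.num_batches):
            yield self._read_batch(i)


def first_match(
    batches: Iterable[Batch], predicate: Callable[[int], bool]
) -> Optional[int]:
    """Return the first row satisfying predicate, stopping as early as possible."""
    for batch in batches:
        for row in batch:
            if predicate(row):
                return row
    return None


if __name__ == "__main__":
    eager = FakeScan(num_batches=100, rows_per_batch=10)
    first_match(eager.execute_eager(), lambda row: row == 5)
    print("eager batches read:", eager.batches_produced)

    lazy = FakeScan(num_batches=100, rows_per_batch=10)
    first_match(lazy.execute_lazy(), lambda row: row == 5)
    print("lazy batches read:", lazy.batches_produced)
```

With the eager variant, per-batch processing on the Python side cannot reduce memory use, because all batches already exist before the first one is handed over. With the lazy variant, only the batch currently being processed needs to be alive.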
Supporting 0.4.x would open up a lot of possibilities for large data processing.
Are there any short-term plans to support delta-kernel 0.4.x?
I'm new to Rust and sadly couldn't get it working myself.
Relevant bit of the changelog:
Scan's execute(..) method now returns a lazy iterator instead of materializing a Vec<ScanResult>
source: https://github.com/delta-incubator/delta-kernel-rs/blob/bd2ea9f2fa44d8bc559659e53d38374309ecf63a/CHANGELOG.md#v040-2024-10-23