delta-io / delta-sharing

An open protocol for secure data sharing
https://delta.io/sharing
Apache License 2.0

Update delta-kernel to at least 0.4.0 to leverage a lazy `scan.execute` for large tables #602

Open BdeUtra opened 3 weeks ago

BdeUtra commented 3 weeks ago

Currently, delta-kernel loads the entire table into memory, which causes all sorts of problems with large tables. As of delta-kernel 0.4.0, `scan.execute` is lazy and only loads data as the iterator is consumed.

I have a fork of your Python library in which I don't load the entire dataset into a `pyarrow.Table`, but instead work on each `RecordBatch` separately to keep memory usage down. This is currently pointless, because the supported 0.2.x version of delta-kernel loads everything eagerly. Supporting 0.4.x would open up a lot of possibilities for processing large datasets.
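To illustrate why per-batch processing only pays off once the scan itself is lazy, here is a minimal, pure-Python sketch. It does not use the delta-sharing or delta-kernel API: `LoadTracker`, `eager_scan`, `lazy_scan`, and `sum_per_batch` are hypothetical names, and plain lists stand in for `RecordBatch` objects. It simply counts how many batches are held in memory at once under each strategy.

```python
from typing import Callable, Iterator, List, Tuple

Batch = List[int]  # hypothetical stand-in for a pyarrow.RecordBatch


class LoadTracker:
    """Counts batches that have been loaded but not yet released."""

    def __init__(self) -> None:
        self.live = 0
        self.peak = 0

    def load(self, i: int) -> Batch:
        self.live += 1
        self.peak = max(self.peak, self.live)
        return [i] * 4  # four rows, all holding the batch index

    def release(self) -> None:
        self.live -= 1


def eager_scan(tracker: LoadTracker, n_batches: int) -> Iterator[Batch]:
    # Eager: every batch is loaded before the first one is handed out
    # (analogous to materializing a full list of scan results up front).
    batches = [tracker.load(i) for i in range(n_batches)]
    return iter(batches)


def lazy_scan(tracker: LoadTracker, n_batches: int) -> Iterator[Batch]:
    # Lazy: a batch is loaded only when the consumer asks for the next item.
    for i in range(n_batches):
        yield tracker.load(i)


def sum_per_batch(
    scan: Callable[[LoadTracker, int], Iterator[Batch]], n_batches: int
) -> Tuple[int, int]:
    """Consume a scan one batch at a time; return (total, peak live batches)."""
    tracker = LoadTracker()
    total = 0
    for batch in scan(tracker, n_batches):
        total += sum(batch)
        tracker.release()  # the consumer is done with this batch
    return total, tracker.peak


if __name__ == "__main__":
    print(sum_per_batch(eager_scan, 10))  # (180, 10)
    print(sum_per_batch(lazy_scan, 10))   # (180, 1)
```

Both runs use the same per-batch consumer and produce the same total, but with the eager producer all 10 batches are resident before the first one is processed, while the lazy producer never holds more than one.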

Are there any short-term plans to support delta-kernel 0.4.x? I'm new to Rust and sadly couldn't get it working myself.

Relevant bit of the changelog:

`Scan`'s `execute(..)` method now returns a lazy iterator instead of materializing a `Vec<ScanResult>`

source: https://github.com/delta-incubator/delta-kernel-rs/blob/bd2ea9f2fa44d8bc559659e53d38374309ecf63a/CHANGELOG.md#v040-2024-10-23