delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
2.35k stars 414 forks source link

Python - Leverage _delta_log Statistics to build pyarrow dataset #3014

Open Matthieusalor opened 6 days ago

Matthieusalor commented 6 days ago

Description

One of the big advantage of Delta is that we have statistics at file level, therefore, we know which file might hold data for a given predicate, however, from my understanding, this is not leveraged at all at the python/pyarrow/pandas level. Predicates are being pushed at the `to_table()ˋ level meaning that after partition filters, we are creating a pyarrow dataset with all the remaining files, and pyarrow is then responsible of reading each file metadata to apply the filters. For high latency file system, we would be much better off applying the filters on the file list based on the table statistics.

This is already the case in the rust crate with the find_files function from the delta_data_fusion module. I understand that the predicate is not given in a DNF form nor a pyarrow.Expression, but couldn't we expose an api that would provide this kind of filtering capacity ?

Related Issue(s)

ion-elgreco commented 5 days ago

We already do this, we create pyarrow dataset fragments, each of these file fragments receives a partition expression.

https://github.com/delta-io/delta-rs/blob/25ce38956e25722ba7a6cbc0f5a7dba6b3361554/python/deltalake/table.py#L1189-L1199

Each partition expression is created from the log file stats:

https://github.com/delta-io/delta-rs/blob/25ce38956e25722ba7a6cbc0f5a7dba6b3361554/python/src/lib.rs#L857-L874