Description
One of the big advantages of Delta is that we have statistics at the file level, so we know which files might hold data for a given predicate. However, from my understanding, this is not leveraged at all at the python/pyarrow/pandas level. Predicates are pushed down at the `to_table()` level, meaning that after partition filters are applied, we create a pyarrow dataset with all the remaining files, and pyarrow is then responsible for reading each file's metadata to apply the filters. For high-latency file systems, we would be much better off applying the filters to the file list based on the table statistics.
This is already the case in the Rust crate with the `find_files` function from the `delta_datafusion` module. I understand that the predicate is not given in DNF form or as a `pyarrow.Expression`, but couldn't we expose an API that would provide this kind of filtering capability?
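To make the request more concrete, here is a rough sketch of the kind of pruning I have in mind. It is pure Python and does not call the library: `FileStats`, `may_match` and `prune_files` are hypothetical names, and the hard-coded stats stand in for the per-file min/max values recorded in the Delta log. The predicate is assumed to be a simple conjunction of `(column, op, value)` tuples, similar to one AND-group of a DNF filter.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

# Hypothetical container for the per-file min/max statistics from the Delta log.
@dataclass
class FileStats:
    path: str
    min_values: Dict[str, Any]
    max_values: Dict[str, Any]

Predicate = Tuple[str, str, Any]  # (column, operator, value)

def may_match(stats: FileStats, column: str, op: str, value: Any) -> bool:
    """Return False only when the stats prove the file cannot match."""
    lo = stats.min_values.get(column)
    hi = stats.max_values.get(column)
    if lo is None or hi is None:
        return True  # no statistics for this column: keep the file
    if op == "=":
        return lo <= value <= hi
    if op == "<":
        return lo < value
    if op == "<=":
        return lo <= value
    if op == ">":
        return hi > value
    if op == ">=":
        return hi >= value
    return True  # unknown operator: stay conservative and keep the file

def prune_files(files: List[FileStats], conjunction: List[Predicate]) -> List[str]:
    """Keep the paths of files that may satisfy every predicate in the conjunction."""
    return [
        f.path
        for f in files
        if all(may_match(f, col, op, val) for col, op, val in conjunction)
    ]

# Made-up statistics, for illustration only.
files = [
    FileStats("part-0.parquet", {"id": 0}, {"id": 99}),
    FileStats("part-1.parquet", {"id": 100}, {"id": 199}),
    FileStats("part-2.parquet", {}, {}),  # file written without stats
]

selected = prune_files(files, [("id", ">=", 150)])
print(selected)
```

Only the surviving paths would then be handed to pyarrow, so files whose stats rule them out are never opened. Files without statistics are kept, since pruning must never drop a file that could contain matching rows.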
Related Issue(s)