apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

[feature request] Support reading equality delete files #1210

Open kevinjqliu opened 5 days ago

kevinjqliu commented 5 days ago

Feature Request / Improvement

Only position deletes are supported right now: https://github.com/apache/iceberg-python/blob/e5a58b34dd830c6ffea11649613b693f70f7cbb4/pyiceberg/table/__init__.py#L1418

Let's also add support for reading equality deletes.

Position delete PR https://github.com/apache/iceberg/pull/6775

Zyiqin-Miranda commented 5 days ago

Thanks @kevinjqliu, I can work on this issue.

sungwy commented 5 days ago

This will be a fantastic addition to PyIceberg! Thank you for raising this issue @kevinjqliu and @Zyiqin-Miranda 🎉

Zyiqin-Miranda commented 2 days ago

Thanks @kevinjqliu and @sungwy. I'm starting to add equality delete support to the current plan_files function. I'm not sure whether the current _InclusiveMetricsEvaluator can be used directly to determine whether an equality delete file is relevant to a data file; Iceberg Java seems to use canContainEqDeletesForFile instead. My understanding is that position deletes can use lower_bound == upper_bound on the file_path column to filter out irrelevant files quickly, but equality deletes don't have this advantage, so an equality delete can be relevant to any data file within the same partition. Thanks in advance for any insights!
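
For illustration, the file_path bounds shortcut for position deletes could look roughly like the sketch below. The function name and its arguments are hypothetical and are not part of PyIceberg; it assumes the file_path lower and upper bounds have already been decoded to strings, and it treats missing bounds as "cannot rule the file out".

```python
from typing import Optional


def position_delete_may_reference(
    data_file_path: str,
    lower_bound: Optional[str],
    upper_bound: Optional[str],
) -> bool:
    """Hypothetical helper: can a position delete file reference this data file?

    lower_bound / upper_bound are assumed to be the decoded file_path column
    bounds recorded for the position delete file.
    """
    if lower_bound is None or upper_bound is None:
        # No bounds recorded, so the delete file cannot be skipped safely.
        return True
    # When lower_bound == upper_bound, the delete file targets exactly one
    # data file, and this range check reduces to a path equality check.
    return lower_bound <= data_file_path <= upper_bound
```

An equality delete file has no file_path column, so it offers no comparable shortcut; any pruning would have to rely on the bounds of the equality columns instead.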

kevinjqliu commented 1 day ago

The Equality Delete Files and Scan Planning sections of the Iceberg spec are good docs for this.

My general understanding is that equality deletes are applied to all data files (across all partitions, if partitioned).

Position delete files must be applied to data files from the same commit, when the data and delete file data sequence numbers are equal. This allows deleting rows that were added in the same commit.
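
As a rough sketch of the sequence number rule, based on my reading of the Scan Planning section of the spec: a position delete applies when the data file's data sequence number is less than or equal to the delete file's, while an equality delete applies only when it is strictly less. The FileEntry and DeleteKind types below are hypothetical stand-ins, not PyIceberg classes, and the check deliberately ignores partition matching and file_path filtering.

```python
from dataclasses import dataclass
from enum import Enum


class DeleteKind(Enum):
    POSITION = "position"
    EQUALITY = "equality"


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for a manifest entry; not a PyIceberg class."""

    path: str
    data_sequence_number: int


def sequence_number_allows(
    data_file: FileEntry, delete_file: FileEntry, kind: DeleteKind
) -> bool:
    """Return True if the sequence numbers alone permit applying the delete."""
    if kind is DeleteKind.POSITION:
        # Same-commit data files are included (<=), so rows added in a
        # commit can be deleted by a position delete in that same commit.
        return data_file.data_sequence_number <= delete_file.data_sequence_number
    # Equality deletes only affect strictly older data files (<).
    return data_file.data_sequence_number < delete_file.data_sequence_number
```

With equal sequence numbers, for example, the check passes for a position delete and fails for an equality delete, so an equality delete never removes rows written in its own commit.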