Open kevinjqliu opened 5 days ago
Thanks @kevinjqliu, I can work on this issue
This will be a fantastic addition to PyIceberg! Thank you for raising this issue @kevinjqliu and @Zyiqin-Miranda 🎉
Thanks @kevinjqliu and @sungwy. I'm starting to add equality delete support to the current plan_files function. I'm not sure whether the current _InclusiveMetricsEvaluator can be used directly to determine whether an equality delete file is relevant to a data file. It seems Iceberg Java uses canContainEqDeletesForFile instead.
My understanding is that position deletes can use lower_bound == upper_bound on the file_path column to filter out irrelevant files quickly, but equality deletes don't have this advantage, so an equality delete can be relevant to any data file within the same partition. Thanks in advance for any insights!
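To make the file_path pruning idea concrete, here is a minimal sketch. It assumes the file_path bounds of a position delete file have already been decoded to strings; PositionDeleteBounds and may_contain_deletes_for are hypothetical names for illustration, not PyIceberg or Iceberg Java APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PositionDeleteBounds:
    """Hypothetical holder for the decoded file_path bounds of one position delete file."""

    lower_bound: Optional[str]
    upper_bound: Optional[str]


def may_contain_deletes_for(bounds: PositionDeleteBounds, data_file_path: str) -> bool:
    """Return False only when the bounds prove the delete file targets a different data file."""
    lower, upper = bounds.lower_bound, bounds.upper_bound
    if lower is None or upper is None:
        # No statistics available: stay conservative and keep the delete file.
        return True
    if lower == upper:
        # Equal bounds mean every row in the delete file references one single path.
        return lower == data_file_path
    # Bounds differ: the delete file spans several data files, so do not prune here.
    return True
```

An equality delete file carries no file_path column, so no comparable per-file shortcut exists for it; any pruning has to rely on partition, sequence number, or column metrics instead.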
The Equality Delete Files and Scan Planning sections of the Iceberg spec are good docs for this.
My general understanding is that equality deletes are applied to all data files (across all partitions, if partitioned).
Position delete files must be applied to data files from the same commit, when the data and delete file data sequence numbers are equal. This allows deleting rows that were added in the same commit.
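For reference, here is a rough sketch of the applicability rules as I read them in the Scan Planning section of the spec. There, an equality delete applies to data files with a strictly lower data sequence number that are in the same partition, or to all such files when the delete file was written with an unpartitioned spec. FileInfo and both functions are hypothetical simplifications, not PyIceberg classes, and the sketch leaves out the file_path check and deletion vectors.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical, simplified stand-in for a manifest entry."""

    sequence_number: int              # data sequence number of the file
    spec_id: int                      # partition spec the file was written with
    partition: Tuple = ()             # partition values under that spec
    unpartitioned_spec: bool = False  # True if spec_id refers to an unpartitioned spec


def _same_partition(data: FileInfo, delete: FileInfo) -> bool:
    # Both the spec and the partition values must match.
    return data.spec_id == delete.spec_id and data.partition == delete.partition


def equality_delete_applies(data: FileInfo, delete: FileInfo) -> bool:
    # Strictly older data only; unpartitioned equality deletes act as global deletes.
    older = data.sequence_number < delete.sequence_number
    return older and (delete.unpartitioned_spec or _same_partition(data, delete))


def position_delete_applies(data: FileInfo, delete: FileInfo) -> bool:
    # Less than or equal, so rows added in the same commit can be deleted.
    not_newer = data.sequence_number <= delete.sequence_number
    return not_newer and _same_partition(data, delete)
```

The only difference in the sequence number comparison is strict versus non-strict, which is what lets a position delete remove rows written in the same commit while an equality delete cannot.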
Feature Request / Improvement
Only position deletes are supported right now: https://github.com/apache/iceberg-python/blob/e5a58b34dd830c6ffea11649613b693f70f7cbb4/pyiceberg/table/__init__.py#L1418
Let's also add support for reading equality deletes.
Position delete PR: https://github.com/apache/iceberg/pull/6775