Open FranMorilloAWS opened 1 day ago
The discussion here might help you: https://github.com/apache/iceberg/pull/10935#issuecomment-2294754336
Hi @pvary, it is still not clear in which scenarios Flink decides to write a positional delete or an equality delete when committing the snapshot. I have also seen snapshot commits that contain both. There is an Alibaba blog post that mentions this is done to avoid inconsistencies, but again it is not clear and not documented anywhere: https://www.alibabacloud.com/blog/how-to-analyze-cdc-data-in-iceberg-data-lake-using-flink_597838
Equality delete:
Positional delete:
Why do we need to use both? Is there an example scenario we can go over? Thanks in advance for answering :)
Imagine a scenario where a specific ID is updated twice. An equality-based delete alone is not enough in this case to remove the outdated first record while keeping the second one. A positional delete alone is not enough either, since we would need to find the data file and the specific row number in order to delete the record. In edge cases this would require a full table scan for every record...
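To make the "updated twice" scenario concrete, here is a minimal toy model in Python. It is a sketch, not the Flink writer itself. It assumes the rule from the Iceberg table spec that an equality delete only applies to data files with a strictly lower sequence number, while a position delete can also target a data file written in the same commit. The names `DataFile`, `write_checkpoint` and `scan` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    seq: int      # sequence number of the commit that added the file
    rows: list    # list of (id, value) tuples


def write_checkpoint(seq, path, changes, use_pos_deletes=True):
    """Toy writer (hypothetical): turn a stream of upserts into one data file plus deletes."""
    rows, eq_deletes, pos_deletes = [], [], set()
    inserted_at = {}  # id -> row position inside this checkpoint's data file
    for row_id, value in changes:
        if use_pos_deletes and row_id in inserted_at:
            # The old version was written in this same checkpoint, so its
            # file and position are known: delete it by position.
            pos_deletes.add((path, inserted_at[row_id]))
        else:
            # The old version (if any) lives in an earlier commit and its
            # location is unknown: delete it by key.
            eq_deletes.append((seq, row_id))
        inserted_at[row_id] = len(rows)
        rows.append((row_id, value))
    return DataFile(path, seq, rows), eq_deletes, pos_deletes


def scan(data_files, eq_deletes, pos_deletes):
    """Toy reader: apply both kinds of deletes and return the live rows."""
    live = []
    for f in data_files:
        for pos, (row_id, value) in enumerate(f.rows):
            if (f.path, pos) in pos_deletes:
                continue
            # Assumed rule: equality deletes only hit strictly older data files.
            if any(d_id == row_id and d_seq > f.seq for d_seq, d_id in eq_deletes):
                continue
            live.append((row_id, value))
    return live


# Commit 1 inserted id=1. Checkpoint 2 then updates id=1 twice.
old_file = DataFile("f1", seq=1, rows=[(1, "a")])
updates = [(1, "b"), (1, "c")]

new_file, eq, pos = write_checkpoint(2, "f2", updates)
with_both = scan([old_file, new_file], eq, pos)

new_file, eq, pos = write_checkpoint(2, "f2", updates, use_pos_deletes=False)
eq_only = scan([old_file, new_file], eq, pos)

print(with_both)  # [(1, 'c')]
print(eq_only)    # [(1, 'b'), (1, 'c')]
```

In this model the equality delete removes the row from the earlier commit without anyone having to locate it, and the position delete removes the intermediate version written in the same checkpoint, which the equality delete cannot reach. With equality deletes only, both versions written in checkpoint 2 survive as duplicates.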
Query engine
Apache Flink
Question
Can somebody explain how delete files are implemented with Apache Flink? Spark only makes use of positional deletes, but Apache Flink seems to use both. I am not sure why Flink would need both.