apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

position delete in BaseEqualityDeltaWriter write function will lead to unstable result when equalityFieldColumns is not null and upsert is false #9299

Open sunnyzhuzhu opened 10 months ago

sunnyzhuzhu commented 10 months ago

The position delete in the write method of BaseEqualityDeltaWriter leads to unstable results when equalityFieldColumns is not null and upsert is false. In this configuration the delete method of BaseEqualityDeltaWriter is not called, but the write method still performs a position delete. The position delete only removes rows whose equality key is held in memory, that is, rows written in the current checkpoint. So if two Flink jobs read the same source data with different checkpoint intervals, they produce different results. I think the position delete is not needed in the write method; calling the delete or deleteKey method of BaseEqualityDeltaWriter is enough.
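The behavior reported above can be illustrated with a toy model. This is a minimal sketch and not Iceberg code: the function write_insert_only and the parameter rows_per_checkpoint are hypothetical names, and a fixed row count per checkpoint stands in for the checkpoint interval. It assumes that each checkpoint starts with an empty in-memory key map and that no equality delete is written on this code path.

```python
from typing import Dict, Iterable, List, Set, Tuple

Row = Tuple[int, str]  # (equality key, payload)


def write_insert_only(rows: Iterable[Row], rows_per_checkpoint: int) -> List[Row]:
    """Toy model of insert-only writes with an equality key and upsert disabled.

    rows_per_checkpoint is a stand-in for the checkpoint interval (assumption).
    """
    all_rows = list(rows)
    table: List[Row] = []  # rows visible after all commits
    for start in range(0, len(all_rows), rows_per_checkpoint):
        # A new checkpoint begins: the in-memory key map starts empty.
        inserted: Dict[int, int] = {}  # key -> position in this checkpoint's data file
        data_file: List[Row] = []
        pos_deleted: Set[int] = set()
        for row in all_rows[start:start + rows_per_checkpoint]:
            key = row[0]
            previous = inserted.get(key)
            if previous is not None:
                # Same key seen earlier in this checkpoint: position-delete it.
                pos_deleted.add(previous)
            inserted[key] = len(data_file)
            data_file.append(row)
        # Commit: rows from earlier checkpoints are left untouched because
        # no equality delete is produced in this model.
        table.extend(r for i, r in enumerate(data_file) if i not in pos_deleted)
    return table


source = [(1, "a"), (1, "b"), (2, "c")]
one_checkpoint = write_insert_only(source, rows_per_checkpoint=3)
three_checkpoints = write_insert_only(source, rows_per_checkpoint=1)
```

With the same source rows, the single-checkpoint run keeps only the last row for key 1, while the run that checkpoints after every row keeps both, which is the instability the report describes.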

pvary commented 10 months ago

Do I understand correctly that you have two different jobs writing to the same table in update mode?

I think this situation should be avoided.

Flink update mode deletes the rows like this:

This is designed for a single writer case, but could work for multiple writers too, with one serious caveat. If you have multiple writers, then you never know which one will checkpoint first. The second checkpoint will overwrite the results of the first one.

So in the end the table contents are defined by the Iceberg commit time, and not the time when the actual update happens.
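The multiple-writer caveat can be sketched in the same toy style. This is an illustration only, not Iceberg code: the function apply_upsert_commits and the writer names are hypothetical, and the model assumes each commit replaces any existing row with the same key.

```python
from typing import Dict, List, Tuple

Commit = Tuple[str, int, str]  # (writer name, key, value)


def apply_upsert_commits(commits: List[Commit]) -> Dict[int, str]:
    """Apply upsert commits in commit order (toy model).

    A later commit replaces rows with the same key from earlier commits,
    regardless of when the update happened at the source.
    """
    table: Dict[int, str] = {}
    for _writer, key, value in commits:
        table[key] = value  # old row for this key is deleted, new row inserted
    return table


a_then_b = apply_upsert_commits([("job-A", 1, "from-A"), ("job-B", 1, "from-B")])
b_then_a = apply_upsert_commits([("job-B", 1, "from-B"), ("job-A", 1, "from-A")])
```

The two orderings leave different values for key 1, so the final table content depends on which job commits last.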

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.