W-Ely opened 5 months ago
We suspect that the check for unknown commit status in GlueTableOperations does not account for all situations in which the commit status may be unknown. For example, an OperationTimeoutException that is not retried is currently treated as a commit failure, when the outcome should instead be treated as unknown.
We would recommend that the commit status default to unknown, and only be flipped to failure for known failure modes or when the commit check itself determines that the commit did not apply.
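The proposed defaulting can be sketched as follows (an illustrative Python sketch, not the actual Java implementation; the exception class names here stand in for the AWS SDK's exceptions):

```python
from enum import Enum

class CommitStatus(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNKNOWN = "unknown"

# Illustrative stand-ins for AWS SDK exception types.
class ValidationException(Exception): ...
class OperationTimeoutException(Exception): ...

# Exceptions that provably mean the commit was not applied server-side.
KNOWN_FAILURES = (ValidationException,)

def resolve_commit_status(error):
    """Default to UNKNOWN; report FAILURE only for errors that
    guarantee the commit did not apply."""
    if error is None:
        return CommitStatus.SUCCESS
    if isinstance(error, KNOWN_FAILURES):
        return CommitStatus.FAILURE
    # e.g. OperationTimeoutException: the server may have committed
    # even though the response never reached the client.
    return CommitStatus.UNKNOWN
```

With this shape, an un-retried timeout falls through to UNKNOWN rather than being misclassified as a failure.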
To add a few more investigation results: server-side timeouts are not retried, and the SDK assigns them a 4xx-class error code. However, a server-side timeout may still correspond to a successful commit (all business logic runs, and the connection times out while transmitting the result).
Apache Iceberg version
1.4.2
Query engine
Spark
Please describe the bug 🐞
We are running on AWS EMR so the version is technically 1.4.2-amaz-0.
We found that approximately 10 seconds after the S3 file(s) were written and the Glue metadata entry was updated, the S3 files were deleted, but the metadata location in Glue was not reset, so the current pointer still referenced the deleted S3 file.
We had to manually update the Glue entry, correcting the current metadata location to point to the previous one and updating the previous metadata location to the correct predecessor.
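The manual repair we performed can be sketched roughly like this (a boto3 sketch under assumptions: the helper name is ours, the set of read-only fields is non-exhaustive, and `metadata_location` / `previous_metadata_location` are the Glue table parameters Iceberg uses for its pointers):

```python
# Read-only fields that get_table returns but that update_table's
# TableInput does not accept (non-exhaustive; verify against the
# Glue API reference before use).
READ_ONLY_FIELDS = {
    "DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
    "IsRegisteredWithLakeFormation", "CatalogId", "VersionId",
}

def repaired_table_input(table, correct_current, correct_previous):
    """Build a TableInput with the metadata pointers fixed:
    metadata_location -> the last metadata file that still exists,
    previous_metadata_location -> its predecessor."""
    table_input = {k: v for k, v in table.items() if k not in READ_ONLY_FIELDS}
    params = dict(table_input.get("Parameters", {}))
    params["metadata_location"] = correct_current
    params["previous_metadata_location"] = correct_previous
    table_input["Parameters"] = params
    return table_input

def repair(database, name, correct_current, correct_previous):
    import boto3  # imported lazily so the pure helper above is testable offline
    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=name)["Table"]
    glue.update_table(
        DatabaseName=database,
        TableInput=repaired_table_input(table, correct_current, correct_previous),
    )
```

This is the equivalent of the hand edit we made in the console, not something Iceberg does for you.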
Because the current pointer still referenced the metadata file that had been deleted from S3, this stack trace was raised:
We suspect that when a Glue operation succeeds but the success is never communicated back to the client, the cleanup logic around https://github.com/apache/iceberg/blob/9de693f1e7f46024f47cdc971d8603fd76d87705/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L383 does not behave correctly.
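The suspected failure mode can be illustrated with a small sketch (Python, all names hypothetical): cleanup of newly written files should only run when the commit is known to have failed, because deleting them on an unknown outcome strands the Glue pointer if the commit actually succeeded server-side.

```python
class CommitStateUnknownError(Exception):
    """Raised when we cannot tell whether the commit applied."""

def commit_with_cleanup(do_commit, delete_files, files):
    """Delete newly written metadata files only when the commit is
    known to have failed; on an unknown outcome, leave them in place."""
    try:
        do_commit()
        return "committed"
    except CommitStateUnknownError:
        # The commit may have succeeded server-side; deleting the
        # files now could leave Glue's metadata_location pointing
        # at an object that no longer exists -- the bug we observed.
        return "unknown"
    except Exception:
        delete_files(files)  # safe: the commit definitely failed
        raise
```

If a server-side timeout is misclassified as a definite failure instead of an unknown outcome, control flows into the deleting branch and produces exactly the dangling-pointer state described above.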
Similar, but not identical, issues: https://github.com/apache/iceberg/issues/9411 , https://github.com/apache/iceberg/issues/8927