Open kongul opened 1 year ago
I was about to create a bug report on this too.
We are experiencing the same issue after upgrading from 1.1.x to 1.4.x. It seems the bug was introduced in 1.2.x by this commit:
Spark: Add the query ID to file names (#6569) https://github.com/apache/iceberg/commit/046a81aa734dc4b61b66c3214ba6888a72d68bc9
Apache Iceberg version
1.2.1
Query engine
Spark
Please describe the bug 🐞
We have a number of Spark jobs that stream data to Iceberg tables. Recently we ran into problems reading those tables: data files had been deleted or overwritten by other data files of a different size (we checked older object versions in the S3 bucket). After investigating, this is what we found.
Here's how the filename is constructed: https://github.com/apache/iceberg/blob/apache-iceberg-1.2.1/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L51-L100 As it says there
Here we can see that `queryId` is passed as `operationId`.
Now let's see what is passed there from the Spark side:
https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L159
https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L134C1-L143
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala
So the stream metadata file contains the queryId, which is persisted across Spark Streaming job restarts; hence your requirement
The [partitionId, taskId, operationId] triplet has to be unique
is violated. A new streaming job run can therefore generate a filename that already exists and overwrite the existing file. https://github.com/apache/iceberg/blob/apache-iceberg-1.2.1/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L91-L100
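
To make the collision concrete, here is a minimal Python sketch. It is not Iceberg's code: `data_file_name` is a hypothetical helper that only approximates the naming scheme in the linked `OutputFileFactory` (partition ID, task ID, operation ID, and a per-writer file counter), and the `.parquet` extension and the sample IDs are assumptions. It also assumes that partition and task IDs can repeat after a restart.

```python
import uuid


def data_file_name(partition_id: int, task_id: int, operation_id: str, file_count: int) -> str:
    # Hypothetical approximation of the file name layout:
    # <partitionId>-<taskId>-<operationId>-<fileCount>.<ext>
    return f"{partition_id:05d}-{task_id}-{operation_id}-{file_count:05d}.parquet"


# Stands in for the queryId that Spark persists in the stream metadata file.
persisted_query_id = str(uuid.uuid4())

# First run of the streaming job writes its first file for partition 0, task 1.
first_run = data_file_name(0, 1, persisted_query_id, 1)

# After a restart the queryId is read back unchanged; if partition and task IDs
# repeat (assumed here), the file counter starts over and the name is identical.
second_run = data_file_name(0, 1, persisted_query_id, 1)

# With an operation ID that is regenerated per run, the names would differ.
fresh_run = data_file_name(0, 1, str(uuid.uuid4()), 1)

collides = first_run == second_run
avoids_collision = first_run != fresh_run
print(first_run, second_run, collides, avoids_collision)
```

In this sketch the first two names are equal, which is the overwrite scenario described above; only an operation ID that changes on every run keeps them apart.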