Data files name collision written by Spark Streaming job after it's restart

kongul commented 1 year ago

Apache Iceberg version

1.2.1

Query engine

Spark

Please describe the bug 🐞

We have number of Spark jobs that do stream data to Iceberg tables. Recently we faced issue reading those tables - data files were deleted or overridden by other data files with different size (checked older version in s3 bucket). After Investigation this i what we found.

Here's how filename is constructed https://github.com/apache/iceberg/blob/apache-iceberg-1.2.1/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L51-L100 As it said there

   * Constructor with specific operationId. The [partitionId, taskId, operationId] triplet has to be
   * unique across JVM instances otherwise the same file name could be generated by different
   * instances of the OutputFileFactory.

Here we can see that queryId is passed as operationId

Now let's see what is passed there from Spark side https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L159 https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L134C1-L143 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala

So stream metadata file contain in queryId is persisted across Spark Streaming Jobs restarts, hence your requirement The [partitionId, taskId, operationId] triplet has to be unique is violatet. So new streaming job run can generate the same filename that already exists and override exiting file.

https://github.com/apache/iceberg/blob/apache-iceberg-1.2.1/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L91-L100

dotjdk commented 8 months ago

I was about to create a bug report on this too.

We are experiencing the same issue after upgrading from 1.1.x to 1.4.x. The bug was introduced in 1.2.x it seems in this commit:

Spark: Add the query ID to file names (#6569) https://github.com/apache/iceberg/commit/046a81aa734dc4b61b66c3214ba6888a72d68bc9

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

apache / iceberg