databricks / iceberg-kafka-connect

Apache License 2.0
Duplicate file path found in the Iceberg metadata snapshot #295

Open jbolesjc opened 1 month ago

jbolesjc commented 1 month ago

We have our connectors running and sinking data to our Iceberg catalog in Glue/S3. However, when we tried to surface the data in Snowflake, a few of these Iceberg tables hit the following error:

Duplicate file path found in the Iceberg metadata snapshot. Please check that your Iceberg metadata generation is producing valid manifest files and refresh to a newer snapshot once fixed.

We are still trying to sort out where/why this is happening by combing through the manifest and snapshot files.

But it looks like the Tabular connector has created some invalid duplicates within the snapshot files.

jbolesjc commented 3 weeks ago

Error in full:

Duplicate file path seen in the Iceberg metadata snapshot. Please check that your Iceberg metadata generation is producing valid manifest files and refresh to a newer snapshot once fixed. File path: '/<catalog>/<table>/data/<filename>.parquet', SnapshotId: '<snapshot_ID>'.

I can confirm that the Tabular connector is periodically writing duplicate file paths into the snapshots. I took the current manifest file and found the snapshot ID referenced in the error. That snapshot's "manifest-list" key pointed to an Avro file. I opened that file and found 4 objects, each pointing to a different metadata Avro file. I opened the first one, which had 4 objects: 2 pairs of duplicates. One of the pairs pointed to the Parquet file referenced in the error.
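
For anyone who wants to repeat the inspection described above, here is a minimal Python sketch of the same check. It assumes the file with the "manifest-list" key is the table metadata JSON, and that the Avro manifests have already been decoded into plain dicts with whatever Avro reader you prefer (decoding is not shown). The field names `snapshots`, `snapshot-id`, `manifest-list`, `status`, and `data_file.file_path` follow the Iceberg table spec. The helper names and the sample paths are hypothetical and are not part of the connector.

```python
from collections import defaultdict

# Manifest entry status values from the Iceberg spec: 0=EXISTING, 1=ADDED, 2=DELETED.
DELETED = 2


def manifest_list_for(metadata, snapshot_id):
    """Return the manifest-list location for one snapshot in a table metadata dict."""
    for snapshot in metadata.get("snapshots", []):
        if snapshot["snapshot-id"] == snapshot_id:
            return snapshot["manifest-list"]
    raise KeyError(f"snapshot {snapshot_id} not found in metadata")


def find_duplicate_paths(manifests):
    """Map each live data file path seen more than once to the manifests listing it.

    `manifests` maps a manifest path to its list of decoded manifest entries.
    Entries marked DELETED are skipped because they are not live in the snapshot.
    """
    seen = defaultdict(list)
    for manifest_path, entries in manifests.items():
        for entry in entries:
            if entry.get("status") == DELETED:
                continue
            seen[entry["data_file"]["file_path"]].append(manifest_path)
    return {path: where for path, where in seen.items() if len(where) > 1}


def _entry(status, file_path):
    """Build a minimal manifest entry for the demo data below (hypothetical helper)."""
    return {"status": status, "data_file": {"file_path": file_path}}


# Hypothetical sample data shaped like the case in this issue:
# one manifest holding 4 entries that form 2 pairs of duplicates.
sample_metadata = {
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": "snap-1.avro"},
        {"snapshot-id": 2, "manifest-list": "snap-2.avro"},
    ]
}
sample_manifests = {
    "m1.avro": [
        _entry(1, "data/a.parquet"),
        _entry(1, "data/a.parquet"),
        _entry(1, "data/b.parquet"),
        _entry(1, "data/b.parquet"),
    ],
    "m2.avro": [
        _entry(2, "data/c.parquet"),
        _entry(1, "data/d.parquet"),
    ],
}

duplicates = find_duplicate_paths(sample_manifests)
for path, where in sorted(duplicates.items()):
    print(path, "listed", len(where), "times in", sorted(set(where)))
```

Running this against every manifest referenced by the failing snapshot's manifest list should show whether the duplicates sit inside a single manifest, as in this case, or are spread across several.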

The Tabular connector had written duplicate file paths.

With snapshot retention set to a minimum of 1 day, whenever this happens my Iceberg table will not be queryable for 24 hours.

This is a problem.