databricks / iceberg-kafka-connect


Is there a way to resume snapshot creation after removing the control topic? #187

Closed: dshma closed this issue 9 months ago

dshma commented 9 months ago

Hi there,

I recently accidentally removed the control-iceberg topic and after a while noticed that snapshots/manifests had stopped being generated. Metadata (metadata.json) and data (Parquet) files continue to be created, but Trino, for instance, won't see the data committed after that point, since the metadata has no reference to a corresponding snapshot (manifest-list) file.

Perhaps this is known/expected behavior, and there is a way to recover from such a state?

BTW: after running Trino's optimize + expire_snapshots + remove_orphan_files on the table, snapshots are being created again starting with a new commit, but there is definitely data loss within the specified retention_threshold.

Thanks!

bryanck commented 9 months ago

You can recover from this by resetting the offsets, if you have enough retention in Kafka. You'll need to do this for two consumer groups: the Kafka Connect managed CG (by default named "connect-" plus the connector name) and the sink managed CG (by default named "cg-control-" plus the connector name). Once the data is reprocessed you'll need to dedupe.
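For illustration, a minimal sketch of the offset rewind using the Kafka AdminClient. The bootstrap servers, connector name, topic/partitions, and timestamp below are all placeholders; check what each group actually tracks with kafka-consumer-groups.sh --describe first, and stop the connector before altering offsets, since offsets of a group with active members can't be changed:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ResetSinkOffsets {
  public static void main(String[] args) throws Exception {
    String bootstrap = "localhost:9092";          // placeholder: your brokers
    String connector = "my-iceberg-sink";         // placeholder: your connector name
    long rewindTo = 1_700_000_000_000L;           // placeholder: epoch millis before the incident
    List<TopicPartition> partitions =
        List.of(new TopicPartition("events", 0)); // placeholder: source topic/partitions

    try (Admin admin = Admin.create(
        Map.<String, Object>of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap))) {

      // Look up the offsets that correspond to the chosen timestamp.
      Map<TopicPartition, OffsetSpec> specs = new HashMap<>();
      partitions.forEach(tp -> specs.put(tp, OffsetSpec.forTimestamp(rewindTo)));
      Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> found =
          admin.listOffsets(specs).all().get();

      Map<TopicPartition, OffsetAndMetadata> target = new HashMap<>();
      found.forEach((tp, info) -> target.put(tp, new OffsetAndMetadata(info.offset())));

      // The two consumer groups mentioned above, with their default names.
      for (String group : List.of("connect-" + connector, "cg-control-" + connector)) {
        admin.alterConsumerGroupOffsets(group, target).all().get();
      }
    }
  }
}
```

The same rewind can also be done with the kafka-consumer-groups.sh CLI using --reset-offsets --to-datetime.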

dshma commented 9 months ago

Thanks Bryan, yeah, that makes sense. Having enough retention would also allow achieving the same result using the RemoveOrphanFiles action. But what about when the data has already been deleted from Kafka? It's actually still present in our object storage (S3). Is there maybe a proper way (manually, as an option) to update/create metadata/manifest files so that they reference the appropriate data files?
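To illustrate what I mean, a rough sketch along the lines of Iceberg's Java AppendFiles API (the table, path, and file stats below are made up; a partitioned table would also need partition values per file, and column-level stats would be missing, so stats-based file pruning is lost for these files):

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

public class RegisterExistingFiles {
  static void register(Catalog catalog) {
    Table table = catalog.loadTable(TableIdentifier.of("db", "events")); // placeholder

    DataFile file = DataFiles.builder(table.spec())
        .withPath("s3://bucket/db/events/data/00000-0-abc.parquet")     // placeholder
        .withFormat(FileFormat.PARQUET)
        .withFileSizeInBytes(123_456L)   // must match the real file
        .withRecordCount(1_000L)         // must match the real file
        .build();

    table.newAppend()
        .appendFile(file)
        .commit();                       // writes a new snapshot referencing the file
  }
}
```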

dshma commented 9 months ago

Hey @bryanck, a gentle reminder about the question above, just in case it was missed. Thank you!

bryanck commented 9 months ago

You will want to roll back the offsets to a point before the problem occurred, then restart the sink. You will end up with a window where the newly written data overlaps the existing data, i.e. you will have dupes. There are various ways to dedupe; one is to overwrite any partitions that could contain dupes with deduped data, using something like Spark. Iceberg will handle updating the file references in the metadata.
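For example, a sketch of the dedupe step with Spark SQL, following the INSERT OVERWRITE group-by pattern from the Iceberg Spark docs. The table name, columns, and overlap window are placeholders, and it assumes `id` uniquely identifies a record and duplicates are confined to the re-consumed window; with dynamic overwrite mode, only the partitions produced by the SELECT are replaced:

```java
import org.apache.spark.sql.SparkSession;

public class DedupeOverlap {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("dedupe-overlap-window")
        .getOrCreate();

    // Replace only the partitions that the SELECT below produces.
    spark.conf().set("spark.sql.sources.partitionOverwriteMode", "dynamic");

    // Collapse duplicate rows per id within the window that was re-consumed
    // after the offset rewind (placeholder dates).
    spark.sql(
        "INSERT OVERWRITE catalog.db.events "
            + "SELECT id, first(ts), first(payload) "
            + "FROM catalog.db.events "
            + "WHERE ts >= TIMESTAMP '2024-05-01 00:00:00' "
            + "  AND ts <  TIMESTAMP '2024-05-03 00:00:00' "
            + "GROUP BY id");
  }
}
```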

dshma commented 9 months ago

Gotcha, appreciate your responsiveness. I'll close the ticket with this comment, since it was more of a question than an issue. Thank you!