Closed · dshma closed this 9 months ago
You can recover from this by resetting the offsets, if you have enough retention in Kafka. You'll need to do this for two consumer groups: the Kafka Connect managed CG (named "connect-" plus the connector name by default) and the sink-managed CG (named "cg-control-" plus the connector name by default). Once the data is reprocessed you'll need to dedupe.
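For illustration, here is a minimal sketch of what that offset reset could look like, assuming the kafka-python client, a connector named `my-sink`, and a source topic named `events` (the group names follow the defaults above; everything else is a placeholder for your setup). The same reset is commonly done with the standard kafka-consumer-groups tooling instead. Stop the connector first so neither group has active members:

```python
# Hedged sketch: roll back both consumer groups' committed offsets to a timestamp.
# Assumes the kafka-python client, a connector named "my-sink", and a source topic
# named "events"; all of these are placeholder names. Stop the connector first.
from datetime import datetime, timezone

from kafka import KafkaConsumer, TopicPartition

BOOTSTRAP = "localhost:9092"
# Timestamp (in ms) just before the problem occurred.
ROLLBACK_TO_MS = int(datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)


def rewind(group_id: str, topic: str) -> None:
    consumer = KafkaConsumer(
        bootstrap_servers=BOOTSTRAP,
        group_id=group_id,
        enable_auto_commit=False,
    )
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    consumer.assign(partitions)
    # Find the earliest offset at or after the rollback timestamp for each partition,
    # seek there, then commit the new positions back to the consumer group.
    by_time = consumer.offsets_for_times({tp: ROLLBACK_TO_MS for tp in partitions})
    for tp, located in by_time.items():
        if located is not None:
            consumer.seek(tp, located.offset)
    consumer.commit()
    consumer.close()


# The Kafka Connect managed CG and the sink-managed control CG (default naming).
rewind("connect-my-sink", "events")
rewind("cg-control-my-sink", "events")
```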
Thanks Bryan, yeah, that makes sense. With enough retention we could also achieve the same result using the RemoveOrphanFiles action. But what about when the data has already been deleted from Kafka? It is still present in our object storage (S3), so is there maybe a proper way (even a manual one) to update or create the metadata/manifest files so that they reference the appropriate data files?
Hey @bryanck, a gentle reminder about the question above in case it was missed, thank you!
You will want to roll back the offsets to a point before the problem occurred, then restart the sink. You will end up with a window where the newly written data overlaps the existing data, i.e. you will have dupes. There are various ways you could dedupe; one is to use something like Spark to overwrite any partitions that could contain dupes with deduped data. Iceberg will handle updating the file references in the metadata.
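For reference, a rough sketch of that Spark-based dedupe, assuming PySpark with the Iceberg runtime and catalog already configured, a table `db.events` partitioned by `event_date`, and a unique key `event_id` (the table, columns, and date range are placeholders for your schema). `overwritePartitions()` performs a dynamic overwrite, so only the partitions present in the deduped DataFrame are replaced:

```python
# Sketch: rewrite the partitions that may contain duplicates with deduped data.
# Assumes a Spark session with the Iceberg catalog/runtime configured, a table
# db.events partitioned by event_date, and a unique key event_id; the table,
# column, and date range below are placeholders for your setup.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-dedupe").getOrCreate()

# Restrict to the window where the replayed data overlaps the old data.
overlap = spark.table("db.events").where(
    "event_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-07'"
)

# Drop duplicate rows on the business key within that window.
deduped = overlap.dropDuplicates(["event_id"])

# Dynamic partition overwrite: only the partitions present in `deduped` are
# replaced, and Iceberg updates the file references in the table metadata.
deduped.writeTo("db.events").overwritePartitions()
```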
Gotcha, I appreciate your responsiveness. I'll close the ticket with this comment, since it was more of a question than an issue. Thank you!
Hi there,
Recently we accidentally removed the control-iceberg topic, and after a while noticed that snapshots/manifests stopped being generated. Metadata (metadata.json) and data (Parquet) files continue to be created, but Trino, for instance, won't see the data committed after that point, since the metadata has no reference to a corresponding snapshot (manifest list) file.
Is this perhaps something known/expected, and is there a way to recover from such a state?
BTW: after running Trino's optimize + expire_snapshots + remove_orphan_files on the table, snapshots are being created again starting with a new commit, but there is definitely data loss within the specified retention_threshold.
Thanks!