databricks / iceberg-kafka-connect

Apache License 2.0
213 stars, 47 forks

Duplicates using upsert+PK #260

Open aurany opened 4 months ago

aurany commented 4 months ago

I'm using Debezium and Cassandra in CDC mode with snapshots enabled. I see that when Cassandra takes a snapshot, it also sends regular inserts. These end up as duplicates in Kafka and in the final Iceberg table. I assume this is expected, since the sink actually receives duplicate inserts? Or is there a recommended way of handling this? The only solution I can think of right now is to process snapshots and non-snapshots in separate pipelines.

tabmatfournier commented 4 months ago

You may find this useful: https://tabular.io/webinar-everything-cdc/

Typically, most people take a two-table approach: the raw events go to a first table (which can contain multiple rows per PK because of CDC updates, etc.), and a secondary process then runs some sort of MERGE INTO against a second table, so that it holds only the latest value for each PK.
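
As a rough illustration of the two-table approach described above, here is a minimal in-memory sketch in Python. It is not the connector's implementation and not real MERGE INTO syntax. The event fields (`pk`, `ts`, `op`, `value`) and the op names are hypothetical, and the sketch assumes every event carries a comparable timestamp.

```python
# Sketch of the "raw table -> merged table" idea, using plain Python objects.
# Assumption: each raw CDC event is a dict with hypothetical keys
#   pk (primary key), ts (event timestamp), op ("snapshot" | "insert" |
#   "update" | "delete"), and value (row payload, None for deletes).

RAW_EVENTS = [
    {"pk": 1, "ts": 10, "op": "snapshot", "value": "a"},
    {"pk": 1, "ts": 10, "op": "insert", "value": "a"},  # duplicate of the snapshot row
    {"pk": 1, "ts": 12, "op": "update", "value": "b"},
    {"pk": 2, "ts": 11, "op": "insert", "value": "x"},
    {"pk": 2, "ts": 13, "op": "delete", "value": None},
]


def merge_latest(target, raw_events):
    """Return a new merged table holding only the latest value per PK.

    target maps pk -> (ts, value). The input mapping is not mutated.
    """
    # Step 1: reduce the raw events to the newest event per PK.
    # On equal timestamps, the event that appears later in the list wins.
    latest = {}
    for ev in raw_events:
        cur = latest.get(ev["pk"])
        if cur is None or ev["ts"] >= cur["ts"]:
            latest[ev["pk"]] = ev

    # Step 2: apply those events to the target, similar in spirit to MERGE INTO.
    merged = dict(target)
    for pk, ev in latest.items():
        existing = merged.get(pk)
        if existing is not None and existing[0] > ev["ts"]:
            continue  # the target already holds a newer row for this PK
        if ev["op"] == "delete":
            merged.pop(pk, None)
        else:
            merged[pk] = (ev["ts"], ev["value"])
    return merged


merged_table = merge_latest({}, RAW_EVENTS)
```

Because the merge keeps one row per PK, the duplicate snapshot and insert events for the same key collapse into a single row, and re-running the merge over the same raw events leaves the result unchanged.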

aurany commented 4 months ago

Thank you, the link was very useful! I guess that the upsert functionality is not needed then.

tabmatfournier commented 4 months ago

In my experience, upsert only works for small tables and requires aggressive behind-the-scenes compaction, which can interfere with other operations on the table. For a sizable table, query performance was always poor, which is why most folks go the two-table route.