Open aurany opened 5 months ago
You may find this useful: https://tabular.io/webinar-everything-cdc/
Typically, most people take a two-table approach: write the raw events to a first table (which can contain multiple rows per PK because of CDC updates, etc.), then run a secondary process that does some sort of MERGE INTO a second table, so that the second table holds only the latest value per PK.
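To make the two-table idea a bit more concrete, here is a minimal in-memory sketch in Python of what the secondary merge step does logically. It is not Iceberg or Spark code: the `ChangeEvent` type, the `op` codes ("c", "u", "d"), the `ts` ordering field, and the `merge_latest` function are all hypothetical names made up for illustration, and it assumes batches are applied in order and that `ts` is a usable ordering column. In a real pipeline the same logic would normally be expressed as a MERGE INTO statement in your query engine.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One row of the raw (first) table. All field names are hypothetical."""
    pk: str                        # primary key of the source row
    op: str                        # assumed op codes: "c" create, "u" update, "d" delete
    ts: int                        # assumed ordering column; higher means newer
    row: Optional[Dict[str, Any]]  # full row image, None for deletes


def merge_latest(target: Dict[str, Dict[str, Any]],
                 events: Iterable[ChangeEvent]) -> Dict[str, Dict[str, Any]]:
    """Return a copy of `target` with a batch of raw events merged in.

    `target` stands in for the second table, keyed by PK.
    """
    # Step 1: within the batch, keep only the newest event per PK.
    # On equal ts, the event that appears later in the batch wins.
    latest: Dict[str, ChangeEvent] = {}
    for ev in events:
        cur = latest.get(ev.pk)
        if cur is None or ev.ts >= cur.ts:
            latest[ev.pk] = ev

    # Step 2: apply each winner to the target (the MERGE INTO part).
    merged = dict(target)
    for pk, ev in latest.items():
        if ev.op == "d":
            merged.pop(pk, None)          # delete if present
        else:
            merged[pk] = dict(ev.row or {})  # insert or overwrite
    return merged


# Raw table: several rows per PK, including a repeated insert for PK "1".
raw_events = [
    ChangeEvent("1", "c", 10, {"id": "1", "name": "a"}),
    ChangeEvent("1", "c", 10, {"id": "1", "name": "a"}),
    ChangeEvent("1", "u", 20, {"id": "1", "name": "b"}),
    ChangeEvent("2", "c", 15, {"id": "2", "name": "x"}),
    ChangeEvent("2", "d", 30, None),
]

current = merge_latest({}, raw_events)
```

Because the merge is keyed on PK, repeated inserts for the same key collapse into one row in the second table, while the raw table still keeps every event for auditing or replay.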
Thank you, the link was very useful! I guess that the upsert functionality is not needed then.
In my experience, upsert only works for small tables and requires aggressive behind-the-scenes compaction, which can interfere with other operations on the table. For a sizable table, query performance was always poor, which is why most folks go the two-table route.
I'm using Debezium and Cassandra in CDC mode with snapshots enabled. I see that when Cassandra makes a snapshot, it also sends regular inserts. In Kafka and in the final Iceberg table these end up as duplicates, which I assume is expected, since the pipeline actually receives duplicate inserts? Or is there a recommended way of handling this? The only solution I can think of right now is to process snapshots and non-snapshots in separate pipelines.