MaterializeInc / materialize

The data warehouse for operational workloads.
https://materialize.com

Reduce Debezium cost with idempotency key #27849

Open chuck-alt-delete opened 1 week ago

chuck-alt-delete commented 1 week ago

Feature request

Debezium offers before and after images for record updates, which we can use to issue diffs in Materialize (sketched after the list below). The issue is that there are certain failure scenarios that completely break this mechanism, e.g.:

  1. Forgotten deletes. Debezium will not see deletes that occur while it is down, and the subsequent snapshot will not emit delete events for them. So Materialize will keep a record even though it has been deleted upstream, which is incorrect.
  2. Duplicates. Restarts in Debezium can cause duplicates, which means the after image of a record will not match the before image of the next record with the same key, leading to an errored source in Materialize.
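
For context, here is a rough sketch of the before/after-to-diff translation (hypothetical types, not Materialize's actual Debezium decoder). If the same event is delivered twice, the second copy retracts a value that was already retracted, which is the kind of mismatch that errors the source.

```rust
// Hypothetical sketch; `Row` and `ChangeEvent` are stand-ins, not Materialize's
// internal types.
#[derive(Clone, Debug, PartialEq)]
struct Row(String);

struct ChangeEvent {
    before: Option<Row>, // None for an insert
    after: Option<Row>,  // None for a delete
}

/// Turn one Debezium-style change event into (row, diff) updates:
/// retract the before image, insert the after image.
fn to_diffs(event: &ChangeEvent) -> Vec<(Row, i64)> {
    let mut diffs = Vec::new();
    if let Some(before) = &event.before {
        diffs.push((before.clone(), -1));
    }
    if let Some(after) = &event.after {
        diffs.push((after.clone(), 1));
    }
    // If this event is delivered twice, the second copy retracts `before`
    // again even though it was already retracted, so counts go negative and
    // the collection is no longer consistent.
    diffs
}
```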

As far as I know, 1 is not solved by the current upsert operator anyway, so we can exclude it from this discussion.

For 2, we currently use an upsert operator to deduplicate and create a canonical before/after image for each key, which we use to issue correct diffs. This comes at a cost: Materialize has to remember this information for every key at all times.
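
For illustration, the state this implies looks roughly like the following (hypothetical names, not Materialize's internal upsert implementation): the latest value for every key is retained indefinitely so that any new event, duplicate or not, can be turned into a retraction of the previous value.

```rust
use std::collections::HashMap;

// Hypothetical sketch of upsert-style state: the last value for *every* key
// is kept forever, which is where the memory cost comes from.
struct UpsertState {
    latest: HashMap<Vec<u8>, Vec<u8>>, // key -> last-seen value
}

impl UpsertState {
    /// Apply a new value (or a delete, if `value` is None) and return the
    /// retraction/insertion pair to emit downstream. A replayed duplicate is
    /// harmless here because the stored value already reflects it.
    fn update(
        &mut self,
        key: Vec<u8>,
        value: Option<Vec<u8>>,
    ) -> (Option<Vec<u8>>, Option<Vec<u8>>) {
        let retracted = match &value {
            Some(v) => self.latest.insert(key, v.clone()),
            None => self.latest.remove(&key),
        };
        (retracted, value)
    }
}
```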

In practice, duplication happens very rarely and in a specific way. Duplicates don’t happen at random times; they typically occur within a very limited timeframe. If we had a short-lived (say, 1 hour) idempotency key that rejects records that were recently received, we would eliminate virtually all duplicates without paying such a high memory cost.
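
A minimal sketch of what I have in mind, with hypothetical names (a real implementation would evict lazily or incrementally rather than scanning on every record):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical sketch: remember each idempotency key only for a short window
// and reject records whose key was seen within that window.
struct TtlDedup {
    ttl: Duration,
    seen: HashMap<Vec<u8>, Instant>, // idempotency key -> time first seen
}

impl TtlDedup {
    fn new(ttl: Duration) -> Self {
        TtlDedup { ttl, seen: HashMap::new() }
    }

    /// Returns true if the record should be processed, false if it looks like
    /// a recently received duplicate.
    fn check_and_insert(&mut self, key: Vec<u8>, now: Instant) -> bool {
        // Drop entries older than the TTL, so memory is bounded by the keys
        // seen in the last window rather than by all keys ever seen.
        let ttl = self.ttl;
        self.seen.retain(|_, seen_at| now.duration_since(*seen_at) < ttl);
        if self.seen.contains_key(&key) {
            false
        } else {
            self.seen.insert(key, now);
            true
        }
    }
}
```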

benesch commented 1 week ago

Thanks for writing this up, Chuck! I think it's absolutely great to ideate on bringing down the cost of upsert sources, and I think there's something quite interesting to explore along these lines, but I wanted to flag that this specific proposal makes me uncomfortable. It feels generally off brand for us to offer correctness with caveats, and even if we did feel this was an appropriate place to sacrifice correctness in the name of cost/performance, it feels like this proposal would result in a level of correctness that is hard to explain and would thus result in an educational and support burden.

If we had a short-lived (say, 1 hour) idempotency key that rejects records that were recently received, we would eliminate virtually all duplicates without paying such a high memory cost.

How would this work when recreating a source on top of a Kafka topic that Debezium has been writing to for a while? As described I think this only works for deduplicating in the steady state.

You could alternatively keep the idempotency key around for some number of Kafka offsets—e.g., idempotency keys are forgotten after 10k or 100k or 1MM additional records are ingested. That's at least a deterministic function of the Kafka topic. But then users would need to estimate the maximum number of records that Debezium would ever duplicate, which seems hard to reason about in a different way.
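
Roughly, something like this (hypothetical names): a key is forgotten once the ingestion frontier has moved some fixed number of records past it, so eviction is a deterministic function of the topic's offsets rather than of wall-clock time.

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical sketch of the offset-based variant.
struct OffsetWindowDedup {
    window: u64,                     // e.g. 10_000, 100_000, or 1_000_000
    seen: HashMap<Vec<u8>, u64>,     // idempotency key -> offset first seen at
    order: VecDeque<(u64, Vec<u8>)>, // (offset, key) in ingestion order
}

impl OffsetWindowDedup {
    /// Returns true if the record at `offset` should be processed, false if
    /// its key was seen within the last `window` records.
    fn check_and_insert(&mut self, key: Vec<u8>, offset: u64) -> bool {
        // Evict keys whose first sighting is more than `window` records back.
        while let Some((old_offset, old_key)) = self.order.front().cloned() {
            if offset.saturating_sub(old_offset) <= self.window {
                break;
            }
            self.seen.remove(&old_key);
            self.order.pop_front();
        }
        if self.seen.contains_key(&key) {
            return false;
        }
        self.seen.insert(key.clone(), offset);
        self.order.push_back((offset, key));
        true
    }
}
```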

chuck-alt-delete commented 1 week ago

You’re welcome! The correctness discomfort is totally fair.

How would this work when recreating a source on top of a Kafka topic that Debezium has been writing to for a while?

I was thinking these keys would be created “lazily” as data is read, so the 1 hour (or 5 hours, or whatever) window would start when the source is created. In that case the initial memory would be just as bad as upsert, because we’d probably read all or nearly all of the keys in the compacted topic within that window, but the difference is that you could size down after the initial load.
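
To make the trade-off concrete, a toy sketch with hypothetical numbers: during hydration every key in the compacted topic lands in the window, so memory peaks near upsert levels, but once the window elapses those entries become evictable and the replica can be sized down.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

fn main() {
    // Hypothetical numbers: a 1-hour window over a compacted topic with ~1M keys.
    let ttl = Duration::from_secs(60 * 60);
    let hydration_start = Instant::now();
    let mut seen: HashMap<u64, Instant> = HashMap::new();

    // Initial load: nearly every key in the compacted topic is read, so the
    // window holds roughly as many entries as upsert state would.
    for key in 0..1_000_000u64 {
        seen.insert(key, hydration_start);
    }
    println!("entries after hydration: {}", seen.len());

    // Once the window has elapsed, the hydration-era entries can be evicted
    // and the source sized down.
    let later = hydration_start + ttl + Duration::from_secs(1);
    seen.retain(|_, seen_at| later.duration_since(*seen_at) < ttl);
    println!("entries after eviction: {}", seen.len());
}
```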