estuary / connectors

Connectors for capturing data from external data sources

materialization: materialize CDC format #49

Closed jgraettinger closed 3 weeks ago

jgraettinger commented 2 years ago

Background

Materialize is an interesting materialization target we'd like to explore.

It has an idealized change-data-capture format that is rooted in the way timely dataflow works. Here's a longer blog post with more background. tl;dr is that Materialize wants to consume updates like:

update (record0, time0, +2)
update (record1, time0, +1)
update (record2, time0, +1)
progress (time0, 3)
update (record1, time1, -1)
update (record2, time1, +1)
progress (time1, 2)
update (record0, time2, -1)
update (record2, time2, -1)
progress (time2, 2)
progress (time3, 0)

Materialize innately wants consistent differential delta updates of complete rows. record0, record1, etc. are restatements of a record as it existed at a point in time, and the +1 / -1 are updates of that record's count in a multi-set -- which is differential dataflow's fundamental abstraction. Given a consistent and correct CDC stream of this kind, Materialize is able to avoid storing the source table in memory and operate only on the change stream, which I understand to be the desired ideal.
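For concreteness, here's a rough Go rendering of those two message kinds as described above (the types and field names are illustrative only, not an actual Materialize API):

```go
package cdc

import "encoding/json"

// Update restates a complete record as it existed at a logical time, along
// with a multiset delta: +1 adds one copy of the record, -1 retracts one.
type Update struct {
	Record json.RawMessage // the full row: record0, record1, ...
	Time   uint64          // logical timestamp: time0, time1, ...
	Diff   int64           // multiset delta: +2, +1, -1, ...
}

// Progress closes out a timestamp: it asserts that exactly Count updates were
// sent at Time, and that no further updates at Time will follow.
type Progress struct {
	Time  uint64
	Count uint64
}
```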

However, CDC solutions generally don't provide that, either because of weak consistency semantics or because the needed detail isn't available in the logical log. For example:

A Flow + Materialize solution therefore looks interesting for a few reasons:

Assumptions

1) We don't want to pass through a first-class notion of Flow's internal CDC log time, because it's not helpful to tell Materialize about timepoints in the distant past. In other words, time0, time1, etc. should always be ~now. This is because Materialize (and timely dataflow as a whole) has a global understanding of the marching progress of time. You can't usefully tell it timepoints of a months-old capture that's just been stood up as a new source. This may be wrong - I recall that hyper-times are possible in Timely Dataflow, but that support may not be in Materialize just yet.

2) Materialize sources are all pull-oriented in nature. A successful integration would require a push API where Flow tells Materialize of update and progress messages, and receives ACKs back in reply. An ACK means that Materialize has retained all CDC records through the ACK and is able to replay them on its startup, as happens internally with current native sources. I assume this is feasible. The API could be either streaming -- which would require explicit ACKs -- or synchronous, such as an HTTP PUT with the ACK conveyed by an OK response.
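To make assumption 2) a bit more concrete, here's a hypothetical Go interface for the kind of push API being imagined; nothing like it exists in Materialize today, and every name here is invented:

```go
package cdc

import "context"

// PushClient is a hypothetical client of a Materialize push endpoint. A nil
// error from Ack would mean Materialize has durably retained every update and
// progress message sent so far, and can replay them on its own restart.
type PushClient interface {
	// Update stages one differential update: a complete record, a logical
	// time, and a multiset delta.
	Update(ctx context.Context, record []byte, time uint64, diff int64) error
	// Progress declares that exactly count updates were sent at time, and
	// that no more will follow at that time.
	Progress(ctx context.Context, time uint64, count uint64) error
	// Ack blocks until Materialize acknowledges everything sent so far.
	Ack(ctx context.Context) error
}
```

A streaming variant would interleave explicit ACK messages instead of a blocking Ack call; a synchronous variant collapses all of this into a single HTTP PUT whose OK response is the ACK.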

3) Breaking Change / Contentious if Materialize sees a progress at time1, then updates at time2, and then another progress at time1 -- it must treat the second progress at time1 as a roll-back and discard the intermediary time2 updates.

If 1) above is correct, then this is going to be required by any exactly-once integration with a streaming system, because that system a) must align its own transaction boundaries with Materialize "progress" updates for correctness, but also b) dynamically sizes its transaction boundaries based on stream behavior (like data stalls).

If 1) is incorrect, then the stream itself could encode repeatable times sent as update / progress. However, this still precludes an important optimization: being able to roll up / compact a larger set of differential updates into a smaller set that's actually sent to Materialize.

To illustrate the specific concern, suppose that Materialize instead decided to discard all but the last-received update of a record at timeT. This is reasonable at first blush, because if transactions are consistent in their selection of timeT, then a failed transaction would presumably be followed by a re-statement of the record at timeT. However, this fails to account for the second, post-recovery transaction being smaller than the transaction that failed. In other words, a transaction could read through offset N of a stream and then fail after sending updates at timeT. A next transaction could read through offset M (< N) and also send updates at timeT -- but it would not re-state the records between M and N, which were already retained from the failed transaction and will be sent again once a later transaction reads past offset M, so those records are effectively sent twice.
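A tiny simulation of that failure mode, under the hypothetical keep-only-the-last-update policy (offsets, record names, and counts are all invented for illustration):

```go
package main

import "fmt"

func main() {
	// One record per stream offset; transactions A and B both use timeT.
	stream := []string{"r0", "r1", "r2", "r3", "r4"}
	counts := map[string]int{} // how many times each record ultimately gets applied

	// Transaction A reads through offset N=4, sends updates at timeT, then fails.
	// Under "keep only the last-received update of a record at timeT", each
	// record it stated is retained until a later statement replaces it.
	for _, r := range stream[:5] {
		counts[r] = 1
	}
	// Transaction B (the retry) reads only through offset M=2 (< N) and also
	// sends updates at timeT. Its statements of r0..r2 replace A's, but r3 and
	// r4 are never re-stated, so A's copies of them survive.
	for _, r := range stream[:3] {
		counts[r] = 1
	}
	// A later transaction finally reads offsets 3..4 and sends r3 and r4 again
	// at a new timestamp, so they end up applied twice.
	for _, r := range stream[3:5] {
		counts[r]++
	}
	fmt.Println(counts) // map[r0:1 r1:1 r2:1 r3:2 r4:2]
}
```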

Desired Outcome

We can PoC a materialization connector and end-to-end example which drives a proposed / currently-non-existent Materialize push API. It would:

Connector implementation sketch

The connector supports only delta-updates, and assumes (& eventually verifies) that source collections are ~schematized as:

{"record":{"the":"record"},"delta":-2}

State

The connector uses driver checkpoints managed in recovery logs, and doesn't rely on transactions within Materialize (not supported).

The driver checkpoint is:

{
  progress: {timestamp, num-records},
  next: timestamp,
}

progress is an intent to send a progress update, which will be written once the current transaction has been committed to the materialization's recovery log. On startup, the first thing the connector does is tell Materialize of the "progress" in its recovered checkpoint. This ensures we cannot fail to deliver "progress" if a txn committed, and is the same technique used for ACK intents within Gazette's exactly-once semantics.

next is the timestamp to use in the next transaction. The rationale for committing it into the driver state, rather than having each transaction independently pick the current time, is that "update" messages of a failed transaction are then guaranteed to be repeated verbatim by a successor transaction. If assumption 3) doesn't pan out, then Materialize would at least discard the earlier "updates" which are explicitly repeated (TODO: this is unsatisfying and still needs design work).
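A rough Go rendering of that driver checkpoint, restating the two fields' roles as comments (field names are illustrative):

```go
package cdc

// ProgressIntent is a progress update the connector intends to deliver to
// Materialize once the current transaction has committed to the
// materialization's recovery log.
type ProgressIntent struct {
	Timestamp  uint64 `json:"timestamp"`
	NumRecords uint64 `json:"num-records"`
}

// DriverCheckpoint is the connector state committed into the recovery log.
// On startup the connector first replays Progress to Materialize, so a
// committed transaction can never fail to have its progress delivered.
type DriverCheckpoint struct {
	Progress ProgressIntent `json:"progress"`
	// Next is the timestamp the next transaction will use. Committing it ahead
	// of time means a failed transaction's updates are repeated verbatim by
	// its successor.
	Next uint64 `json:"next"`
}
```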

Transaction lifecycle:

snowzach commented 2 years ago

After looking at this and reviewing the Materialize CDC format, I come away with the following question: is the Materialize time value really important as an actual time, or is it more important that it's just a monotonically increasing value? It seems to me that if we could come up with a function of our checkpoint value + key ranges that would produce an increasing value, we could use that in place of the Materialize time value. The only issue I can think of is that multiple key ranges working on the same checkpoint might force a re-ordering in Materialize if they come in out of order.

For example, if you have 4 shards working on the same materialization, each shard would get time 1, 2, 3, and 4, and each could provide the progress update for its relevant time (1-4). If shard 3 finished before shard 2, though, then when shard 2 pushed its update it would make Materialize re-order those updates (even though in theory that's not necessary). I think that's part of the Materialize magic, though, that it can apply those efficiently.
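Purely to illustrate that idea (this is not how Flow actually assigns anything), a function that folds a checkpoint sequence number and a shard index into one increasing value could look like:

```go
package main

import "fmt"

// shardTime maps a (checkpoint sequence, shard index) pair to a distinct,
// monotonically increasing value usable in place of a Materialize time.
// With 4 shards, checkpoint 0 yields times 1..4, checkpoint 1 yields 5..8, etc.
func shardTime(checkpointSeq, shardIndex, totalShards uint64) uint64 {
	return checkpointSeq*totalShards + shardIndex + 1
}

func main() {
	for cp := uint64(0); cp < 2; cp++ {
		for shard := uint64(0); shard < 4; shard++ {
			fmt.Printf("checkpoint %d, shard %d -> time %d\n", cp, shard, shardTime(cp, shard, 4))
		}
	}
}
```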

So the questions to be answered, I think, are:

  1. Can the time value in Materialize be used only for ordering (and not actually be tied to a real time)? (And does it matter what the time really is?)
  2. Can we produce that number across shards such that each checkpoint from the source corresponds to increasing values across shards in the materialization?
  3. Is there any sort of significant performance penalty if the time values, despite always increasing, come in out of order?
jgraettinger commented 2 years ago

Okay, I spent a bunch of time spelunking through the codebase and engineering design docs of materialize, and learned a lot. Top of mind:

A sketch that's taking shape in my head:

Okay, now the harder parts:

(I think we can get started with just the above, writing raw data directly into a Gazette journal it creates on demand as part of its Apply RPC. However, to do this 100% correctly, we need a bit more):

jgraettinger commented 2 years ago

Next steps

There's some lingering "can this even work?" risk in the plan above that we can cheaply verify:

If we can demonstrate a capability to feed materialize in this way, I have confidence in the rest of the design sketch.

snowzach commented 2 years ago

@jgraettinger I currently have it writing a static file in Avro OCF format in the Debezium envelope and having it read by Materialize. It is working, but not totally as I would expect. I need to research that more and see if it's a problem with my format or something else I am doing. (Essentially I'm writing a bunch of values and creating a materialized view with the average of those values, and I'm getting the last value instead of the average.)

snowzach commented 2 years ago

Unfortunately I don't think a pipe is going to work for this case. As part of opening the OCF file/pipe, it appears to open it and then immediately close it, ending with a broken pipe error. Perhaps it's just reading the schema before re-opening the file. Not sure, but at any rate I tried a combination of things to get it to continue reading and it doesn't seem to work. I've tried immediately re-opening the pipe, but it's disconnected at that point and not pulling anything else. I've also tried the Materialize TAIL option, as well as not including it, but it doesn't seem to help.

The only other thing I could think to try is emulating a Kafka broker/server to feed it with data. This is probably not a trivial solution, but it could be useful for other destinations such as ClickHouse, which I know also prefers to read data from Kafka.

jgraettinger commented 2 years ago

Yea, it opens the file to read the Avro OCF header, and will close it again on drop, before opening it for real to read data.

It may still be possible to get this working by going a little lower level. Specifically, the writer to the pipe could handle the returned SIGPIPE error by closing its descriptor and immediately opening it again for writing. Essentially it restarts itself as soon as the reader goes away. Not 💯 this can work but I don't see why not.
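A small Go sketch of that restart-on-broken-pipe idea (the path and framing are hypothetical; note that in Go a write to a broken pipe on an ordinary descriptor surfaces as an EPIPE error rather than a fatal SIGPIPE):

```go
package main

import (
	"errors"
	"os"
	"syscall"
)

// writeReopeningOnEPIPE writes frames to a named pipe and, whenever the reader
// goes away (the write fails with EPIPE), closes its descriptor and
// immediately re-opens the pipe for writing -- the restart behavior described
// above. Illustrative only; error handling is simplified.
func writeReopeningOnEPIPE(path string, frames [][]byte) error {
	f, err := os.OpenFile(path, os.O_WRONLY, 0) // blocks until a reader opens the FIFO
	if err != nil {
		return err
	}
	defer func() { f.Close() }()

	for i := 0; i < len(frames); {
		if _, err := f.Write(frames[i]); err != nil {
			if errors.Is(err, syscall.EPIPE) {
				// The reader closed its end: restart by re-opening the FIFO,
				// which again blocks until the reader comes back.
				f.Close()
				if f, err = os.OpenFile(path, os.O_WRONLY, 0); err != nil {
					return err
				}
				continue // retry the same frame
			}
			return err
		}
		i++
	}
	return nil
}

func main() {
	// Hypothetical FIFO created beforehand with `mkfifo /tmp/materialize.ocf`.
	_ = writeReopeningOnEPIPE("/tmp/materialize.ocf", [][]byte{[]byte("example frame\n")})
}
```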

snowzach commented 2 years ago

This actually appears to be a pretty decent starting point for speaking the Kafka wire protocol: https://github.com/travisjeffery/jocko. I've already got it to the point where it's trying to communicate and decoding the first message. I just need to follow the code and feed it everything it needs at this point.
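For anyone following along, a bare-bones Go sketch of the very first step of that approach, independent of jocko: accept a connection and decode the common Kafka request header. A real broker emulation would also have to handle the newer flexible ("tagged") header versions and then dispatch on api_key.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
)

// readRequestHeader reads one Kafka protocol request frame from conn and
// decodes the v1 request header: a 4-byte size prefix, then api_key (int16),
// api_version (int16), correlation_id (int32), and a length-prefixed
// client_id string.
func readRequestHeader(conn net.Conn) error {
	var size int32
	if err := binary.Read(conn, binary.BigEndian, &size); err != nil {
		return err
	}
	frame := make([]byte, size)
	if _, err := io.ReadFull(conn, frame); err != nil {
		return err
	}
	apiKey := int16(binary.BigEndian.Uint16(frame[0:2]))
	apiVersion := int16(binary.BigEndian.Uint16(frame[2:4]))
	correlationID := int32(binary.BigEndian.Uint32(frame[4:8]))
	clientIDLen := int16(binary.BigEndian.Uint16(frame[8:10]))
	clientID := "" // a length of -1 means a null client_id
	if clientIDLen > 0 {
		clientID = string(frame[10 : 10+int(clientIDLen)])
	}
	fmt.Printf("api_key=%d version=%d correlation_id=%d client_id=%q\n",
		apiKey, apiVersion, correlationID, clientID)
	return nil
}

func main() {
	// Pretend to be a broker on the usual Kafka port and decode the first
	// request a client sends (typically ApiVersions, api_key 18).
	ln, err := net.Listen("tcp", "127.0.0.1:9092")
	if err != nil {
		panic(err)
	}
	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	_ = readRequestHeader(conn)
}
```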

melody413 commented 8 months ago

Hi, I am migrating DynamoDB to Postgres using Estuary, but the Postgres connector is not working well. How do I set the Postgres address (host and port)? The Postgres server is running on my local PC.

Please help me asap. Thanks

dyaffe commented 8 months ago

Hi @melody413,

We're happy to help you here! We're a cloud service, so we generally connect to endpoints that can be accessed via the internet. If you have a database that's on localhost, the best bet is to set up an SSH tunnel as described here in our docs.

Let me know if you have any questions!

-Dave

williamhbaker commented 3 weeks ago

I'm going to close this out since I think the original intent has been accomplished through the implementation of the Kafka read gateway for Flow, and additional work on that is tracked separately.