MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

Fault Tolerance during DDL #12981

Open jkosh44 opened 2 years ago

jkosh44 commented 2 years ago

What version of Materialize are you using?

v0.23.1-dev

How did you install Materialize?

Built from source

What is the issue?

While performing DDL, Materialize makes multiple network calls to STORAGE, COMPUTE, and the Stash during catalog_transact: https://github.com/MaterializeInc/materialize/blob/cf9df56b39d8e63e20e4023739f323481b7d2f21/src/coord/src/coord.rs#L4728-L4771 Additionally, some callers of catalog_transact make further network calls after the method returns. If the Coordinator crashes after some of these network calls have completed but before they've all finished, then it's likely that the deployment will be left in an inconsistent state.

Making these calls idempotent and implementing some form of write-ahead log (WAL) would let us recover properly after a crash.
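To illustrate the idempotence half of the proposal: if each controller command is phrased as "ensure this state exists" rather than "perform this action", re-applying it after a crash-and-restart is a harmless no-op. A minimal sketch (the `ComputeController` type and its method names are hypothetical, not Materialize's actual controller API):

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for controller state: the set of replica IDs
/// that currently exist. Illustrative only.
pub struct ComputeController {
    replicas: HashSet<u64>,
}

impl ComputeController {
    pub fn new() -> Self {
        ComputeController { replicas: HashSet::new() }
    }

    /// Idempotent create: returns true if the replica was newly created,
    /// false if it already existed. Replaying after a crash is safe.
    pub fn ensure_replica(&mut self, id: u64) -> bool {
        self.replicas.insert(id)
    }

    /// Idempotent drop: dropping an already-absent replica succeeds.
    pub fn ensure_dropped(&mut self, id: u64) -> bool {
        self.replicas.remove(&id)
    }

    pub fn exists(&self, id: u64) -> bool {
        self.replicas.contains(&id)
    }
}
```

With this shape, recovery can blindly re-issue every command from the log without worrying about which ones already took effect before the crash.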

Relevant log output

No response

Part of #13204

sploiselle commented 2 years ago

Where do the network calls occur during catalog.transact?

jkosh44 commented 2 years ago

> Where do the network calls occur during catalog.transact?

I think the Stash transaction happens in catalog.transact, which may or may not involve network calls depending on the stash implementation. Either way, it involves a durable state change, which is probably what I should have said instead of "network calls".

This may just all be part of the command reconciliation work.

maddyblue commented 2 years ago

All stash operations are atomic, and they happen first, so we only need to worry about things after the stash transaction commits.

sploiselle commented 2 years ago

Ah, OK. I am under the impression this is the purpose of command reconciliation.

jkosh44 commented 2 years ago

Now that I've slept on it, I agree, this falls under command reconciliation.

maddyblue commented 2 years ago

Reopening because command reconciliation doesn't quite fix this. It could leave dropped things orphaned (quite bad for cluster replicas, which incur cost), which could be solved by some bootstrap task that cleans up orphans. If we have to do that work anyway, we might as well implement a better solution that doesn't create orphans in the first place: log the intended side effects in the same stash transaction, then have a background process apply them.
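That last idea is essentially a transactional outbox: the catalog change and the intents it implies commit atomically, and a processor later drains the intents. A minimal sketch with an in-memory stand-in for the stash (`SideEffect`, `Stash`, and the method names here are all hypothetical):

```rust
/// Hypothetical intended side effect recorded alongside a catalog change.
#[derive(Clone, Debug, PartialEq)]
pub enum SideEffect {
    CreateReplica(u64),
    DropReplica(u64),
}

/// In-memory stand-in for the stash. In the real system, `catalog` and
/// `pending` would be written durably in one transaction.
#[derive(Default)]
pub struct Stash {
    pub catalog: Vec<String>,
    pub pending: Vec<SideEffect>,
}

impl Stash {
    /// Commit the catalog entry and its intended side effects atomically,
    /// so a crash can never separate the change from its intents.
    pub fn commit(&mut self, entry: &str, effects: Vec<SideEffect>) {
        self.catalog.push(entry.to_string());
        self.pending.extend(effects);
    }

    /// Drain and apply pending effects. Safe to re-run after a crash,
    /// provided `apply` is idempotent.
    pub fn process_pending(&mut self, mut apply: impl FnMut(&SideEffect)) {
        for e in &self.pending {
            apply(e);
        }
        self.pending.clear();
    }
}
```

Because the intents survive a crash between commit and application, a restarted Coordinator can finish (or re-finish) the work instead of leaving orphans behind.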

jkosh44 commented 2 years ago

I think one potential solution to this is to implement some simplified form of ARIES, where we log everything to a WAL as @mjibson suggests and then replay the WAL on startup.

In general I think we should have some form of recovery when the Coordinator starts up. This would also help with any unidentified fault tolerance issues outside of DDL.
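Since DDL side effects here only need to be redone (never undone), the "simplified ARIES" could be a redo-only replay over unfinished log records. A minimal sketch, with a hypothetical `WalRecord` shape and `replay` function for illustration:

```rust
/// Hypothetical WAL record: a monotonically increasing LSN, the operation
/// to perform, and whether it was confirmed complete before the crash.
#[derive(Clone, Debug)]
pub struct WalRecord {
    pub lsn: u64,
    pub op: String,
    pub done: bool,
}

/// Redo-only recovery: on startup, re-apply every record not yet marked
/// done. Requires `redo` to be idempotent, since a crash may have left a
/// record half-applied. Returns how many records were replayed.
pub fn replay(wal: &mut Vec<WalRecord>, mut redo: impl FnMut(&str)) -> usize {
    let mut replayed = 0;
    for rec in wal.iter_mut() {
        if !rec.done {
            redo(&rec.op);
            rec.done = true;
            replayed += 1;
        }
    }
    replayed
}
```

This is much weaker than full ARIES (no undo pass, no fuzzy checkpoints), but it covers the failure mode in this issue: any side effect logged before the crash gets finished on the next boot.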

jkosh44 commented 2 years ago

Another issue that manifests from this: if we fail after creating some object but before initializing that object's read policy, the object will essentially cause a memory leak wherever it lives.