Open jkosh44 opened 2 years ago
Where do the network calls occur during catalog.transact?
I think the Stash transaction happens in catalog.transact, which may or may not involve network calls depending on the stash implementation. Either way, it involves a durable state change, which is probably what I should have said instead of "network calls".
This may just all be part of the command reconciliation work.
All stash operations are atomic, and they happen first, so we only need to worry about things after the stash transaction commits.
Ah, ok. I am under the impression that this is the purpose of command reconciliation.
Now that I've slept on it, I agree, this falls under command reconciliation.
Reopening because command reconciliation doesn't quite fix this. It could leave dropped things orphaned (quite bad for cluster replicas which incur expense), which could be solved by some bootstrap thing that cleans up orphans. If we have to do that work, we might as well implement a better solution that doesn't create orphans in the first place: log intended side effects in the same stash txn, then have a thing that processes it.
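To make the "log intended side effects in the same stash txn" idea concrete, here is a minimal sketch. Everything in it is hypothetical: `Stash`, `SideEffect`, `transact_drop`, and `process_pending` are toy in-memory stand-ins, not Materialize's actual types or APIs. The point it illustrates is that the catalog change and the record of the pending side effect become durable together, and a log entry is removed only after its side effect has been applied.

```rust
use std::collections::BTreeMap;

/// Hypothetical side effect that has to happen outside the durable store.
#[derive(Clone, Debug, PartialEq)]
enum SideEffect {
    DropReplica { id: u64 },
    DropSource { id: u64 },
}

/// Toy stand-in for the stash: catalog items plus a log of pending side effects.
#[derive(Default)]
struct Stash {
    items: BTreeMap<u64, String>,
    pending: Vec<SideEffect>,
}

impl Stash {
    /// Removes catalog items and records the intended side effects in one step.
    /// Validation happens before any mutation, so the change is all-or-nothing.
    fn transact_drop(&mut self, ids: &[u64], effects: Vec<SideEffect>) -> Result<(), String> {
        if let Some(missing) = ids.iter().find(|id| !self.items.contains_key(*id)) {
            return Err(format!("unknown catalog item {}", missing));
        }
        for id in ids {
            self.items.remove(id);
        }
        self.pending.extend(effects);
        Ok(())
    }

    /// Applies pending side effects in order. An entry is removed from the log
    /// only after `apply` succeeds, so a failure leaves it in place for a retry.
    fn process_pending(
        &mut self,
        mut apply: impl FnMut(&SideEffect) -> Result<(), String>,
    ) -> Result<usize, String> {
        let mut done = 0;
        while let Some(effect) = self.pending.first() {
            apply(effect)?;
            self.pending.remove(0);
            done += 1;
        }
        Ok(done)
    }
}
```

With this shape, a crash between the transaction and the side effect cannot orphan a replica: the pending entry survives, and whatever processes the log (at bootstrap or in the background) picks it up again. Because an effect may be retried after a partial failure, `apply` would need to be idempotent.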
I think one potential solution to this is to implement some simplified form of ARIES, where we log everything to a WAL like @mjibson suggests, and then on startup we replay the WAL.
In general I think we should have some form of recovery when the Coordinator starts up. This would also help with any unidentified fault tolerance issues outside of DDL.
Another issue that manifests from this is if we fail after creating some object but before initializing that object's read policy. In that case, the object will essentially cause a memory leak wherever it lives.
What version of Materialize are you using?
v0.23.1-dev
How did you install Materialize?
Built from source
What is the issue?
While performing DDL, Materialize makes multiple network calls to STORAGE, COMPUTE, and Stash during catalog_transact: https://github.com/MaterializeInc/materialize/blob/cf9df56b39d8e63e20e4023739f323481b7d2f21/src/coord/src/coord.rs#L4728-L4771 Additionally, some callers of catalog_transact make additional network calls after the method returns. If the Coordinator crashes after some of these network calls have completed but before they've all finished, then it's likely that the deployment will be left in an inconsistent state. Making these calls idempotent and implementing some form of WAL can help us properly recover after a crash.
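As a rough illustration of why idempotency and a WAL work together, here is a small sketch. `Op`, `Controller`, and `recover` are hypothetical names for this example, and `Controller` is an in-memory stand-in for the remote STORAGE/COMPUTE state, not the real controller API. Each operation sets an object's state outright, so replaying the whole log after a crash converges to the same result no matter how many of the calls had already completed.

```rust
use std::collections::BTreeSet;

/// Hypothetical WAL record describing one external call.
#[derive(Clone, Debug, PartialEq)]
enum Op {
    Create(u64),
    Drop(u64),
}

/// Toy stand-in for remote STORAGE/COMPUTE state.
#[derive(Default, Debug, PartialEq)]
struct Controller {
    objects: BTreeSet<u64>,
}

impl Controller {
    /// Idempotent: applying the same op twice leaves the same state as once.
    fn apply(&mut self, op: &Op) {
        match op {
            Op::Create(id) => {
                self.objects.insert(*id);
            }
            Op::Drop(id) => {
                self.objects.remove(id);
            }
        }
    }
}

/// Replays every logged op in order, then clears the log.
/// Returns the number of ops that were replayed.
fn recover(wal: &mut Vec<Op>, controller: &mut Controller) -> usize {
    let replayed = wal.len();
    for op in wal.iter() {
        controller.apply(op);
    }
    wal.clear();
    replayed
}
```

For example, if the log holds `Create(1)`, `Create(2)`, `Drop(1)` and the Coordinator crashed after only the first call went through, running `recover` on startup leaves just object 2, the same outcome as an uninterrupted run. A real implementation would also have to make the log truncation itself durable.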
Relevant log output
No response
Part of #13204