Open vyzo opened 8 years ago
While I'm not arguing against the fact that system invariants should be maintained irrespective of how long snapshots take, I'm wondering how sustainable it is to have a system where snapshot times can grow indefinitely. Even if the underlying issue is fixed, you might still have a system that is unstable. As you are probably aware, larger election timeouts have other negative side effects, such as the system taking too long to recover from legitimate leader failures.
AFAIK, snapshotting is more appropriate when the underlying state does not grow indefinitely. For example: a counter.
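To make the counter-versus-growing-state point concrete, here is a minimal sketch in plain Java. It deliberately does not use the Copycat API; the class and method names (`SnapshotSizeSketch`, `snapshotCounter`, `snapshotMap`) are hypothetical, and the byte layout is just an assumption for illustration. It only shows that a counter serializes to a fixed size while a map's snapshot grows with the number of entries.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the Copycat snapshot API.
class SnapshotSizeSketch {

    // A counter snapshot is a single long: 8 bytes no matter how many
    // increments have been applied to the state machine.
    static byte[] snapshotCounter(long counter) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(counter);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // A map snapshot writes every entry, so its size (and the time spent
    // producing it) grows linearly with the number of entries.
    static byte[] snapshotMap(Map<Long, Long> state) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(state.size());
            for (Map.Entry<Long, Long> entry : state.entrySet()) {
                out.writeLong(entry.getKey());
                out.writeLong(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Map<Long, Long> state = new HashMap<>();
        for (long i = 0; i < 1000; i++) {
            state.put(i, i);
        }
        System.out.println(snapshotCounter(1000).length); // 8
        System.out.println(snapshotMap(state).length);    // 4 + 1000 * 16 = 16004
    }
}
```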
We have run into an issue that eventually leads to state divergence in a cluster, as snapshot times get longer and longer.
We are using a snapshottable state machine implementation.
The state machine maintains a growing map which is part of the snapshot, and we have noticed that as the snapshot time increases, the cluster eventually hits a sequence of events that ends with the divergence of a cluster member. We have observed that all long runs in our tests eventually lead to this state.
The sequence of events extrapolated from our logs is as follows (in a cluster with 3 copycats):
Our distilled logs from runs causing divergence have the following form around the time of divergence:
The only mitigation we have against this problem is to increase the election timeout, so that it is longer than the snapshot time, and try to reduce that time with custom serialization. Indeed, we have observed that when snapshot times don't exceed the timeout, the state remains identical in all members of the cluster. This is not a long-term solution, however, because snapshot time will eventually exceed the election timeout again as our system grows.
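A rough sketch of why the mitigation only buys time, under the assumption (not measured here) that snapshot time grows linearly with the number of map entries. The class `ElectionTimeoutHeadroom`, its methods, and the example numbers are all hypothetical; nothing here is a Copycat API or a figure from our runs.

```java
import java.time.Duration;

// Hypothetical helper; the linear cost model is an assumption for illustration.
class ElectionTimeoutHeadroom {

    // The condition under which we observed no divergence:
    // the snapshot completes before the election timeout elapses.
    static boolean snapshotFitsInTimeout(Duration snapshotTime, Duration electionTimeout) {
        return snapshotTime.compareTo(electionTimeout) < 0;
    }

    // Assuming a fixed serialization cost per entry, the largest entry count
    // whose snapshot time does not exceed the election timeout.
    static long maxEntriesBeforeTimeout(Duration perEntryCost, Duration electionTimeout) {
        if (perEntryCost.isZero() || perEntryCost.isNegative()) {
            throw new IllegalArgumentException("perEntryCost must be positive");
        }
        return electionTimeout.toNanos() / perEntryCost.toNanos();
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(5);   // example value only
        Duration perEntry = Duration.ofNanos(2000); // example value only
        System.out.println(maxEntriesBeforeTimeout(perEntry, timeout)); // 2500000
        System.out.println(snapshotFitsInTimeout(Duration.ofSeconds(6), timeout)); // false
    }
}
```

Under this model, raising the timeout or speeding up serialization moves the threshold but does not remove it: a map that keeps growing crosses any fixed limit eventually.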