Resolver should die if it cannot make progress for a long time

apple / foundationdb

Resolver should die if it cannot make progress for a long time #4972

Closed by sfc-gh-tclinkenbeard 2 years ago

sfc-gh-tclinkenbeard commented 3 years ago

If the resolver is unable to serve resolution requests from proxies for a very long time, this does not necessarily trigger a recovery, so the cluster can remain indefinitely in a state where commits fail (until a manual recovery is forced). Instead, the resolver should automatically detect that it is unhealthy and trigger a recovery itself. We should also improve availability testing in simulation to ensure that this case is covered.
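
To make the proposed behavior concrete, here is a minimal sketch of the detection logic in Python. It is not FoundationDB code, and it does not claim to reflect how the linked fix works. The class name, method names, and the `stall_timeout` value are all hypothetical. The idea it illustrates: track when the last resolution request completed, and report a stall only if requests are outstanding and none has completed within the timeout.

```python
from dataclasses import dataclass


@dataclass
class ResolverStallDetector:
    """Hypothetical helper for illustration only; not part of FoundationDB."""

    stall_timeout: float        # assumed knob: seconds without progress before giving up
    last_progress: float = 0.0  # time of the most recent completed request
    pending: int = 0            # requests received but not yet answered

    def on_request(self, now: float) -> None:
        # An idle resolver is not stalled, so start the clock only when
        # work arrives while nothing else is outstanding.
        if self.pending == 0:
            self.last_progress = now
        self.pending += 1

    def on_reply(self, now: float) -> None:
        # Completing any request counts as progress.
        self.pending -= 1
        self.last_progress = now

    def is_stalled(self, now: float) -> bool:
        # Stalled means: work is waiting and nothing has completed for
        # at least stall_timeout seconds.
        return self.pending > 0 and now - self.last_progress >= self.stall_timeout


detector = ResolverStallDetector(stall_timeout=30.0)
detector.on_request(now=100.0)
healthy = detector.is_stalled(now=110.0)  # only 10 s without progress
stalled = detector.is_stalled(now=130.0)  # 30 s without progress
```

In a real resolver, a periodic check that sees `is_stalled` return true would end the resolver role so that the cluster starts a recovery. The simulation test requested above could cover this by blocking resolver replies and asserting that a recovery follows.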

sfc-gh-tclinkenbeard commented 2 years ago

This is addressed by https://github.com/apple/foundationdb/pull/5231.