apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0

Urgent consistency checker fixes #11734

Closed: spraza closed 1 week ago

spraza commented 1 week ago
  1. While working on a different PR, I found an issue in the urgent consistency checker: because of how it handled duplicate tester requests, the tester recruitment endpoint was marked as permanently failed. As a result, subsequent phases (e.g. the consistency check) got stuck, since that phase relies on getting interfaces for all testers. To fix this, I made the duplicate request handling in the urgent consistency checker more explicit: the tester server drops the request and explicitly sends a duplicate error back to the client. This way, the endpoint is not marked as permanently failed, since no promise is broken.

  2. After the change in (1), the urgent consistency checker itself started getting stuck in some tests. Debugging showed that the mapping from shards/batches to tester ids had edge cases where we kept picking the same testers, which were not able to serve the requests (possibly because they were getting the duplicate error back). To fix this, I changed the mapping algorithm to be deterministically random, i.e. with high probability we don't pick the same testers for the same shards again and again. Details are in the code comments.
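
To illustrate the two changes described above, here is a minimal standalone sketch. It is not the code from this PR: `ReplyStatus`, `Tester`, `mix64`, and `pickTester` are hypothetical names, the real tester interface is written in Flow and looks different, and the SplitMix64-style mixer is a stand-in for whatever deterministic random source the actual mapping uses. The sketch only shows the shape of the idea: a duplicate request gets an explicit reply instead of silence, and the tester choice depends on a retry round, so retries are unlikely to keep landing on the same tester.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Hypothetical reply codes; not the real FoundationDB tester interface.
enum class ReplyStatus { Accepted, Duplicate };

// Hypothetical tester server that remembers the request ids it has accepted.
class Tester {
public:
    // A repeated id is dropped, but the caller still gets an explicit
    // Duplicate reply, so it never has to infer failure from silence.
    ReplyStatus handle(std::uint64_t requestId) {
        const bool isNew = accepted_.insert(requestId).second;
        return isNew ? ReplyStatus::Accepted : ReplyStatus::Duplicate;
    }

private:
    std::unordered_set<std::uint64_t> accepted_;
};

// SplitMix64-style finalizer, used here only as a cheap, reproducible bit mixer.
inline std::uint64_t mix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Map a batch of shards to a tester index for a given retry round.
// The same (batchId, round) always yields the same index; a new round
// re-mixes the bits, so retries are unlikely to keep hitting one tester.
inline std::size_t pickTester(std::uint64_t batchId, std::uint64_t round, std::size_t testerCount) {
    assert(testerCount > 0);
    return static_cast<std::size_t>(mix64(mix64(batchId) ^ round) % testerCount);
}
```

In this sketch a caller that receives `ReplyStatus::Duplicate` would bump `round` and call `pickTester` again, rather than treating the tester as failed.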

100K Joshua:

20241025-215413-praza-ebe4b5af5aced377ce58d158106b23de790ee4 compressed=True data_size=35842098 duration=5463198 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:01:54 sanity=False started=100000 stopped=20241025-225607 submitted=20241025-215413 timeout=5400 username=praza-ebe4b5af5aced377ce58d158106b23de790ee4e9

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-macos on macOS Ventura 13.x

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-clang-arm on Linux CentOS 7

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-clang on Linux CentOS 7

foundationdb-ci commented 1 week ago

Result of foundationdb-pr on Linux CentOS 7

kakaiu commented 1 week ago

LGTM. Thank you for the fix!
