apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store

https://apple.github.io/foundationdb/

Apache License 2.0

Urgent consistency checker fixes #11734

Closed (spraza closed 1 week ago)

spraza commented 1 week ago

1. As part of a different PR, I found an issue in the urgent consistency checker: because of how it handled duplicate tester requests, the tester recruitment endpoint was marked as permanently failed. As a result, subsequent phases, e.g. the consistency check, got stuck, since that phase relies on getting interfaces for all testers. To fix this, I made the duplicate request handling in the urgent consistency checker more explicit: the tester server drops the request and explicitly sends a duplicate error back to the client. This way, the endpoint is not marked as permanently failed, since no promise was broken.
2. After the change in (1), the urgent consistency checker itself started getting stuck in some tests. Debugging showed that the mapping from shards/batches to tester ids had edge cases in which we kept picking the same testers, which were not able to serve the requests (because they could be getting the duplicate error back). To fix this, I changed the mapping algorithm to be deterministically random, i.e. with high probability we don't pick the same testers for the same shards again and again. Details are in the code comments.
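
The two ideas above can be illustrated with a small, self-contained sketch. This is not the code from this PR (the actual change is in FoundationDB's C++/Flow sources, and the real mapping is described in its code comments). All names here (`TesterServer`, `Reply`, `pick_tester`, the `"duplicate_request"` error string, and the use of SHA-256 as the mixing function) are hypothetical and chosen only for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Reply:
    """Hypothetical reply type: every request gets an explicit answer."""
    ok: bool
    error: Optional[str] = None


@dataclass
class TesterServer:
    """Hypothetical tester that tracks which request ids it has accepted."""
    seen: Set[str] = field(default_factory=set)

    def handle(self, request_id: str) -> Reply:
        # On a duplicate, do no work but still answer with an explicit error.
        # Because a reply is always sent, the caller never observes a broken
        # promise and has no reason to treat the endpoint as failed.
        if request_id in self.seen:
            return Reply(ok=False, error="duplicate_request")
        self.seen.add(request_id)
        return Reply(ok=True)


def pick_tester(batch_index: int, attempt: int, tester_count: int) -> int:
    """Deterministic pseudo-random choice of a tester for a batch.

    The same (batch_index, attempt) always maps to the same tester, but a
    new attempt for the same batch lands on a different tester with high
    probability, so retries do not keep hitting one unresponsive tester.
    """
    key = f"{batch_index}:{attempt}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % tester_count


if __name__ == "__main__":
    server = TesterServer()
    print(server.handle("req-1"))  # first request is accepted
    print(server.handle("req-1"))  # duplicate gets an explicit error reply
    # Testers chosen for batch 3 over successive attempts, with 8 testers.
    print([pick_tester(3, attempt, 8) for attempt in range(5)])
```

Hashing the batch index together with the attempt number is one simple way to get a choice that is reproducible across runs yet varies between retries; a seeded random number generator would serve the same purpose.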

100K Joshua:

20241025-215413-praza-ebe4b5af5aced377ce58d158106b23de790ee4 compressed=True data_size=35842098 duration=5463198 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:01:54 sanity=False started=100000 stopped=20241025-225607 submitted=20241025-215413 timeout=5400 username=praza-ebe4b5af5aced377ce58d158106b23de790ee4e9

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

[ ] The PR has a description, explaining both the problem and the solution.
[ ] The description mentions which forms of testing were done and the testing seems reasonable.
[ ] Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

[ ] This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
[ ] There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

Commit ID: ebe4b5af5aced377ce58d158106b23de790ee4e9
Duration 0:21:46
Result: :white_check_mark: SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

Commit ID: ebe4b5af5aced377ce58d158106b23de790ee4e9
Duration 0:43:08
Result: :white_check_mark: SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-macos on macOS Ventura 13.x

Commit ID: ebe4b5af5aced377ce58d158106b23de790ee4e9
Duration 0:50:50
Result: :white_check_mark: SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-clang-arm on Linux CentOS 7

Commit ID: ebe4b5af5aced377ce58d158106b23de790ee4e9
Duration 0:53:48
Result: :white_check_mark: SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: ebe4b5af5aced377ce58d158106b23de790ee4e9
Duration 0:58:16
Result: :white_check_mark: SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: ebe4b5af5aced377ce58d158106b23de790ee4e9
Duration 1:10:47
Result: :white_check_mark: SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci commented 1 week ago

Result of foundationdb-pr on Linux CentOS 7

Commit ID: ebe4b5af5aced377ce58d158106b23de790ee4e9
Duration 1:20:49
Result: :white_check_mark: SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-clang-ide on Linux CentOS 7

Commit ID: b372b0144f9eca6d5fdff703f284096e69df0ad6
Duration 0:22:00
Result: :white_check_mark: SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

kakaiu commented 1 week ago

LGTM. Thank you for the fix!

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

Commit ID: b372b0144f9eca6d5fdff703f284096e69df0ad6
Duration 0:48:24
Result: :white_check_mark: SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-macos on macOS Ventura 13.x

Commit ID: b372b0144f9eca6d5fdff703f284096e69df0ad6
Duration 0:50:27
Result: :white_check_mark: SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-clang-arm on Linux CentOS 7

Commit ID: b372b0144f9eca6d5fdff703f284096e69df0ad6
Duration 0:51:29
Result: :white_check_mark: SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: b372b0144f9eca6d5fdff703f284096e69df0ad6
Duration 0:52:56
Result: :white_check_mark: SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci commented 1 week ago

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: b372b0144f9eca6d5fdff703f284096e69df0ad6
Duration 1:06:05
Result: :white_check_mark: SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci commented 1 week ago

Result of foundationdb-pr on Linux CentOS 7

Commit ID: b372b0144f9eca6d5fdff703f284096e69df0ad6
Duration 1:12:47
Result: :white_check_mark: SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)