1. While working on a different PR, I found an issue in the urgent consistency checker: because of how it handled duplicate tester requests, the tester recruitment endpoint was marked as permanently failed. As a result, subsequent phases (e.g. the consistency check) got stuck, since that phase relies on getting interfaces for all testers. To fix this, I made the duplicate request handling in the urgent consistency checker more explicit: the tester server drops the request and explicitly sends a duplicate error back to the client. This way, the endpoint is not marked as permanently failed, since no promise is broken.
2. After the change in (1), the urgent consistency checker itself started getting stuck in some tests. Debugging showed that the mapping from shards/batches to tester ids had edge cases in which we kept picking the same testers, and those testers could not serve the requests (because they could be getting a duplicate error back). To fix this, I changed the mapping algorithm to be deterministically random, i.e. with high probability we don't pick the same testers for the same shards again and again. Details are in the code comments.
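
To illustrate the idea behind (2), here is a minimal sketch of a deterministically random mapping. It is not the code in this PR (the real algorithm is described in the code comments). The names `pick_tester` and `assign_shards`, the `attempt` and `seed` parameters, and the use of SHA-256 are all assumptions made for illustration. The point is that the pick is a pure function of its inputs, so it is reproducible, while a new retry attempt re-shuffles which tester a shard lands on.

```python
import hashlib
from typing import Dict, Iterable, Sequence


def pick_tester(shard_id: int, attempt: int, tester_ids: Sequence[str], seed: int = 0) -> str:
    """Pick a tester for a shard on a given retry attempt (hypothetical helper).

    The result depends only on (seed, shard_id, attempt), so it is reproducible,
    but changing `attempt` changes the pick with high probability.
    """
    if not tester_ids:
        raise ValueError("no testers available")
    # Hash the inputs and reduce the digest to an index into tester_ids.
    key = f"{seed}:{shard_id}:{attempt}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(tester_ids)
    return tester_ids[index]


def assign_shards(shard_ids: Iterable[int], attempt: int,
                  tester_ids: Sequence[str], seed: int = 0) -> Dict[int, str]:
    """Map every shard to a tester for this attempt (hypothetical helper)."""
    return {s: pick_tester(s, attempt, tester_ids, seed) for s in shard_ids}
```

With a scheme like this, a shard whose tester answered with a duplicate error is likely to be routed to a different tester on the next attempt, instead of hitting the same one repeatedly.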
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
[ ] The PR has a description, explaining both the problem and the solution.
[ ] The description mentions which forms of testing were done and the testing seems reasonable.
[ ] Every function/class/actor that was touched is reasonably well documented.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
[ ] This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
[ ] There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)
100K Joshua:
20241025-215413-praza-ebe4b5af5aced377ce58d158106b23de790ee4 compressed=True data_size=35842098 duration=5463198 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:01:54 sanity=False started=100000 stopped=20241025-225607 submitted=20241025-215413 timeout=5400 username=praza-ebe4b5af5aced377ce58d158106b23de790ee4e9