confluentinc / ducktape

System integration and performance tests

Node health checks during pre-allocation takes time and leads to runner client timeouts #339

Closed: stan-is-hate closed this issue 2 years ago

stan-is-hate commented 2 years ago

Symptoms

Unable to receive a response from the driver when running many tests in parallel, especially if the tests require many nodes.

Cause

This error is thrown by the runner_client when it sends a request to the runner and does not get a response back in time. The timeout is 3 seconds per attempt, retried 5 times, so about 15 seconds total.
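For illustration, here is a minimal sketch of that client-side retry behaviour in pyzmq (all names below are hypothetical and the structure differs from ducktape's actual runner_client), following the common "lazy pirate" pattern of recreating the REQ socket after each timed-out attempt:

```python
import zmq

REQUEST_TIMEOUT_MS = 3000   # 3-second wait per attempt
NUM_ATTEMPTS = 5            # 5 attempts => ~15 seconds total

def send_with_retries(endpoint, message):
    """Send a message to the runner and wait for a reply, retrying on timeout.

    Hypothetical sketch only; ducktape's real runner_client is structured
    differently.
    """
    context = zmq.Context.instance()
    for _ in range(NUM_ATTEMPTS):
        # Recreate the REQ socket on each attempt so its strict send/recv
        # alternation is not violated after a timeout.
        socket = context.socket(zmq.REQ)
        socket.connect(endpoint)
        socket.send_json(message)

        poller = zmq.Poller()
        poller.register(socket, zmq.POLLIN)
        if poller.poll(REQUEST_TIMEOUT_MS):
            reply = socket.recv_json()
            socket.close(linger=0)
            return reply

        # No reply within the timeout: discard the socket and retry.
        socket.close(linger=0)
    raise RuntimeError("Unable to receive response from driver")
```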

The issue started happening when v0.10.0 introduced health checks for nodes, removing any node that fails the check from the cluster, with the intention of not scheduling tests on unresponsive nodes.

However, when scheduling multiple tests in a row (the --max-parallel flag), and each test takes multiple nodes, these per-node checks add up and the whole scheduling pass can take a long time.

Each test the runner triggers spawns a runner_client process, which then sends back a ready message. By design, the runner does not attempt to receive messages until it finishes scheduling all available tests, so the first test triggered won't receive a response to its ready message until the runner has scheduled all remaining tests. And since the runner pings each node for each test, that process can take longer than 15 seconds, causing the first runner_client to time out while waiting for a response from the runner.
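Roughly, the ordering looks like the following sketch (all names are hypothetical and simplified; this is not ducktape's actual runner API):

```python
def run_scheduling_pass(runner):
    """Simplified, hypothetical sketch of the problematic ordering."""
    # Phase 1: schedule every test that currently fits in the cluster.
    while runner.has_capacity() and runner.has_pending_tests():
        test = runner.next_test()
        nodes = []
        for node in runner.allocate_nodes(test):
            # One health check (an SSH round-trip) per node, per test.
            if runner.check_node_health(node):
                nodes.append(node)
            else:
                runner.remove_from_cluster(node)
        # The spawned runner_client immediately sends its 'ready' message,
        # but nobody is listening yet.
        runner.spawn_runner_client(test, nodes)

    # Phase 2: only now does the runner drain and answer queued messages.
    # With many tests and many nodes per test, phase 1 can exceed the
    # client's ~15-second retry budget, so the first client gives up.
    for message in runner.drain_messages():
        runner.handle(message)
```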

Mitigation

We've already removed v0.10.0 from PyPI.

We've also updated our internal release process guide to include running the full suite of Apache Kafka tests before each ducktape release. We will move this guide to the official documentation or the GitHub wiki in the near future.

We will be releasing v0.10.1 shortly, which disables node health checks, to unblock master branch development (the master branch has since moved on and more PRs have been merged).

Long-term solutions

Details TBD; the current idea is to process zmq responses in a separate thread in the runner, so that scheduling stays synchronous while responses are handled asynchronously. This should work, but any async and threading work can introduce new issues, so care needs to be taken.
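A minimal sketch of that idea (the endpoint, message framing, and names are assumptions, not ducktape's implementation): a dedicated thread owns the zmq socket, acknowledges client messages as they arrive, and hands the payloads to the synchronous scheduling loop through a queue.

```python
import queue
import threading
import zmq

def start_response_thread(endpoint, inbox):
    """Answer client messages on a dedicated thread.

    Hypothetical sketch of the proposed design; the endpoint, message
    framing, and function names are assumptions, not ducktape's code.
    """
    def serve():
        # The socket is created and used only inside this thread, since
        # zmq sockets are not thread-safe.
        socket = zmq.Context.instance().socket(zmq.ROUTER)
        socket.bind(endpoint)
        while True:
            *envelope, payload = socket.recv_multipart()
            socket.send_multipart(envelope + [b"ack"])  # reply immediately
            inbox.put(payload)  # main loop processes this when convenient

    thread = threading.Thread(target=serve, name="zmq-responses", daemon=True)
    thread.start()
    return thread

# The scheduling loop keeps its synchronous shape and simply drains `inbox`
# between scheduling steps, while clients get their replies right away.
inbox = queue.Queue()
# start_response_thread("tcp://*:5556", inbox)  # port is an assumption
```

With something along these lines, clients would be acknowledged as soon as their message arrives, so the client-side retry budget would no longer depend on how long scheduling and health checks take.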

cc @confluentinc/quality-eng