confluentinc / ducktape

System integration and performance tests

Remove bad nodes from cluster during the test execution #325

Closed stan-is-hate closed 2 years ago

stan-is-hate commented 2 years ago

Summary If a node stops responding, we don't want to schedule any more tests on it. We do this by pinging every node as we allocate a subcluster for a test. If a node does not respond, we remove it from the cluster's available nodes permanently.

Implementation notes I've considered multiple ways of implementing this. The cleanest by far would be to move remove_spec and its "sister" methods out of node_container and into the cluster.

However, it would also be the most disruptive, especially if other people are using their own custom cluster implementations.

Instead, I've opted for a relatively minimalistic patch to node_container, the cluster implementations, and the runner. The core of the change is in the node_container.remove_spec() method, which now checks node health before allocating a node, if the node is of type RemoteAccount. It returns a list of good nodes and a list of bad nodes, or raises an exception with the bad nodes included as part of the exception object.
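The allocation-time health check described above can be sketched roughly like this. All names here (allocate_healthy, is_healthy, InsufficientHealthyNodesError) are hypothetical illustrations of the pattern, not ducktape's actual API:

```python
class InsufficientHealthyNodesError(Exception):
    """Hypothetical exception carrying the bad nodes, as described above."""

    def __init__(self, good_nodes, bad_nodes):
        super().__init__("%d node(s) failed the health check" % len(bad_nodes))
        self.good_nodes = good_nodes
        self.bad_nodes = bad_nodes


def allocate_healthy(candidates, num_needed, is_healthy):
    """Split candidates into healthy and unhealthy nodes while allocating.

    Pings each node (via the caller-supplied is_healthy check) before
    counting it toward the allocation. Returns (good, bad) on success,
    or raises with both lists attached if too few healthy nodes remain.
    """
    good, bad = [], []
    for node in candidates:
        (good if is_healthy(node) else bad).append(node)
        if len(good) == num_needed:
            break
    if len(good) < num_needed:
        raise InsufficientHealthyNodesError(good, bad)
    return good, bad
```

In the real patch the health check would be something like an SSH ping on RemoteAccount nodes; the point of the sketch is the return-or-raise contract, which lets the runner permanently drop the bad nodes from the available pool.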

I'm not super happy about this implementation for two reasons:

Testing Tested with unit tests and live runs with vagrant.

stan-is-hate commented 2 years ago

Oh, I missed this question:

One question I do have is how are these unscheduled tests reported? Should we create a new category for unscheduled?

They are reported as FAILED, same as before: any test that doesn't fit the existing cluster will be marked as failed, and the failure message should include the reason (i.e. not enough nodes available).

I haven't changed how unscheduled tests are reported; I've only made the check run every time the cluster changes size, rather than once at the beginning.
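That re-check can be sketched as follows. The names (TestCase, partition_by_fit, min_nodes) are hypothetical stand-ins for ducktape's scheduler internals, shown only to illustrate the "run the fit check on every shrink" behavior:

```python
from collections import namedtuple

# Minimal stand-in for a scheduled test with a node requirement.
TestCase = namedtuple("TestCase", ["name", "min_nodes"])


def partition_by_fit(tests, cluster_size):
    """Split pending tests into those that still fit the cluster and those that don't.

    Intended to be called every time the cluster shrinks (i.e. after bad
    nodes are removed), not just once at startup. Tests in the second list
    would be marked FAILED with a "not enough nodes available" message.
    """
    runnable, too_big = [], []
    for test in tests:
        (runnable if test.min_nodes <= cluster_size else too_big).append(test)
    return runnable, too_big
```

A test needing 5 nodes would stay runnable on a 6-node cluster, but move to the failed list once enough bad nodes have been removed to drop the cluster below 5.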