confluentinc / ducktape

System integration and performance tests

Remove bad nodes from cluster during the test execution #325

Closed stan-is-hate closed 2 years ago

stan-is-hate commented 2 years ago

Summary If a node stops responding, we don't want to schedule any more tests on it. We do this by pinging every node as we allocate a subcluster for a test. If a node does not respond, we remove it from the cluster's available nodes permanently.

Implementation notes I've considered multiple ways of implementing this. The cleanest by far would be to move remove_spec and its "sister" methods out of node_container and into the cluster.

However, it would also be the most disruptive, especially if other people are using their own custom cluster implementations.

Instead, I've opted for a relatively minimalistic patch to node_container, the cluster implementations, and the runner. The core of the change is in the node_container.remove_spec() method, which now checks node health before allocating a node, if the node is of type RemoteAccount. It returns a list of good nodes and a list of bad nodes, or raises an exception with the bad nodes included as part of the exception object.
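The allocation-time health check described above can be sketched roughly like this. All names here (allocate_healthy, is_healthy, InsufficientHealthyNodesError) are hypothetical illustrations of the pattern, not ducktape's actual API:

```python
class InsufficientHealthyNodesError(Exception):
    """Hypothetical exception carrying the bad nodes, as described above."""

    def __init__(self, good_nodes, bad_nodes):
        super().__init__("%d node(s) failed the health check" % len(bad_nodes))
        self.good_nodes = good_nodes
        self.bad_nodes = bad_nodes


def allocate_healthy(candidates, num_needed, is_healthy):
    """Split candidates into healthy and unhealthy nodes while allocating.

    Pings each node (via the caller-supplied is_healthy check) before
    counting it toward the allocation. Returns (good, bad) on success,
    or raises with both lists attached if too few healthy nodes remain.
    """
    good, bad = [], []
    for node in candidates:
        (good if is_healthy(node) else bad).append(node)
        if len(good) == num_needed:
            break
    if len(good) < num_needed:
        raise InsufficientHealthyNodesError(good, bad)
    return good, bad
```

In the real patch the health check would be something like an SSH ping on RemoteAccount nodes; the point of the sketch is the return-or-raise contract, which lets the runner permanently drop the bad nodes from the available pool.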

I'm not super happy about this implementation for two reasons:

Testing Tested with unit tests and live runs with vagrant.

stan-is-hate commented 2 years ago

Oh, I missed this question:

One question I do have is how are these unscheduled tests reported? Should we create a new category for unscheduled?

They are reported as FAILED, same as before: any test that doesn't fit the existing cluster will be marked as failed, and the failure message should include the reason (i.e. not enough nodes available).

I haven't changed how unscheduled tests are reported; I've only made the check run every time the cluster changes size, rather than once at the beginning.
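That re-check can be sketched as follows. The names (TestCase, partition_by_fit, min_nodes) are hypothetical stand-ins for ducktape's scheduler internals, shown only to illustrate the "run the fit check on every shrink" behavior:

```python
from collections import namedtuple

# Minimal stand-in for a scheduled test with a node requirement.
TestCase = namedtuple("TestCase", ["name", "min_nodes"])


def partition_by_fit(tests, cluster_size):
    """Split pending tests into those that still fit the cluster and those that don't.

    Intended to be called every time the cluster shrinks (i.e. after bad
    nodes are removed), not just once at startup. Tests in the second list
    would be marked FAILED with a "not enough nodes available" message.
    """
    runnable, too_big = [], []
    for test in tests:
        (runnable if test.min_nodes <= cluster_size else too_big).append(test)
    return runnable, too_big
```

A test needing 5 nodes would stay runnable on a 6-node cluster, but move to the failed list once enough bad nodes have been removed to drop the cluster below 5.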