EnMasseProject / enmasse

EnMasse - Self-service messaging on Kubernetes and OpenShift
https://enmasseproject.github.io

Broker draining #2445

Open nathan-vp opened 5 years ago

nathan-vp commented 5 years ago

Overview

In certain situations, for example when a node is taken out of commission or a Pod is evicted, a broker will be temporarily unavailable. The messages that were stored on that broker will remain unavailable until its replacement Pod spins up.

For the standard address space, it would be useful to be able to "drain" a broker: the messages that were on that broker would be rerouted (perhaps by sending them back to the Qpid router?) to the other brokers in the deployment, so that all data remains available.

Design

The cleanest way would be to implement a preStop container lifecycle hook.

This hook is called immediately before a container is terminated due to an API request or management event such as liveness probe failure, preemption, resource contention and others. A call to the preStop hook fails if the container is already in terminated or completed state. It is blocking, meaning it is synchronous, so it must complete before the call to delete the container can be sent. No parameters are passed to the handler.

Hook delivery is intended to be at least once, which means that a hook may be called multiple times for any given event, such as for PostStart or PreStop. It is up to the hook implementation to handle this correctly.

A possible gotcha would be that the execution time of the preStop hook is limited by terminationGracePeriodSeconds.

The behavior is similar for a PreStop hook. If the hook hangs during execution, the Pod phase stays Terminating and the Pod is killed after its terminationGracePeriodSeconds expires. If a PostStart or PreStop hook fails, it kills the Container.

The execution time of the preStop hook would depend on a few factors, such as the number of messages stored on the broker and the rate at which they can be drained, and cannot easily be estimated.

The value of terminationGracePeriodSeconds should be set sufficiently high to cope with a reasonable worst case scenario.
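A minimal sketch of how this could look on the broker Pod, assuming a hypothetical drain script baked into the broker image (the script path, image, and timeout values below are illustrative only):

apiVersion: v1
kind: Pod
metadata:
  name: broker-0                          # illustrative name
spec:
  terminationGracePeriodSeconds: 300      # must cover a reasonable worst-case drain time
  containers:
  - name: broker
    image: example/artemis-broker:latest  # placeholder image
    lifecycle:
      preStop:
        exec:
          command: ["/opt/broker/bin/drain.sh"]   # hypothetical script that reroutes or waits out the messages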

lulf commented 5 years ago

I would be tempted to solve this in the standard-controller/broker-operator in a controlled fashion rather than with preStop hooks. The problem with preStop hooks is that they cannot be reliably re-run in the event of failure or temporary unavailability. Another problem is that for queues that are not sharded, there is no other broker to take ownership of the messages, so sending them back to the router won't work.

Within standard-controller/broker-operator the procedure would be something like this:

Setting the enmasse.io/evict annotation triggers the standard-controller to use the same mechanism as for changing the plan of an address:

  • Start up new broker and assign addresses to it
  • Close down incoming links to broker to be evicted
  • Keep the broker to be evicted until it has been drained by messaging clients
  • Delete the old broker.

Eviction is complete when the statefulset/pod is removed.

In the same way as for plan changes, this will only work for queues, but it should be reliable. It would also not be difficult to implement, as the mechanism is already there.
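For illustration only, the trigger could look something like the following on the broker StatefulSet's metadata (the StatefulSet name is made up, and the annotation is the proposed one, not an existing API):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: broker-1234                 # illustrative broker StatefulSet name
  annotations:
    enmasse.io/evict: "true"        # proposed annotation watched by the standard-controller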

What do you think, @nathan-vp? Would this solve your use case?

nathan-vp commented 5 years ago
  • Start up new broker and assign addresses to it
  • Close down incoming links to broker to be evicted
  • Keep the broker to be evicted until it has been drained by messaging clients
  • Delete the old broker.

I agree that this mechanism is better, as it also works for the brokered AddressSpace.

However, I see a problem with the workflow that you propose (annotating the Statefulset manually before taking a node out of commission):

Pods are not always evicted by an administrator. For example, in the case of resource starvation the Kubelet will do this according to its eviction policy. In these cases the scheduler will not set the enmasse.io/evict annotation on the StatefulSet for us.

The manual annotation step also feels a bit convoluted and "un-native" to me.

First random idea that springs to my mind:

  • On a shutdown signal, the broker Pod blocks (waits with shutting down) until there are no more messages on its addresses, up to a maximum of terminationGracePeriodSeconds
  • The standard-controller/broker-operator watches the Node objects to see whether they are unschedulable via .spec.unschedulable: True
  • If a Node is marked unschedulable and contains a Broker ReplicaSet, start the eviction procedure as proposed by @lulf
  • From the moment the broker addresses are fully drained, the Pod shuts down

What do you think?
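For reference, .spec.unschedulable is the standard Kubernetes Node field that gets set when a node is cordoned (for example via kubectl cordon); a minimal illustration:

apiVersion: v1
kind: Node
metadata:
  name: worker-1          # example node name
spec:
  unschedulable: true     # set while the node is cordoned/being drained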

lulf commented 5 years ago

@nathan-vp

Pods are not always evicted by an administrator. For example, in the case of resource starvation the Kubelet will do this according to its eviction policy. In these cases the scheduler will not set the enmasse.io/evict annotation on the StatefulSet for us.

Ah yes, I did not consider this case, as I was mainly thinking about the failed-node case, where it would not matter whether the controller was able to 'react', since the node would be gone anyway.

The resource starvation case is a bit interesting, because you could also 'solve' this via pod priorities and affinity, which we should also add better support for.

In general I agree it is not very cloud-native :) But it could still be a nice 'escape hatch' if no other mechanism is available, for instance due to insufficient privileges.

First random idea that springs to my mind:

  • On a shutdown signal, the broker Pod blocks (waits with shutting down) until there are no more messages on its addresses, up to a maximum of terminationGracePeriodSeconds
  • The standard-controller/broker-operator watches the Node objects to see whether they are unschedulable via .spec.unschedulable: True
  • If a Node is marked unschedulable and contains a Broker ReplicaSet, start the eviction procedure as proposed by @lulf
  • From the moment the broker addresses are fully drained, the Pod shuts down

I think the 'blocking' behavior should be configurable on the infraconfig like this:

spec.broker.shutdownPolicy: {
  waitUntilDrained: true // default false?
  shutdownGracePeriodSeconds: 30
}
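As a sketch only, on a StandardInfraConfig this might look as follows (the shutdownPolicy field is purely a proposal and does not exist; the API version is assumed to match the existing infra config resources):

apiVersion: admin.enmasse.io/v1beta1
kind: StandardInfraConfig
metadata:
  name: default
spec:
  broker:
    shutdownPolicy:                     # proposed field, not an existing API
      waitUntilDrained: true            # default false?
      shutdownGracePeriodSeconds: 30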

On the watching of node objects, I think this must be an opt-in feature, because certain cluster configurations might not want to allow access to node objects. It could perhaps also be a configuration option in the infra config, under the admin section:

spec.admin.enableAutoEvict: true // default false, requires a ClusterRole + ClusterRoleBinding to be created by the administrator for the address-space-admin serviceaccount in order to work
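A rough sketch of the ClusterRole and ClusterRoleBinding the administrator would have to create for this (the resource names and the service account namespace are illustrative assumptions):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: enmasse-node-watcher            # illustrative name
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: enmasse-node-watcher
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: enmasse-node-watcher
subjects:
- kind: ServiceAccount
  name: address-space-admin             # service account mentioned above
  namespace: enmasse-infra              # assumed infra namespace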

It could be there are better ways to expose this, I'm just writing down my initial thoughts on this.