Open nathan-vp opened 5 years ago
I would be tempted to solve this in the standard-controller/broker-operator in a controlled fashion rather than with preStop hooks. The problem with preStop hooks is that they cannot be reliably re-run in the event of failure or temporary unavailability. Another problem is that for queues that are not sharded, there is no broker to take ownership of the messages, so sending them back to the router won't work.
Within standard-controller/broker-operator the procedure would be something like this:
Setting the enmasse.io/evict annotation triggers the standard-controller to use the same mechanism as for changing the plan of an address:
- Start up a new broker and assign addresses to it
- Close down incoming links to the broker to be evicted
- Keep the broker to be evicted until it has been drained by messaging clients
- Delete the old broker.
Eviction is complete when the statefulset/pod is removed.
In the same way as for plan changes, this will only work for queues, but it should be reliable. It would also not be difficult to implement, as the mechanism is already there.
What do you think @nathan-vp? Would this solve your use case?
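Purely as an illustration (this is not the actual standard-controller code, and all names are invented), the eviction steps above could be sketched over in-memory state like this:

```python
# Illustrative sketch of the proposed eviction procedure; the data model and
# function names are hypothetical, not EnMasse internals.

def evict_broker(cluster, old):
    # 1. Start up a new broker and assign the evicted broker's addresses to it.
    new = "broker-new"
    cluster["brokers"][new] = {
        "addresses": set(cluster["brokers"][old]["addresses"]),
        "links_open": True,
        "messages": 0,
    }
    # 2. Close incoming links to the broker being evicted, so no new messages arrive.
    cluster["brokers"][old]["links_open"] = False
    # 3. Keep the old broker around until messaging clients have drained it.
    while cluster["brokers"][old]["messages"] > 0:
        cluster["brokers"][old]["messages"] -= 1  # stand-in for client consumption
    # 4. Delete the old broker.
    del cluster["brokers"][old]
    return new

cluster = {"brokers": {"broker-0": {"addresses": {"q1", "q2"},
                                    "links_open": True, "messages": 3}}}
new = evict_broker(cluster, "broker-0")
assert "broker-0" not in cluster["brokers"]
assert cluster["brokers"][new]["addresses"] == {"q1", "q2"}
```

The point of the ordering is that step 2 must happen before step 3: only once the broker stops accepting new messages can "drained" become a reachable terminal state.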
I agree that this mechanism is better, as it also works for the brokered AddressSpace.
However, I see a problem with the workflow that you propose (annotating the Statefulset manually before taking a node out of commission):
Pods are not always evicted by an administrator. E.g. in the case of resource starvation, the Kubelet will do this according to the eviction policy. On these occasions the scheduler will not set the enmasse.io/evict annotation on the Statefulset for us.
The manual annotation step also feels a bit convoluted and "un-native" to me.
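For concreteness, the manual trigger being discussed would presumably look something like this on the broker StatefulSet (the resource name here is illustrative, not an actual EnMasse name):

```yaml
# Hypothetical: the enmasse.io/evict annotation applied to a broker StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: broker-0              # illustrative name
  annotations:
    enmasse.io/evict: "true"
```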
First random idea that springs to my mind:
- On a shutdown signal, the broker Pod blocks (waits with shutting down) until there are no more messages on its addresses, up to a maximum of terminationGracePeriodSeconds
- The standard-controller/broker-operator watches the Node objects to see whether they are unschedulable via .spec.unschedulable: True
- If a Node is marked unschedulable and contains a Broker ReplicaSet, start the eviction procedure as proposed by @lulf
- From the moment the broker addresses are fully drained, the Pod shuts down
What do you think?
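The blocking-shutdown part of this idea could be sketched like so. Everything here is illustrative: `queue_depth` stands in for some hypothetical way of asking the broker how many messages remain, and the grace-period bound mirrors terminationGracePeriodSeconds.

```python
import time

# Illustrative sketch of the proposed blocking shutdown: wait until the broker's
# addresses hold no more messages, but never longer than the grace period.
# `queue_depth` is a hypothetical callable returning the remaining message count.

def wait_until_drained(queue_depth, grace_period_seconds, poll_interval=0.01):
    deadline = time.monotonic() + grace_period_seconds
    while time.monotonic() < deadline:
        if queue_depth() == 0:
            return True   # fully drained: safe to shut down
        time.sleep(poll_interval)
    return False          # grace period exhausted: shut down anyway

# Simulated broker whose messages are consumed as clients read them.
remaining = [3]
def queue_depth():
    if remaining[0] > 0:
        remaining[0] -= 1
    return remaining[0]

assert wait_until_drained(queue_depth, grace_period_seconds=1.0) is True
assert wait_until_drained(lambda: 5, grace_period_seconds=0.05) is False
```

The second assertion shows the important failure mode: if clients never drain the addresses, the Pod still terminates once the grace period runs out, so the period must be sized for a realistic worst case.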
@nathan-vp
Pods are not always evicted by an administrator. E.g. in the case of resource starvation, the Kubelet will do this according to the eviction policy. In these occasions the scheduler will not set the enmasse.io/evict annotation on the Statefulset for us.
Ah yes, I did not consider this case, as I was mainly thinking about the failed-node case, where it would not matter whether the controller was able to 'react', since the node would be gone anyway.
The resource starvation case is a bit interesting, because you could also 'solve' this via pod priorities and affinity, which we should also add better support for.
In general I agree it is not very cloud-native :) But it could anyway be a nice 'escape-hatch' if no other mechanism is available due to insufficient privileges for instance.
First random idea that springs to my mind:
- On a shutdown signal, the broker Pod blocks (waits with shutting down) until there are no more messages on its addresses, up to a maximum of terminationGracePeriodSeconds
- The standard-controller/broker-operator watches the Node objects to see whether they are unschedulable via .spec.unschedulable: True
- If a Node is marked unschedulable and contains a Broker ReplicaSet, start the eviction procedure as proposed by @lulf
- From the moment the broker addresses are fully drained, the Pod shuts down
I think the 'blocking' behavior should be configurable on the infraconfig like this:
spec.broker.shutdownPolicy: {
  waitUntilDrained: true            // default false?
  shutdownGracePeriodSeconds: 30
}
On the watching of node objects, I think this must be an opt-in feature, because access to node objects might be something certain cluster configurations do not want to allow. It could also be a configuration option in the infra config under the admin section perhaps:
spec.admin.enableAutoEvict: true // default false, requires a ClusterRole + ClusterRoleBinding to be created by the administrator for the address-space-admin serviceaccount in order to work
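The ClusterRole + ClusterRoleBinding the administrator would have to create might look roughly like the following. All names here are illustrative assumptions; only read/watch access to Node objects is granted, since that is all the watching described above requires.

```yaml
# Hypothetical RBAC for letting the address-space-admin serviceaccount watch Nodes.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: enmasse-node-watcher      # illustrative name
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: enmasse-node-watcher      # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: enmasse-node-watcher
subjects:
- kind: ServiceAccount
  name: address-space-admin
  namespace: enmasse-infra        # illustrative namespace
```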
It could be there are better ways to expose this, I'm just writing down my initial thoughts on this.
Overview
In certain situations (for example, when its node fails or is taken out of commission) a broker will be temporarily unavailable. The messages that were stored on that broker will be unavailable until its replacement Pod spins up.
For the standard address space, it would be useful to have the possibility to "drain" a broker. The messages that were on this broker should be rerouted (maybe by sending them back to the Qpid router?) to the other brokers in the Deployment, so all data remains available.
Design
The cleanest way would be to implement a preStop container lifecycle hook.
A possible gotcha is that the execution time of the preStop hook is limited by terminationGracePeriodSeconds. The execution time of the hook depends on a few factors and cannot easily be estimated, so the value of terminationGracePeriodSeconds should be set sufficiently high to cope with a reasonable worst-case scenario.
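A preStop-based design would hang off the broker pod spec roughly as follows. This is a sketch: the drain script path is a placeholder, not an actual broker CLI, and the grace period value is only an example of "sufficiently high".

```yaml
# Hypothetical pod-spec fragment: a preStop hook that drains the broker,
# bounded by terminationGracePeriodSeconds.
spec:
  terminationGracePeriodSeconds: 300   # example value; must cover a worst-case drain
  containers:
  - name: broker
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "/opt/broker/bin/drain.sh"]  # placeholder script
```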