We had something wrong in our inventory causing the URP health check to fail after a broker restart. Even if the task itself fails, there seems to be something wrong in the logic in the following tasks - allowing the play to continue. Luckily we observed the pipeline when running, so we managed to cancel the play. If not, I would suspect this bug to serially bring down the whole cluster.
Introduce some bug in the inventory causing the URP health check to fail. Observe that the play continues with the subsequent broker host in the cluster. The task Fail Provisioning is skipped.
Expected behaviour
The play should fail on the first broker not passing the URP health check by running the Fail Provisioning task.
Inventory File
Can provide details on request, but I expect it to be less relevant for this bug. Cluster: 3 zookeepers and 3 brokers running on dedicated hosts, serial deployment strategy.
Environment (please complete the following information):
I suspect the issue to be related to this logical expression. Why are we ignoring/delaying the play to error out? Would it be possible to use a handler to error out instead - to avoid duplicating the code?
Describe the issue
We had something wrong in our inventory causing the URP health check to fail after a broker restart. Even if the task itself fails, there seems to be something wrong in the logic in the following tasks - allowing the play to continue. Luckily we observed the pipeline when running, so we managed to cancel the play. If not, I would suspect this bug to serially bring down the whole cluster.
This is the log for the incident:
To Reproduce
Introduce some bug in the inventory causing the URP health check to fail. Observe that the play continues with the subsequent broker host in the cluster. The task
Fail Provisioning
is skipped.Expected behaviour
The play should fail on the first broker not passing the URP health check by running the
Fail Provisioning
task.Inventory File
Can provide details on request, but I expect it to be less relevant for this bug. Cluster: 3 zookeepers and 3 brokers running on dedicated hosts, serial deployment strategy.
Environment (please complete the following information):
Additional context
I suspect the issue to be related to this logical expression. Why are we ignoring/delaying the play to error out? Would it be possible to use a handler to error out instead - to avoid duplicating the code?
CC: @nsharma-git @domenicbove