Closed razvanphp closed 4 months ago
Hi!
In this kind of scenario it may be necessary to perform some manual intervention. Could you try running the chart with diagnosticMode.enabled=true and perform the initialization steps manually? You can run kubectl exec
to enter the container and then run this command:
/opt/bitnami/scripts/rabbitmq/entrypoint.sh /opt/bitnami/scripts/rabbitmq/run.sh
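The diagnostic-mode suggestion could look like this as a values sketch (the release name and exec command below are illustrative assumptions):

```yaml
# values.yaml -- sketch: diagnostic mode keeps the container running
# without starting RabbitMQ, so you can intervene manually
diagnosticMode:
  enabled: true
```

With diagnostic mode on, the pod idles instead of launching the broker, so something like `kubectl exec -it rabbitmq-0 -- bash` gives you a shell to run the entrypoint by hand.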
Well, I know how to fix it manually, but what I'm suggesting is to find a better way to handle this in the chart. As it stands, I would not trust deploying this to production; one cannot expect that the nodes will always shut down in a specific order.
root@truenas[~]# kubectl patch statefulset rabbitmq -n vampirebyte --type='json' -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
statefulset.apps/rabbitmq patched
root@truenas[~]# kubectl patch statefulset rabbitmq -n vampirebyte --type='json' -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}]'
statefulset.apps/rabbitmq patched
root@truenas[~]# kubectl get pods -n vampirebyte -l app.kubernetes.io/instance=rabbitmq
NAME READY STATUS RESTARTS AGE
rabbitmq-0 0/1 Running 217 (2m45s ago) 25h
root@truenas[~]# kubectl scale statefulset rabbitmq --replicas=3 -n vampirebyte
statefulset.apps/rabbitmq scaled
root@truenas[~]# kubectl get pods -n vampirebyte -l app.kubernetes.io/instance=rabbitmq
NAME READY STATUS RESTARTS AGE
rabbitmq-0 0/1 Running 217 (3m21s ago) 25h
root@truenas[~]#
root@truenas[~]#
root@truenas[~]# kubectl delete pod rabbitmq-0 -n vampirebyte
pod "rabbitmq-0" deleted
root@truenas[~]# kubectl get pods -n vampirebyte -l app.kubernetes.io/instance=rabbitmq
NAME READY STATUS RESTARTS AGE
rabbitmq-0 1/1 Running 0 83s
rabbitmq-1 1/1 Running 0 79s
rabbitmq-2 1/1 Running 0 75s
root@truenas[~]#
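For reference, the manual patches in the transcript above correspond roughly to these chart values (a sketch; key names follow the bitnami/rabbitmq values schema, so double-check them against your chart version):

```yaml
# values.yaml -- sketch: equivalent of removing the probes and scaling to 3
replicaCount: 3
livenessProbe:
  enabled: false
readinessProbe:
  enabled: false
```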
I think the main issue to solve is to make sure all 3 pods are started all the time, no matter if the probes fail; otherwise the cluster will never recover with just 1 node.
Hi!
We plan to change the podManagementPolicy to Parallel to avoid this kind of issue. In the meantime, you can set it in your values.yaml; we plan to make it the default. This was recommended by the upstream RabbitMQ team, you can see it here: https://github.com/bitnami/charts/issues/16081#issuecomment-2106462797
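Until the default changes, opting in is a one-line values override (a sketch; the top-level key is the one the bitnami/rabbitmq chart exposes):

```yaml
# values.yaml -- start all pods at once instead of one at a time
podManagementPolicy: Parallel
```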
@razvanphp RabbitMQ nodes do not expect any specific startup or shutdown sequence starting with 3.7.0. They do expect all peers to come online within 5 minutes by default.
Specifically for Kubernetes: with OrderedReady, Kubernetes and similar tools can run into a deployment deadlock, which has been documented in various ways for a while.
Using forceBoot is a completely unnecessary and dangerous way of "fixing" the problem. You are not fixing anything; you are using a specialized mechanism designed to be used when a portion of the cluster is permanently lost.
The easiest option by far is to use the Cluster Operator, which has also been around for a while and is maintained by the RabbitMQ core team. When that's not possible, using rabbitmq-diagnostics ping for the readiness probe and podManagementPolicy: "Parallel" for the stateful set should be enough. That's what the Cluster Operator does, specifically the latter part.
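A values sketch that follows this advice, assuming the chart's customReadinessProbe override (probe timings are illustrative, not a recommendation):

```yaml
# values.yaml -- sketch: Parallel startup plus a ping-based readiness probe
podManagementPolicy: Parallel
readinessProbe:
  enabled: false   # disable the default probe so the custom one applies
customReadinessProbe:
  exec:
    command:
      - /bin/bash
      - -ec
      - rabbitmq-diagnostics -q ping
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 20
```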
@javsalgar given that this question keeps coming up and, one way or another, the (completely wrong, as stated many times earlier) recommendation of using forceBoot: true keeps resurfacing, I guess https://github.com/bitnami/charts/issues/16081#issuecomment-2106462797 should be a top priority for the RabbitMQ chart.
RabbitMQ can log an extra message when it runs out of attempts to contact cluster peers, but we can tell from experience that virtually no one reads logs until told to do so explicitly.
And since the core team does not have much influence over the "DIY" (Operator-less) installations on Kubernetes, this long understood and solved problem keeps popping up.
I just want to mention that the error logs are not displayed by default; one must also set
image:
  debug: true
to see what actually happens (the Mnesia tables error).
I would suggest we go back to basics and make things easy again: remove forceBoot completely, even from suggestions, and align with what @michaelklishin is saying.
@javsalgar here's a PR to get the ball rolling: https://github.com/bitnami/charts/pull/25873. Hopefully it will stop the bleeding (this kind of question) and direct folks towards understanding what's going on with their deployments and what the two options they have are :)
As for whether forceBoot should be removed, I don't have an opinion. With the right defaults and documentation in place, it won't be used much (by new deployments, anyway).
If @javsalgar and his team decide to remove forceBoot, I'm not going to complain, because the options of running rabbitmqctl force_boot or setting the env variable would both still be around.
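For completeness, the chart-level escape hatch referred to here is a values flag (a sketch; only relevant when part of the cluster is permanently lost, per the warnings above, and the key name should be verified against your chart version):

```yaml
# values.yaml -- last resort only; see michaelklishin's warnings above
clustering:
  forceBoot: true
```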
@javsalgar is the logging behavior mentioned above intentional? RabbitMQ community Docker image does not suppress nodes by default, so I'm curious why this is the case. I'd personally always want more users to have easy access to RabbitMQ logs since that's the very first thing we ask for, both on GitHub and in response to commercial tickets.
Regarding logging, if I don't set image.debug: true, the logs stop here and never output anything:
rabbitmq 08:33:51.13 INFO ==> ** Starting RabbitMQ **
2024-05-13 08:33:56.761574+00:00 [notice] <0.44.0> Application syslog exited with reason: stopped
2024-05-13 08:33:56.769160+00:00 [notice] <0.235.0> Logging: switching to configured handler(s); following messages may not be visible in this log output
2024-05-13 08:33:56.770500+00:00 [notice] <0.235.0> Logging: configured log handlers are now ACTIVE
I thought it uses syslog, but RABBITMQ_LOGS=- means it's using stdout, so syslog should not be detected inside Docker, right?
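If the goal is more verbose broker-side logging regardless of the bash wrapper, one option is passing raw rabbitmq.conf directives through the chart (a sketch, assuming the chart's extraConfiguration passthrough):

```yaml
# values.yaml -- sketch: force console logging at debug level
extraConfiguration: |
  log.console = true
  log.console.level = debug
```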
Regarding forceBoot, I think it should be removed, especially because, besides the confusion, it does not solve the problem described in this issue. I've tried it and still had to manually intervene.
@javsalgar is the logging behavior mentioned above intentional? RabbitMQ community Docker image does not suppress nodes by default, so I'm curious why this is the case. I'd personally always want more users to have easy access to RabbitMQ logs since that's the very first thing we ask for, both on GitHub and in response to commercial tickets.
We show the application and error logs on stdout by default. The one that we suppress has to do with the bash initialization logic, to avoid adding unnecessary noise to the initialization logs unless it fails. However, it makes sense to revisit the logging of that specific part of the initialization to make it easier to spot any error.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
Name and Version
bitnami/rabbitmq 12.15.0
What architecture are you using?
amd64
What steps will reproduce the bug?
We run this chart on a TrueNAS server, deployed with FluxCD. With 3 pods, restart the k3s node and the cluster will not recover.
Are you using any custom parameters or values?
What is the expected behavior?
Cluster should be able to recover; seems that … does not help.
What do you see instead?
Cluster (of 3 nodes) is not able to recover after server shutdown.
Additional information
So the readiness and liveness probes fail with:
Checking the logs I see those, only one pod is up instead of 3: