pavel-spacil opened this issue 1 month ago
Attaching debug logs from the RabbitMQ nodes where the 2+1 split got deployed...
Hi @pavel-spacil
Could you please provide the chart parameters (set via `values.yaml` or the `--set` flag) you used to install your Helm release? Please provide only the parameters you customized, omitting the ones with default values.
Note: you can easily obtain these parameters using `helm get values RELEASE_NAME`.
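For example, assuming the release and namespace are both named `rabbitmq`:

```console
$ helm get values rabbitmq --namespace rabbitmq
```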
Hello @juan131, the operator itself is deployed using these values:
```yaml
fullnameOverride: rabbitmq
clusterOperator:
  watchAllNamespaces: false
  watchNamespaces:
    - rabbitmq
msgTopologyOperator:
  watchAllNamespaces: false
  watchNamespaces:
    - rabbitmq
```
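For completeness, the release itself was installed roughly like this (the chart reference and values file name are illustrative):

```console
$ helm install rabbitmq bitnami/rabbitmq-cluster-operator \
    --namespace rabbitmq --create-namespace \
    --values values.yaml
```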
I still think this is not a problem with the operator itself but rather with Bitnami's build of RabbitMQ, as I can deploy the RabbitMQ cluster just fine using the vanilla rabbitmq Docker image instead of the Bitnami one (see the override sketch below).
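For reference, pointing the operator at the upstream image only takes the `image` field on the custom resource; a minimal sketch, assuming the default resource name and the image tag mentioned in the report below:

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
spec:
  replicas: 3
  # upstream image from Docker Hub instead of the Bitnami build
  image: rabbitmq:4.0.2
```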
Hi @pavel-spacil
We could try a few things to debug what's going on:
1. Edit the RabbitmqCluster resource to add the BITNAMI_DEBUG env variable to the RabbitMQ pods. This will increase log verbosity; the resulting logs can be collected as sketched after this list:

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
(...)
spec:
  (...)
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                env:
                  - name: BITNAMI_DEBUG
                    value: "true"
```
2. Override the container command so the broker is started with rabbitmq-server directly:

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
(...)
spec:
  (...)
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                command: ["rabbitmq-server"]
```
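Either way, the debug output can then be collected per pod with something like the following (assuming the operator's usual `<cluster-name>-server-N` pod naming):

```console
$ kubectl logs rabbitmq-server-0 -c rabbitmq --namespace rabbitmq > rabbitmq-server-0.log
```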
Attaching a new rabbitmq-debug-logs.zip.
This time node 2 started first and node 1 second; based on the logs, node 1 tried to cluster with node 2 but failed to connect. Based on the timestamps, this was just after node 2 shut down following the initial configuration done by the init script. Node 0 started last and did not join any other node...
Overriding the entrypoint results in a fully connected cluster on every try...
Thanks so much @pavel-spacil
> Overriding the entrypoint results in a fully connected cluster on every try...
I think the issue could be related to the default config file name. The Bitnami image expects it to be named `rabbitmq.conf`, while the operator is using `.rabbitmqadmin.conf`, so the Bitnami init logic is doing some overwrites it shouldn't be doing.
Does it work if you apply the patch below?
```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
(...)
spec:
  (...)
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                env:
                  - name: RABBITMQ_CONF_FILE
                    value: "/var/lib/rabbitmq/.rabbitmqadmin.conf"
```
So, I gave it a few tries and the success rate is higher now, but not 100%: I still got a non-fully-connected cluster on some tries. Attaching a new rabbitmq.zip with debug logs from a successful try; the ones with the `-pre` suffix are from before the container restart.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Has there been any progress on this issue?
Name and Version
bitnami/rabbitmq:4.0.2-debian-12-r0
What architecture are you using?
amd64
What steps will reproduce the bug?
1. Install the rabbitmq-cluster-operator Helm chart by Bitnami in version 4.3.24.
2. Have a StorageClass available, in my case an EBS one.
3. Deploy the production-ready example from here, referencing that StorageClass (roughly as sketched below).
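A minimal sketch of the resource I deploy, assuming the `persistence` fields from the RabbitmqCluster CRD and an illustrative StorageClass name:

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
spec:
  replicas: 3
  persistence:
    storageClassName: ebs-sc   # my EBS-backed StorageClass (name illustrative)
    storage: 20Gi
```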
What is the expected behavior?
The 3-node cluster gets up and running just fine, no isolated node.
What do you see instead?
The nodes don't join together; in most deployments I've got a 2+1 configuration (= 2 isolated clusters, one with 2 nodes and one with just 1), but sometimes also 1+1+1, never just 3...
Additional information
I've also lowered the DNS cache to 5 seconds, as advised in the doc, but still no success.
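For context, the cache was lowered in the CoreDNS Corefile; a sketch, with the rest of the server block elided:

```
.:53 {
    ...
    cache 5   # lowered from the default 30 seconds
    ...
}
```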
One startup flow as an example: the `-0` node gets started first based on the logs, the `-1` as second, and it did not join the `-0`. The `-2` hangs in `Starting RabbitMQ in background...` and, after a timeout/restart, it starts and joins the `-0`...

Tried the vanilla `rabbitmq:4.0.2` Docker image from Docker Hub and it works just fine in all cases.