
Bitnami Helm Charts
https://bitnami.com

[bitnami/rabbitmq-cluster-operator] Not able to build a fully running cluster #29989

Open pavel-spacil opened 1 month ago

pavel-spacil commented 1 month ago

Name and Version

bitnami/rabbitmq:4.0.2-debian-12-r0

What architecture are you using?

amd64

What steps will reproduce the bug?

  1. Have an AWS EKS cluster (tried 1.29 as well as 1.30)
  2. Deploy the bitnami/rabbitmq-cluster-operator Helm chart, version 4.3.24
  3. Define a StorageClass, in my case an EBS-backed one
  4. Deploy the production-ready example from here, referencing that StorageClass

What is the expected behavior?

The 3-node cluster gets up and running just fine, no isolated node.

What do you see instead?

The nodes don't join together. In most deployments I end up with a 2+1 configuration (two isolated clusters: one with two nodes, one with a single node), sometimes even 1+1+1, but never a single 3-node cluster...

Additional information

I've also lowered the DNS cache to 5 seconds, as advised in the docs, but still no success. One startup flow as an example:

Tried the vanilla rabbitmq:4.0.2 Docker image from Docker Hub and it works just fine in all cases.

pavel-spacil commented 1 month ago

rabbitmq-logs.zip

Attaching debug logs from rabbitmq nodes where 2+1 got deployed...

juan131 commented 1 month ago

Hi @pavel-spacil

Could you please provide the chart parameters (set via values.yaml or the --set flag) you used to install your Helm release? Please provide only the parameters you customized, omitting those that keep their default values.

Note: you can easily obtain the above parameters using `helm get values RELEASE_NAME`

pavel-spacil commented 1 month ago

Hello @juan131, the operator itself is deployed using these values:

```yaml
fullnameOverride: rabbitmq
clusterOperator:
  watchAllNamespaces: false
  watchNamespaces:
  - rabbitmq
msgTopologyOperator:
  watchAllNamespaces: false
  watchNamespaces:
  - rabbitmq
```

I still think this is not a problem with the operator itself but rather with Bitnami's RabbitMQ image, as I can deploy the RabbitMQ cluster just fine using the vanilla rabbitmq Docker image instead of the Bitnami one.
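One way to cross-check that theory is to let the operator run the upstream image directly via the RabbitmqCluster `spec.image` field; a minimal sketch, assuming the metadata name, replica count, and image tag (these are illustrative, not taken from the thread):

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
spec:
  replicas: 3
  # Upstream image instead of the Bitnami one
  image: rabbitmq:4.0.2-management
```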

juan131 commented 1 month ago

Hi @pavel-spacil

We could try a few things to debug what's going on:

1. Enable verbose logging of the Bitnami init scripts by setting the BITNAMI_DEBUG environment variable:

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
(...)
spec:
  (...)
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
            - name: rabbitmq
              env:
              - name: BITNAMI_DEBUG
                value: "true"
```

2. Bypass the Bitnami entrypoint logic entirely by overriding the container command so it runs rabbitmq-server directly:

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
(...)
spec:
  (...)
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
            - name: rabbitmq
              command: ["rabbitmq-server"]
```
pavel-spacil commented 1 month ago

Attaching a new rabbitmq-debug-logs.zip

This time node 2 started first and node 1 second; based on the logs, node 1 tried to cluster with node 2 but failed to connect. Judging by the timestamps, this happened just after node 2 shut down following the initial configuration done by the init script. Node 0 started last and did not join any other node...

Overriding the entrypoint will result in a fully connected cluster in every try...
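The race described above (a join attempt hitting a peer that is mid-restart, after which each side settles into its own cluster) can be illustrated with a toy model. This is illustrative pseudologic only, not operator or RabbitMQ code; the node names and reachability sets are hypothetical:

```python
# Toy model of peer discovery at boot: each pod joins the first existing
# cluster containing a peer it could reach when it started; if it could
# reach none (e.g. the seed was restarting after init), it forms its own.
def form_clusters(boot_order, reachable_at_boot):
    """boot_order: node names in start order.
    reachable_at_boot: node -> set of peers reachable when that node booted."""
    clusters = []  # list of sets, one per formed cluster
    for node in boot_order:
        for cluster in clusters:
            if cluster & reachable_at_boot[node]:
                cluster.add(node)  # successful join
                break
        else:
            clusters.append({node})  # no reachable peer: new cluster

    return clusters

# Hypothetical sequence: node 2 boots alone; node 1 boots while node 2 is
# restarting and reaches nobody; node 0 boots last and reaches node 2.
split = form_clusters(
    ["rabbitmq-server-2", "rabbitmq-server-1", "rabbitmq-server-0"],
    {
        "rabbitmq-server-2": set(),
        "rabbitmq-server-1": set(),
        "rabbitmq-server-0": {"rabbitmq-server-2"},
    },
)
print([sorted(c) for c in split])  # a 2+1 split instead of one 3-node cluster
```

A single failed join at boot is enough to produce the 2+1 shape; whether a node retries or gives up and forms its own cluster is exactly where the entrypoint behavior matters.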

juan131 commented 3 weeks ago

Thanks so much @pavel-spacil

> Overriding the entrypoint will result in a fully connected cluster in every try...

I think the issue could be related to the default config file name: the Bitnami image expects it to be named rabbitmq.conf, while the operator uses .rabbitmqadmin.conf, so the Bitnami init logic is doing some overwrites it shouldn't be doing.

Does it work if you apply the patch below?

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
(...)
spec:
  (...)
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
            - name: rabbitmq
              env:
              - name: RABBITMQ_CONF_FILE
                value: "/var/lib/rabbitmq/.rabbitmqadmin.conf"
```
pavel-spacil commented 3 weeks ago

So, I gave it a few tries and the success rate is higher now, but not 100%; I still got a cluster that wasn't fully connected on some tries. Attaching a new rabbitmq.zip with debug logs from a successful try; the files with the -pre suffix are from before the container restart.

github-actions[bot] commented 1 week ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

chrrlesWork commented 4 days ago

Has there been any progress on this issue?