Closed pcm32 closed 1 year ago
Adding more memory doesn't seem to have an impact. I'm trying to set the BITNAMI_DEBUG env var, but modifications via kubectl edit statefulset/<> for rabbit don't seem to stick. Modifications for memory worked on the Rabbitmq cluster object, but there are no env vars to set there to activate the BITNAMI_DEBUG part.
For which container are you seeing the OOM killed message? If it were an OOM kill of the rabbitmq container itself, the log would have ended very abruptly, with no time to print any error messages.
What if you increase verbosity in rabbit?
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
...
spec:
  replicas: 1
  rabbitmq:
    additionalConfig: |
      log.console.level = debug
And also try passing this to the operator through the helm chart?
rabbitmq-cluster-operator:
  extraEnvVars:
    - name: LOG_LEVEL
      value: debug
I am using the default memory setting on the clusters I have running. But I wonder if the OOMKilled is a red herring.
/opt/bitnami/scripts/libos.sh: line 336: 53 Killed "$@" > /dev/null 2>&1
This has me wondering if 53 is the root error (EBADR 53, Invalid request descriptor). Line 336 of libos.sh is the line that redirects output to /dev/null, and if output is being redirected to /dev/null, what is it doing in the log?
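One detail worth noting about that log line: the "Killed" notice is printed by the parent bash process itself (to its own stderr) when it reaps a child terminated by a signal, not by the killed command, so the `> /dev/null 2>&1` redirection cannot suppress it. A minimal sketch reproducing this on plain Linux bash (the `sleep` here is a stand-in for the redirected command on line 336 of libos.sh; `pkill` is from procps):

```shell
# The inner bash runs a foreground child with all output discarded,
# much like libos.sh does with: "$@" > /dev/null 2>&1
err=$(
  {
    bash -c 'sleep 30 > /dev/null 2>&1; true' &
    shell_pid=$!
    sleep 1                               # give the child time to start
    pkill -KILL -P "$shell_pid" sleep     # SIGKILL the child, as the OOM killer would
    wait "$shell_pid"
  } 2>&1
)
# Despite the redirection, the parent shell's own diagnostic
# ("... <pid> Killed sleep 30 > /dev/null 2>&1") was written to stderr:
echo "$err"
```

So a "Killed" line in the container log only tells you the wrapped process received SIGKILL, not where the signal came from.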
Thanks guys! Yes, I suspect as well that the OOMKilled is really something else, and that k8s is getting dizzy with the signals.
I was looking for places to increase logging verbosity; great that you spotted them @nuwang. Unfortunately, I applied those, but there are no additional logs in the rabbitmq container.
I was also trying to inject the BITNAMI_DEBUG env var into the rabbitmq container. I tried adding it to the extraEnvVars of rabbitmq-cluster-operator, but it doesn't seem to trickle down to that container :-(.
For which container are you seeing the OOM killed message?
Yes, it is on the main rabbitmq container. And agreed, the log interruption would have been more abrupt.
I think that part of my confusion came from the fact that I thought we were using RabbitMQ's own operator, and not Bitnami's.
My bad; we are using RabbitMQ's own operator, just packaged by Bitnami in a Helm chart.
I also tried pointing the RabbitMQ operator at a different rabbitmq image in the values.yaml for the Galaxy Helm chart, like this, but it doesn't get picked up:
rabbitmq-cluster-operator:
  rabbitmqImage:
    repository: rabbitmq
    tag: 3
  extraEnvVars:
    - name: LOG_LEVEL
      value: debug
    - name: BITNAMI_DEBUG
      value: "true"
but it still remains like:
Containers:
  rabbitmq:
    Container ID:  containerd://34087c1446101c8f19f5421d49a7e5af6ea49fc4a05bdd6a5728f133050f5862
    Image:         docker.io/bitnami/rabbitmq:3.10.7-debian-11-r3
    Image ID:      docker.io/bitnami/rabbitmq@sha256:66991b35756345c9c8bfc0c38d0277c3950c446a0d7e49b09292c21a7cd24d9e
    Ports:         4369/TCP, 5672/TCP, 15672/TCP, 15692/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:         Running
      Started:     Tue, 15 Nov 2022 12:57:59 +0000
    Ready:         False
This is at the first level of values.yaml; should it be nested under another item, like dependencies or something? Thanks.
I guess it would require some additions here? https://github.com/galaxyproject/galaxy-helm/blob/6b071c9805bd3a65d75ff28aa562ff7405d74209/galaxy/templates/rabbitmqcluster.yaml#L10
@pcm32 Yes. You can try adding an override here, passing in the BITNAMI_DEBUG env var to the statefulset: https://github.com/galaxyproject/galaxy-helm/blob/6b071c9805bd3a65d75ff28aa562ff7405d74209/galaxy/templates/rabbitmqcluster.yaml#L11
Overrides are documented here: https://www.rabbitmq.com/kubernetes/operator/using-operator.html#override
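For reference, an override along these lines should inject the env var into the rabbitmq container of the operator-managed statefulset. This is a sketch based on the operator's documented override mechanism, not tested against the Galaxy chart's template:

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
...
spec:
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                env:
                  - name: BITNAMI_DEBUG
                    value: "true"
```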
This seems to be a problem specific to the combination of the Bitnami container (possibly related to the usage of /dev/null) and the Fedora CoreOS VM images used by that original cluster. Moving to an RKE2-based cluster (which uses Ubuntu Focal as the VM image) makes the problem go away.
I'm going to close this since I'm no longer using CoreOS for this, and hence no longer hitting the issue.
I have just tried spinning up with default settings besides this in a values.yaml file:
I see the following containers waiting and failing:
The rabbit container fails with:
it seems to be OOMKilled:
but those seem to be the default parameters for limits/requests. Are you running this with more memory in general?
This is kubernetes 1.23.5 running on Fedora CoreOS 35 nodes.
Chart version is: galaxy-5.3.1 App version: 22.05
The nodes seem to have plenty of spare memory: