StackStorm / stackstorm-k8s

K8s Helm chart that codifies a StackStorm (aka "IFTTT for Ops", https://stackstorm.com/) high-availability fleet as a simple-to-use, reproducible infrastructure-as-code app
https://helm.stackstorm.com/
Apache License 2.0

rabbitmq CrashLoopBackOff after moving to Helm3 #184

Closed manisha-tanwar closed 3 years ago

manisha-tanwar commented 3 years ago

Hi, I'm trying to migrate to Helm 3. Everything looks good except rabbitmq:

```
stackstorm-linux-gb5-rabbitmq-0   0/1   CrashLoopBackOff   3   3m16s
```

Here are the logs from the container; I'm not able to figure out what's wrong with it.

```
kubectl logs -f stackstorm-linux-gb5-rabbitmq-0
15:18:47.48
15:18:47.49 Welcome to the Bitnami rabbitmq container
15:18:47.49 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-rabbitmq
15:18:47.50 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-rabbitmq/issues
15:18:47.50
15:18:47.50 INFO  ==> Starting RabbitMQ setup
15:18:47.53 INFO  ==> Validating settings in RABBITMQ_* env vars..
15:18:47.57 INFO  ==> Initializing RabbitMQ...
15:18:47.67 INFO  ==> Persisted data detected. Restoring...
15:18:47.67 INFO  ==> RabbitMQ setup finished!
15:18:47.71 INFO  ==> Starting RabbitMQ

Configuring logger redirection
2021-02-19 15:18:59.722 [debug] <0.284.0> Lager installed handler error_logger_lager_h into error_logger
2021-02-19 15:18:59.732 [debug] <0.287.0> Lager installed handler lager_forwarder_backend into error_logger_lager_event
2021-02-19 15:18:59.732 [debug] <0.290.0> Lager installed handler lager_forwarder_backend into rabbit_log_lager_event
2021-02-19 15:18:59.732 [debug] <0.293.0> Lager installed handler lager_forwarder_backend into rabbit_log_channel_lager_event
2021-02-19 15:18:59.732 [debug] <0.296.0> Lager installed handler lager_forwarder_backend into rabbit_log_connection_lager_event
2021-02-19 15:18:59.732 [debug] <0.311.0> Lager installed handler lager_forwarder_backend into rabbit_log_prelaunch_lager_event
2021-02-19 15:18:59.732 [debug] <0.314.0> Lager installed handler lager_forwarder_backend into rabbit_log_queue_lager_event
2021-02-19 15:18:59.732 [debug] <0.323.0> Lager installed handler lager_forwarder_backend into rabbit_log_upgrade_lager_event
2021-02-19 15:18:59.732 [debug] <0.299.0> Lager installed handler lager_forwarder_backend into rabbit_log_feature_flags_lager_event
2021-02-19 15:18:59.732 [debug] <0.302.0> Lager installed handler lager_forwarder_backend into rabbit_log_federation_lager_event
2021-02-19 15:18:59.732 [debug] <0.305.0> Lager installed handler lager_forwarder_backend into rabbit_log_ldap_lager_event
2021-02-19 15:18:59.732 [debug] <0.308.0> Lager installed handler lager_forwarder_backend into rabbit_log_mirroring_lager_event
2021-02-19 15:18:59.732 [debug] <0.317.0> Lager installed handler lager_forwarder_backend into rabbit_log_ra_lager_event
2021-02-19 15:18:59.732 [debug] <0.320.0> Lager installed handler lager_forwarder_backend into rabbit_log_shovel_lager_event
2021-02-19 15:18:59.756 [info] <0.44.0> Application lager started on node 'rabbit@stackstorm-linux-gb5-rabbitmq-0.stackstorm-linux-gb5-rabbitmq-headless.stackstorm-test.svc.cluster.local'
2021-02-19 15:19:00.222 [debug] <0.280.0> Lager installed handler lager_backend_throttle into lager_event
2021-02-19 15:19:00.713 [info] <0.44.0> Application mnesia started on node 'rabbit@stackstorm-linux-gb5-rabbitmq-0.stackstorm-linux-gb5-rabbitmq-headless.stackstorm-test.svc.cluster.local'
2021-02-19 15:19:00.714 [info] <0.269.0> Starting RabbitMQ 3.8.9 on Erlang 22.3
 Copyright (c) 2007-2020 VMware, Inc. or its affiliates.
 Licensed under the MPL 2.0. Website: https://rabbitmq.com

  RabbitMQ 3.8.9

  ##########  Copyright (c) 2007-2020 VMware, Inc. or its affiliates.
  ##########  Licensed under the MPL 2.0. Website: https://rabbitmq.com

  Doc guides: https://rabbitmq.com/documentation.html
  Support:    https://rabbitmq.com/contact.html
  Tutorials:  https://rabbitmq.com/getstarted.html
  Monitoring: https://rabbitmq.com/monitoring.html

  Logs:

  Config file(s): /opt/bitnami/rabbitmq/etc/rabbitmq/rabbitmq.conf

  Starting broker...
2021-02-19 15:19:00.722 [info] <0.269.0>
 node           : rabbit@stackstorm-linux-gb5-rabbitmq-0.stackstorm-linux-gb5-rabbitmq-headless.stackstorm-test.svc.cluster.local
 home dir       : /opt/bitnami/rabbitmq/.rabbitmq
 config file(s) : /opt/bitnami/rabbitmq/etc/rabbitmq/rabbitmq.conf
 cookie hash    : OSQ7aWhTYVAQ+HDqyBFf4w==
 log(s)         :
 database dir   : /bitnami/rabbitmq/mnesia/rabbit@stackstorm-linux-gb5-rabbitmq-0.stackstorm-linux-gb5-rabbitmq-headless.stackstorm-test.svc.cluster.local
2021-02-19 15:19:07.025 [info] <0.269.0> Running boot step pre_boot defined by app rabbit
2021-02-19 15:19:07.025 [info] <0.269.0> Running boot step rabbit_core_metrics defined by app rabbit
2021-02-19 15:19:07.027 [info] <0.269.0> Running boot step rabbit_alarm defined by app rabbit
2021-02-19 15:19:07.042 [info] <0.350.0> Memory high watermark set to 6403 MiB (6714331955 bytes) of 16008 MiB (16785829888 bytes) total
2021-02-19 15:19:07.068 [info] <0.352.0> Enabling free disk space monitoring
2021-02-19 15:19:07.068 [info] <0.352.0> Disk free limit set to 50MB
2021-02-19 15:19:07.087 [info] <0.269.0> Running boot step code_server_cache defined by app rabbit
2021-02-19 15:19:07.088 [info] <0.269.0> Running boot step file_handle_cache defined by app rabbit
2021-02-19 15:19:07.089 [info] <0.355.0> Limiting to approx 1048479 file handles (943629 sockets)
2021-02-19 15:19:07.089 [info] <0.356.0> FHC read buffering: OFF
2021-02-19 15:19:07.089 [info] <0.356.0> FHC write buffering: ON
2021-02-19 15:19:07.090 [info] <0.269.0> Running boot step worker_pool defined by app rabbit
2021-02-19 15:19:07.090 [info] <0.342.0> Will use 12 processes for default worker pool
2021-02-19 15:19:07.091 [info] <0.342.0> Starting worker pool 'worker_pool' with 12 processes in it
2021-02-19 15:19:07.107 [info] <0.269.0> Running boot step database defined by app rabbit
2021-02-19 15:19:07.119 [info] <0.269.0> Node database directory at /bitnami/rabbitmq/mnesia/rabbit@stackstorm-linux-gb5-rabbitmq-0.stackstorm-linux-gb5-rabbitmq-headless.stackstorm-test.svc.cluster.local is empty. Assuming we need to join an existing cluster or initialise from scratch...
2021-02-19 15:19:07.119 [info] <0.269.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2021-02-19 15:19:07.143 [info] <0.269.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2021-02-19 15:19:07.143 [info] <0.269.0> Peer discovery backend does not support locking, falling back to randomized delay
2021-02-19 15:19:07.143 [info] <0.269.0> Peer discovery backend rabbit_peer_discovery_k8s supports registration.
2021-02-19 15:19:07.143 [info] <0.269.0> Will wait for 713 milliseconds before proceeding with registration...
2021-02-19 15:19:08.068 [error] <0.269.0> Failed to fetch a list of nodes from Kubernetes API: 403
2021-02-19 15:19:08.076 [error] <0.269.0> Peer discovery returned an error: "403". Will retry after a delay of 500 ms, 9 retries left...
2021-02-19 15:19:08.582 [error] <0.269.0> Failed to fetch a list of nodes from Kubernetes API: 403
2021-02-19 15:19:08.589 [error] <0.269.0> Peer discovery returned an error: "403". Will retry after a delay of 500 ms, 8 retries left...
2021-02-19 15:19:09.093 [error] <0.269.0> Failed to fetch a list of nodes from Kubernetes API: 403
2021-02-19 15:19:09.098 [error] <0.269.0> Peer discovery returned an error: "403". Will retry after a delay of 500 ms, 7 retries left...
2021-02-19 15:19:09.605 [error] <0.269.0> Failed to fetch a list of nodes from Kubernetes API: 403
2021-02-19 15:19:09.611 [error] <0.269.0> Peer discovery returned an error: "403". Will retry after a delay of 500 ms, 6 retries left...
2021-02-19 15:19:10.116 [error] <0.269.0> Failed to fetch a list of nodes from Kubernetes API: 403
2021-02-19 15:19:10.123 [error] <0.269.0> Peer discovery returned an error: "403". Will retry after a delay of 500 ms, 5 retries left...
2021-02-19 15:19:10.627 [error] <0.269.0> Failed to fetch a list of nodes from Kubernetes API: 403
2021-02-19 15:19:10.631 [error] <0.269.0> Peer discovery returned an error: "403". Will retry after a delay of 500 ms, 4 retries left...
2021-02-19 15:19:11.138 [error] <0.269.0> Failed to fetch a list of nodes from Kubernetes API: 403
2021-02-19 15:19:11.144 [error] <0.269.0> Peer discovery returned an error: "403". Will retry after a delay of 500 ms, 3 retries left...
2021-02-19 15:19:11.648 [error] <0.269.0> Failed to fetch a list of nodes from Kubernetes API: 403
2021-02-19 15:19:11.653 [error] <0.269.0> Peer discovery returned an error: "403". Will retry after a delay of 500 ms, 2 retries left...
2021-02-19 15:19:12.174 [error] <0.269.0> Failed to fetch a list of nodes from Kubernetes API: 403
2021-02-19 15:19:12.181 [error] <0.269.0> Peer discovery returned an error: "403". Will retry after a delay of 500 ms, 1 retries left...
2021-02-19 15:19:12.687 [error] <0.269.0> Failed to fetch a list of nodes from Kubernetes API: 403
2021-02-19 15:19:12.694 [error] <0.269.0> Peer discovery returned an error: "403". Will retry after a delay of 500 ms, 0 retries left...
2021-02-19 15:19:13.204 [info] <0.44.0> Application mnesia exited with reason: stopped
2021-02-19 15:19:13.204 [error] <0.269.0>
2021-02-19 15:19:13.204 [info] <0.44.0> Application mnesia exited with reason: stopped
2021-02-19 15:19:13.205 [error] <0.269.0> BOOT FAILED
2021-02-19 15:19:13.205 [error] <0.269.0> ===========
2021-02-19 15:19:13.205 [error] <0.269.0> Exception during startup:
2021-02-19 15:19:13.205 [error] <0.269.0>
2021-02-19 15:19:13.205 [error] <0.269.0> rabbit_boot_steps:run_boot_steps/1 line 20
2021-02-19 15:19:13.205 [error] <0.269.0> rabbit_boot_steps:'-run_boot_steps/1-lc$^0/1-0-'/1 line 19
2021-02-19 15:19:13.205 [error] <0.269.0> rabbit_boot_steps:run_step/2 line 46
2021-02-19 15:19:13.205 [error] <0.269.0> rabbit_boot_steps:'-run_step/2-lc$^0/1-0-'/2 line 41
2021-02-19 15:19:13.205 [error] <0.269.0> rabbit_mnesia:init/0 line 76
2021-02-19 15:19:13.206 [error] <0.269.0> rabbit_mnesia:init_with_lock/3 line 111
2021-02-19 15:19:13.206 [error] <0.269.0> rabbit_mnesia:run_peer_discovery_with_retries/2 line 145
2021-02-19 15:19:13.206 [error] <0.269.0> rabbit_mnesia:run_peer_discovery_with_retries/2 line 138
2021-02-19 15:19:13.206 [error] <0.269.0> error:{badmatch,ok}
2021-02-19 15:19:13.206 [error] <0.269.0>

BOOT FAILED
===========
Exception during startup:

rabbit_boot_steps:run_boot_steps/1 line 20
rabbit_boot_steps:'-run_boot_steps/1-lc$^0/1-0-'/1 line 19
rabbit_boot_steps:run_step/2 line 46
rabbit_boot_steps:'-run_step/2-lc$^0/1-0-'/2 line 41
rabbit_mnesia:init/0 line 76
rabbit_mnesia:init_with_lock/3 line 111
rabbit_mnesia:run_peer_discovery_with_retries/2 line 145
rabbit_mnesia:run_peer_discovery_with_retries/2 line 138
error:{badmatch,ok}

2021-02-19 15:19:14.208 [info] <0.268.0> [{initial_call,{application_master,init,['Argument1','Argument2','Argument3','Argument4']}},{pid,<0.268.0>},{registered_name,[]},{error_info,{exit,{{badmatch,ok},{rabbit,start,[normal,[]]}},[{application_master,init,4,[{file,"application_master.erl"},{line,138}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}},{ancestors,[<0.267.0>]},{message_queue_len,1},{messages,[{'EXIT',<0.269.0>,normal}]},{links,[<0.267.0>,<0.44.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,376},{stack_size,27},{reductions,367}], []
2021-02-19 15:19:14.209 [error] <0.268.0> CRASH REPORT Process <0.268.0> with 0 neighbours exited with reason: {{badmatch,ok},{rabbit,start,[normal,[]]}} in application_master:init/4 line 138
2021-02-19 15:19:14.209 [info] <0.44.0> Application rabbit exited with reason: {{badmatch,ok},{rabbit,start,[normal,[]]}}
2021-02-19 15:19:14.210 [info] <0.44.0> Application rabbit exited with reason: {{badmatch,ok},{rabbit,start,[normal,[]]}}
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{{badmatch,ok},{rabbit,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{{badmatch,ok},{rabbit,start,[normal,[]]}}})

Crash dump is being written to: /opt/bitnami/rabbitmq/var/log/rabbitmq/erl_crash.dump...done
```

manisha-tanwar commented 3 years ago

Events from the pod suggest it's backing off because of a failed readiness probe:

```
Events:
  Type     Reason     Age        From                            Message
  ----     ------     ----       ----                            -------
  Normal   Scheduled  <unknown>  default-scheduler               Successfully assigned stackstorm-test/stackstorm-linux-gb5-rabbitmq-0 to gb5-st-kubeworker-014
  Warning  Unhealthy  8m29s      kubelet, gb5-st-kubeworker-014  Readiness probe failed: Error: unable to perform an operation on node 'rabbit@stackstorm-linux-gb5-rabbitmq-0.stackstorm-linux-gb5-rabbitmq-headless.stackstorm-test.svc.cluster.local'. Please see diagnostics information and suggestions below.

Most common reasons for this are:

 * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
 * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
 * Target node is not running

In addition to the diagnostics info below:

 * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
 * Consult server logs on node rabbit@stackstorm-linux-gb5-rabbitmq-0.stackstorm-linux-gb5-rabbitmq-headless.stackstorm-test.svc.cluster.local
 * If target node is configured to use long node names, don't forget to use --longnames with CLI tools

DIAGNOSTICS
===========

attempted to contact: ['rabbit@stackstorm-linux-gb5-rabbitmq-0.stackstorm-linux-gb5-rabbitmq-headless.stackstorm-test.svc.cluster.local']

rabbit@stackstorm-linux-gb5-rabbitmq-0.stackstorm-linux-gb5-rabbitmq-headless.stackstorm-test.svc.cluster.local:
  * connected to epmd (port 4369) on stackstorm-linux-gb5-rabbitmq-0.stackstorm-linux-gb5-rabbitmq-headless.stackstorm-test.svc.cluster.local
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on stackstorm-linux-gb5-rabbitmq-0.stackstorm-linux-gb5-rabbitmq-headless.stackstorm-test.svc.cluster.local
  * suggestion: start the node

Current node details:
 * node name: 'rabbitmqcli-391-rabbit@stackstorm-linux-gb5-rabbitmq-0.stackstorm-linux-gb5-rabbitmq-headless.stackstorm-test.svc.cluster.local'
 * effective user's home directory: /opt/bitnami/rabbitmq/.rabbitmq
 * Erlang cookie hash: OSQ7aWhTYVAQ+HDqyBFf4w==

  Warning  Unhealthy  8m  kubelet, gb5-st-kubeworker-014  Readiness probe failed: Error:
RabbitMQ on node rabbit@stackstorm-linux-gb5-rabbitmq-0.stackstorm-linux-gb5-rabbitmq-headless.stackstorm-test.svc.cluster.local is not running or has not fully booted yet (check with is_booting)
```

The Services seem okay to me.

```
kubectl get svc | grep rabbitmq
stackstorm-linux-gb5-rabbitmq            ClusterIP   10.33.1.245   <none>   5672/TCP,4369/TCP,25672/TCP,15672/TCP   10m
stackstorm-linux-gb5-rabbitmq-headless   ClusterIP   None          <none>   4369/TCP,5672/TCP,25672/TCP,15672/TCP   10m
```

How can I fix this? Any ideas would be helpful. Thanks!

arm4b commented 3 years ago

I'm not sure whether the old and new RabbitMQ Helm charts are backward-compatible. Try cleaning the RabbitMQ volumes and starting again.
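If you go the clean-volume route, a rough sketch of the steps (the pod name and `stackstorm-test` namespace are taken from the output in this issue; the `data-...` PVC name follows the usual Bitnami StatefulSet naming convention and may differ in your cluster, so check `kubectl get pvc` first; deleting the PVC discards any persisted queues and messages):

```shell
# List PVCs to confirm the actual volume name backing the RabbitMQ pod
kubectl -n stackstorm-test get pvc | grep rabbitmq

# Delete the pod and its volume so RabbitMQ re-initializes its mnesia
# database from scratch on the next start (data loss: queued messages).
kubectl -n stackstorm-test delete pod stackstorm-linux-gb5-rabbitmq-0
kubectl -n stackstorm-test delete pvc data-stackstorm-linux-gb5-rabbitmq-0
```

The StatefulSet controller recreates the pod automatically, and the fresh pod provisions a new empty volume.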

The Helm 2 -> Helm 3 migration is pretty much a breaking change, because we had to replace all of the Helm chart dependencies (hence the v0.50.0 release).

arm4b commented 3 years ago

There's something interesting in your logs:

```
2021-02-19 15:19:07.119 [info] <0.269.0> Node database directory at /bitnami/rabbitmq/mnesia/rabbit@stackstorm-linux-gb5-rabbitmq-0.stackstorm-linux-gb5-rabbitmq-headless.stackstorm-test.svc.cluster.local is empty. Assuming we need to join an existing cluster or initialise from scratch...
2021-02-19 15:19:07.119 [info] <0.269.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2021-02-19 15:19:07.143 [info] <0.269.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2021-02-19 15:19:07.143 [info] <0.269.0> Peer discovery backend does not support locking, falling back to randomized delay
2021-02-19 15:19:07.143 [info] <0.269.0> Peer discovery backend rabbit_peer_discovery_k8s supports registration.
2021-02-19 15:19:07.143 [info] <0.269.0> Will wait for 713 milliseconds before proceeding with registration...
2021-02-19 15:19:08.068 [error] <0.269.0> Failed to fetch a list of nodes from Kubernetes API: 403
2021-02-19 15:19:08.076 [error] <0.269.0> Peer discovery returned an error: "403". Will retry after a delay of 500 ms, 9 retries left...
2021-02-19 15:19:08.582 [error] <0.269.0> Failed to fetch a list of nodes from Kubernetes API: 403
```

A Forbidden (403) response while trying to fetch the list of nodes from the K8s API. It could be something with the K8s security settings, though I'm not sure why RMQ is trying to access the K8s API.
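For context, the `rabbit_peer_discovery_k8s` backend named in the logs above discovers cluster peers by listing the endpoints of the headless Service through the Kubernetes API, so a 403 usually means the pod's ServiceAccount lacks RBAC permission on `endpoints` in its namespace. A minimal sketch of the kind of Role/RoleBinding that would grant it (the object names and the `rabbitmq-ha` ServiceAccount are illustrative; the `stackstorm-test` namespace is taken from the logs):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rabbitmq-endpoint-reader
  namespace: stackstorm-test
rules:
  # Peer discovery only needs to read the Service endpoints
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rabbitmq-endpoint-reader
  namespace: stackstorm-test
subjects:
  - kind: ServiceAccount
    name: rabbitmq-ha
    namespace: stackstorm-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rabbitmq-endpoint-reader
```

Helm charts that set `rbac.create: true` typically generate objects equivalent to these automatically.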

You can also ask at https://github.com/bitnami/charts/tree/master/bitnami/rabbitmq where the original RabbitMQ Helm chart is hosted.

manisha-tanwar commented 3 years ago

Hi @armab, thanks for investing your time here. I'm using v0.52.0 and deleted everything (the Helm 2 stuff), so it's essentially a fresh install now. I've also raised an issue in the Bitnami repo and am waiting for a response there.

manisha-tanwar commented 3 years ago

Got a serviceaccount created with an attached rolebinding, and it worked fine without any configuration changes. This is what I have now:

```yaml
  rbac:
    create: false
  serviceAccount:
    create: false
    name: rabbitmq-ha
    automountServiceAccountToken: true
```
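For reference, with both `rbac.create` and `serviceAccount.create` set to false, the chart expects the `rabbitmq-ha` ServiceAccount (plus its Role/RoleBinding, as described above) to exist in the namespace before install. A hypothetical pre-created account might look like (namespace taken from this issue's output):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rabbitmq-ha
  namespace: stackstorm-test
automountServiceAccountToken: true
```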

Thanks for the help!