bitnami / charts

Bitnami Helm Charts
https://bitnami.com

[rabbitmq] With metrics.enabled:true and default chart config, pod gets 401 trying to get metrics #1571

Closed: rally25rs closed this issue 4 years ago

rally25rs commented 5 years ago

Which chart:

rabbitmq chart installed from a Helm requirements.yaml:

- name: "rabbitmq"
  repository: "@stable"
  version: "~6.10.1"

actual chart version in requirements.lock:

- name: rabbitmq
  repository: https://kubernetes-charts.storage.googleapis.com
  version: 6.10.1

Description

With a basic configuration and metrics enabled, the metrics exporter fails with:

time="2019-11-07T22:53:31Z" level=error msg="Error while retrieving data from rabbitHost" error="<nil>" host="http://localhost:15672" statusCode=401

Steps to reproduce the issue:

Use the requirements.yaml content from above. In values.yaml I am only setting:

rabbitmq:
  metrics:
    enabled: true
  resources: {}
  replicas: 1
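
For context, a minimal sketch of how this gets deployed with Helm 2; the parent chart path is a placeholder, and the release name and namespace are only inferred from the node names in the logs below:

helm dependency update ./my-parent-chart   # pulls rabbitmq 6.10.1 via the @stable alias
helm install --name cloud-services-qa --namespace cloud-services-qa \
  -f values.yaml ./my-parent-chart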

Describe the results you received:

Checking my running pods in AWS EKS, I noticed that the RabbitMQ pod had 90 restarts after 7h55m of age (so roughly one every 5-6 minutes).

Investigating the logs, I see the following from the rabbitmq container:

...some startup messages...

2019-11-07 23:17:18.336 [info] <0.8.0> Server startup complete; 5 plugins started.
 * rabbitmq_management
 * rabbitmq_management_agent
 * rabbitmq_web_dispatch
 * rabbitmq_peer_discovery_k8s
 * rabbitmq_peer_discovery_common
 completed with 5 plugins.
2019-11-07 23:17:31.026 [warning] <0.628.0> HTTP access denied: user 'user' - invalid credentials
2019-11-07 23:17:31.028 [warning] <0.630.0> HTTP access denied: user 'user' - invalid credentials
2019-11-07 23:17:31.029 [warning] <0.632.0> HTTP access denied: user 'user' - invalid credentials
2019-11-07 23:17:31.030 [warning] <0.634.0> HTTP access denied: user 'user' - invalid credentials
2019-11-07 23:17:37.440 [warning] <0.652.0> HTTP access denied: user 'user' - invalid credentials

...etc...

2019-11-07 23:22:04.396 [warning] <0.1186.0> HTTP access denied: user 'user' - invalid credentials
2019-11-07 23:22:07.396 [warning] <0.1191.0> HTTP access denied: user 'user' - invalid credentials
Stopping and halting node rabbit@cloud-services-qa-rabbitmq-0.cloud-services-qa-rabbitmq-headless.cloud-services-qa.svc.cluster.local ...
2019-11-07 23:22:08.987 [info] <0.1204.0> RabbitMQ is asked to stop...
2019-11-07 23:22:09.184 [info] <0.1204.0> Stopping RabbitMQ applications and their dependencies in the following order:
    rabbitmq_management
    amqp_client
    rabbitmq_web_dispatch
    cowboy
    cowlib
    rabbitmq_management_agent
    rabbitmq_peer_discovery_k8s
    rabbitmq_peer_discovery_common
    rabbit
    mnesia
    rabbit_common
    sysmon_handler
    os_mon
    ra
2019-11-07 23:22:09.184 [info] <0.1204.0> Stopping application 'rabbitmq_management'
2019-11-07 23:22:09.187 [warning] <0.454.0> RabbitMQ HTTP listener registry could not find context rabbitmq_management_tls
2019-11-07 23:22:09.188 [info] <0.1204.0> Stopping application 'amqp_client'
2019-11-07 23:22:09.188 [info] <0.43.0> Application rabbitmq_management exited with reason: stopped
2019-11-07 23:22:09.190 [info] <0.43.0> Application amqp_client exited with reason: stopped
2019-11-07 23:22:09.190 [info] <0.1204.0> Stopping application 'rabbitmq_web_dispatch'
2019-11-07 23:22:09.192 [info] <0.43.0> Application rabbitmq_web_dispatch exited with reason: stopped
2019-11-07 23:22:09.192 [info] <0.1204.0> Stopping application 'cowboy'
2019-11-07 23:22:09.193 [info] <0.43.0> Application cowboy exited with reason: stopped
2019-11-07 23:22:09.193 [info] <0.1204.0> Stopping application 'cowlib'
2019-11-07 23:22:09.193 [info] <0.1204.0> Stopping application 'rabbitmq_management_agent'
2019-11-07 23:22:09.193 [info] <0.43.0> Application cowlib exited with reason: stopped
2019-11-07 23:22:09.195 [info] <0.43.0> Application rabbitmq_management_agent exited with reason: stopped
2019-11-07 23:22:09.195 [info] <0.1204.0> Stopping application 'rabbitmq_peer_discovery_k8s'
2019-11-07 23:22:09.195 [info] <0.43.0> Application rabbitmq_peer_discovery_k8s exited with reason: stopped
2019-11-07 23:22:09.195 [info] <0.1204.0> Stopping application 'rabbitmq_peer_discovery_common'
2019-11-07 23:22:09.196 [info] <0.43.0> Application rabbitmq_peer_discovery_common exited with reason: stopped
2019-11-07 23:22:09.197 [info] <0.1204.0> Stopping application 'rabbit'
2019-11-07 23:22:09.197 [info] <0.309.0> Will unregister with peer discovery backend rabbit_peer_discovery_k8s
2019-11-07 23:22:09.197 [info] <0.442.0> stopped TCP listener on [::]:5672
2019-11-07 23:22:09.198 [info] <0.386.0> Closing all connections in vhost '/' on node 'rabbit@cloud-services-qa-rabbitmq-0.cloud-services-qa-rabbitmq-headless.cloud-services-qa.svc.cluster.local' because the vhost is stopping
2019-11-07 23:22:09.198 [info] <0.405.0> Stopping message store for directory '/opt/bitnami/rabbitmq/var/lib/rabbitmq/mnesia/rabbit@cloud-services-qa-rabbitmq-0.cloud-services-qa-rabbitmq-headless.cloud-services-qa.svc.cluster.local/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
2019-11-07 23:22:09.203 [info] <0.405.0> Message store for directory '/opt/bitnami/rabbitmq/var/lib/rabbitmq/mnesia/rabbit@cloud-services-qa-rabbitmq-0.cloud-services-qa-rabbitmq-headless.cloud-services-qa.svc.cluster.local/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent' is stopped
2019-11-07 23:22:09.204 [info] <0.402.0> Stopping message store for directory '/opt/bitnami/rabbitmq/var/lib/rabbitmq/mnesia/rabbit@cloud-services-qa-rabbitmq-0.cloud-services-qa-rabbitmq-headless.cloud-services-qa.svc.cluster.local/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient'
2019-11-07 23:22:09.212 [info] <0.402.0> Message store for directory '/opt/bitnami/rabbitmq/var/lib/rabbitmq/mnesia/rabbit@cloud-services-qa-rabbitmq-0.cloud-services-qa-rabbitmq-headless.cloud-services-qa.svc.cluster.local/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient' is stopped
2019-11-07 23:22:09.215 [info] <0.43.0> Application rabbit exited with reason: stopped
2019-11-07 23:22:09.215 [info] <0.1204.0> Stopping application 'mnesia'
2019-11-07 23:22:09.217 [info] <0.1204.0> Stopping application 'rabbit_common'
2019-11-07 23:22:09.217 [info] <0.43.0> Application mnesia exited with reason: stopped
2019-11-07 23:22:09.218 [info] <0.43.0> Application rabbit_common exited with reason: stopped
2019-11-07 23:22:09.218 [info] <0.1204.0> Stopping application 'sysmon_handler'
2019-11-07 23:22:09.219 [info] <0.43.0> Application sysmon_handler exited with reason: stopped
2019-11-07 23:22:09.219 [info] <0.1204.0> Stopping application 'os_mon'
2019-11-07 23:22:09.220 [info] <0.43.0> Application os_mon exited with reason: stopped
2019-11-07 23:22:09.220 [info] <0.1204.0> Stopping application 'ra'
2019-11-07 23:22:09.223 [info] <0.43.0> Application ra exited with reason: stopped
2019-11-07 23:22:09.223 [info] <0.1204.0> Successfully stopped RabbitMQ and its dependencies
Gracefully halting Erlang VM
2019-11-07 23:22:09.223 [info] <0.1204.0> Halting Erlang VM with the following applications:
    aten
    xmerl
    ranch
    ssl
    public_key
    asn1
    inets
    stdout_formatter
    gen_batch_server
    tools
    observer_cli
    recon
    credentials_obfuscation
    crypto
    jsx
    lager
    goldrush
    compiler
    syntax_tools
    sasl
    stdlib
    kernel

This repeats about every 5 minutes as a new container is started (this particular container ran from 23:17:18 to 23:22:09).

Inspecting the metrics container within the pod, I see:

time="2019-11-07T23:12:01Z" level=error msg="Error while retrieving data from rabbitHost" error="<nil>" host="http://localhost:15672" statusCode=401
time="2019-11-07T23:12:01Z" level=error msg="Error while retrieving data from rabbitHost" error="<nil>" host="http://localhost:15672" statusCode=401
time="2019-11-07T23:12:01Z" level=error msg="Error while retrieving data from rabbitHost" error="<nil>" host="http://localhost:15672" statusCode=401
time="2019-11-07T23:12:01Z" level=error msg="Error while retrieving data from rabbitHost" error="<nil>" host="http://localhost:15672" statusCode=401
time="2019-11-07T23:12:01Z" level=info msg="Metrics updated" duration=9.310272ms
time="2019-11-07T23:12:10Z" level=error msg="Error while retrieving data from rabbitHost" error="Get http://localhost:15672/api/exchanges?sort=: dial tcp 127.0.0.1:15672: connect: connection refused" host="http://localhost:15672" statusCode=0
time="2019-11-07T23:12:10Z" level=error msg="Error while retrieving data from rabbitHost" error="Get http://localhost:15672/api/nodes?sort=: dial tcp 127.0.0.1:15672: connect: connection refused" host="http://localhost:15672" statusCode=0
time="2019-11-07T23:12:10Z" level=error msg="Error while retrieving data from rabbitHost" error="Get http://localhost:15672/api/overview?sort=: dial tcp 127.0.0.1:15672: connect: connection refused" host="http://localhost:15672" statusCode=0
time="2019-11-07T23:12:10Z" level=error msg="Error while retrieving data from rabbitHost" error="Get http://localhost:15672/api/queues?sort=: dial tcp 127.0.0.1:15672: connect: connection refused" host="http://localhost:15672" statusCode=0
time="2019-11-07T23:12:10Z" level=info msg="Metrics updated" duration=2.505114ms

If I do a kubectl describe on the pod, I see the events:

Events:
  Type     Reason     Age                   From                                   Message
  ----     ------     ----                  ----                                   -------
  Warning  BackOff    13m (x100 over 7h8m)  kubelet, ip-10-199-1-124.ec2.internal  Back-off restarting failed container
  Normal   Pulled     8m48s (x93 over 8h)   kubelet, ip-10-199-1-124.ec2.internal  Container image "docker.io/bitnami/rabbitmq:3.8.0-debian-9-r0" already present on machine
  Warning  Unhealthy  2m55s (x838 over 8h)  kubelet, ip-10-199-1-124.ec2.internal  Readiness probe failed: curl: (22) The requested URL returned error: 401
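
The probe failure can also be reproduced by hand using whatever password is stored in the release secret. The secret name, key, container name, and health endpoint below are assumptions based on this chart version's conventions, not values taken from the issue:

# decode the password generated for the current Helm release
kubectl get secret cloud-services-qa-rabbitmq -n cloud-services-qa \
  -o jsonpath='{.data.rabbitmq-password}' | base64 --decode

# run a curl equivalent to the readiness probe from inside the rabbitmq container
kubectl exec cloud-services-qa-rabbitmq-0 -n cloud-services-qa -c rabbitmq -- \
  curl -f -u user:<decoded-password> http://127.0.0.1:15672/api/healthchecks/node

If the decoded password does not match what RabbitMQ actually accepts, this curl returns the same 401 the probe reports.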

Describe the results you expected:

I would expect the metrics to work (perhaps the exporter isn't using the generated credentials?) and the pod not to restart every 5 minutes.

Additional information you deem important (e.g. issue happens only occasionally):

I'm not sure whether a restart every 5 minutes is "normal", or whether these are two separate issues: the restarts, and the metrics exporter using invalid credentials.

There is zero load on RabbitMQ and no clients have connected. The Helm chart was just deployed and no other services are connecting to it.

Version of Helm and Kubernetes:

Client: &version.Version{SemVer:"v2.14.3", GitCommit:"0e7f3b6637f7af8fcfddb3d2941fcc7cbebb0085", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.3", GitCommit:"0e7f3b6637f7af8fcfddb3d2941fcc7cbebb0085", GitTreeState:"clean"}
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.11-eks-5876d6", GitCommit:"5876d6b7429820450950ade17fe7b4bf5ccada7f", GitTreeState:"clean", BuildDate:"2019-09-24T20:54:25Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
rally25rs commented 5 years ago

I think this might be a duplicate of https://github.com/helm/charts/issues/14893. It's probably using the password from the previous Helm deploy. What a pain 😞

It might be a good idea to put something in the chart's README noting that once this has been deployed with Helm, deleting and redeploying will result in a broken deployment unless you delete the PVC or use a specific password instead of a generated one.
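
A rough sketch of both workarounds, sticking with Helm 2 syntax; the value paths (rabbitmq.rabbitmq.password / rabbitmq.rabbitmq.erlangCookie) and the PVC name are my best guesses for this chart version, so check the chart's values.yaml and the actual PVC name before running anything:

# Option 1: pin the credentials so a redeploy can never disagree with the data already on the PVC
helm install --name cloud-services-qa --namespace cloud-services-qa -f values.yaml \
  --set rabbitmq.rabbitmq.password=<fixed-password> \
  --set rabbitmq.rabbitmq.erlangCookie=<fixed-cookie> \
  ./my-parent-chart

# Option 2: delete the old state before redeploying (this destroys any queue data)
kubectl delete pvc data-cloud-services-qa-rabbitmq-0 -n cloud-services-qa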

I can imagine a lot of people will run into this issue...

andresbono commented 5 years ago

Thanks for the feedback, @rally25rs. Where would you have expected to see that note in the README? Would you like to open a PR including the note? I guess it can be placed under this section: https://github.com/bitnami/charts/tree/master/upstreamed/rabbitmq#persistence

stale[bot] commented 4 years ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

stale[bot] commented 4 years ago

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.