docker-library / rabbitmq

Docker Official Image packaging for RabbitMQ
http://www.rabbitmq.com/
MIT License

Service crashing after updating from 3.7.18 -> 3.7.19 #375

Closed toddgardner closed 4 years ago

toddgardner commented 5 years ago

I had my rabbit pinned at 3.7, and when 3.7.19 rolled out (on a server restart) the workers started crashing with the log messages below.

This is similar to https://github.com/docker-library/rabbitmq/issues/367#issuecomment-530999422 but I don't use those particular options; I'm not clear yet which ones could be causing an issue.

** Last message in was {add,rabbitmq_management_tcp,[{cowboy_opts,[{sendfile,false}]},{port,15672}],#Fun<rabbit_web_dispatch.0.72790530>,[{'_',[],[{[],[],cowboy_static,{priv_file,rabbitmq_management,"www/index.html"}},{[<<"api">>,<<"overview">>],[],rabbit_mgmt_wm_overview,[]},{[<<"api">>,<<"cluster-name">>],[],rabbit_mgmt_wm_cluster_name,[]},{[<<"api">>,<<"nodes">>],[],rabbit_mgmt_wm_nodes,[]},{[<<"api">>,<<"nodes">>,node],[],rabbit_mgmt_wm_node,[]},{[<<"api">>,<<"nodes">>,node,<<"memory">>],[],rabbit_mgmt_wm_node_memory,[absolute]},{[<<"api">>,<<"nodes">>,node,<<"memory">>,<<"relative">>],[],rabbit_mgmt_wm_node_memory,[relative]},{[<<"api">>,<<"nodes">>,node,<<"memory">>,<<"ets">>],[],rabbit_mgmt_wm_node_memory_ets,[absolute]},{[<<"api">>,<<"nodes">>,node,<<"memory">>,...],...},...]}],...}
** When Server state == undefined
** Reason for termination ==
** {{incompatible_listeners,{"RabbitMQ Management",[{cowboy_opts,[{sendfile,false}]},{port,15672}]},{"RabbitMQ Management",[{cowboy_opts,[{sendfile,false}]},{port,15672},{ssl,false}]}},[{rabbit_web_dispatch_registry,handle_call,3,[{file,"src/rabbit_web_dispatch_registry.erl"},{line,92}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,661}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,690}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
** Client <0.653.0> stacktrace
** [{gen,do_call,4,[{file,"gen.erl"},{line,167}]},{gen_server,call,3,[{file,"gen_server.erl"},{line,219}]},{rabbit_web_dispatch,register_context_handler,5,[{file,"src/rabbit_web_dispatch.erl"},{line,35}]},{rabbit_mgmt_app,start_listener,3,[{file,"src/rabbit_mgmt_app.erl"},{line,132}]},{rabbit_mgmt_app,'-start_configured_listener/2-lc$^0/1-0-',3,[{file,"src/rabbit_mgmt_app.erl"},{line,55}]},{rabbit_mgmt_app,'-start_configured_listener/2-lc$^0/1-0-',3,[{file,"src/rabbit_mgmt_app.erl"},{line,56}]},{rabbit_mgmt_app,start,2,[{file,"src/rabbit_mgmt_app.erl"},{line,37}]},{application_master,start_it_old,4,[{file,"application_master.erl"},{line,277}]}]
2019-10-08 04:24:31.052 [error] <0.594.0> CRASH REPORT Process rabbit_web_dispatch_registry with 0 neighbours exited with reason: {incompatible_listeners,{"RabbitMQ Management",[{cowboy_opts,[{sendfile,false}]},{port,15672}]},{"RabbitMQ Management",[{cowboy_opts,[{sendfile,false}]},{port,15672},{ssl,false}]}} in rabbit_web_dispatch_registry:handle_call/3 line 92
2019-10-08 04:24:31.052 [error] <0.593.0> Supervisor rabbit_web_dispatch_sup had child rabbit_web_dispatch_registry started with rabbit_web_dispatch_registry:start_link() at <0.594.0> exit with reason {incompatible_listeners,{"RabbitMQ Management",[{cowboy_opts,[{sendfile,false}]},{port,15672}]},{"RabbitMQ Management",[{cowboy_opts,[{sendfile,false}]},{port,15672},{ssl,false}]}} in context child_terminated
2019-10-08 04:24:31.053 [error] <0.652.0> CRASH REPORT Process <0.652.0> with 0 neighbours exited with reason: {{incompatible_listeners,{"RabbitMQ Management",[{cowboy_opts,[{sendfile,false}]},{port,15672}]},{"RabbitMQ Management",[{cowboy_opts,[{sendfile,false}]},{port,15672},{ssl,false}]}},{gen_server,call,[rabbit_web_dispatch_registry,{add,rabbitmq_management_tcp,[{cowboy_opts,[{sendfile,false}]},{port,15672}],#Fun<rabbit_web_dispatch.0.72790530>,[{'_',[],[{[],[],cowboy_static,{priv_file,rabbitmq_management,"www/index.html"}},{[<<"api">>,<<"overview">>],[],rabbit_mgmt_wm_overview,[]},{[<<"api">>,...],...},...]}],...},...]}} in application_master:init/4 line 138

rabbitmq.conf

      ## Cluster formation. See http://www.rabbitmq.com/cluster-formation.html to learn more.
      cluster_formation.peer_discovery_backend  = rabbit_peer_discovery_k8s
      cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
      cluster_formation.randomized_startup_delay_range.min = 0
      cluster_formation.randomized_startup_delay_range.max = 2
      cluster_formation.k8s.address_type = hostname
      ## How often should node cleanup checks run?
      cluster_formation.node_cleanup.interval = 10
      ## Set to false if automatic removal of unknown/absent nodes
      ## is desired. This can be dangerous, see
      ##  * http://www.rabbitmq.com/cluster-formation.html#node-health-checks-and-cleanup
      ##  * https://groups.google.com/forum/#!msg/rabbitmq-users/wuOfzEywHXo/k8z_HWIkBgAJ
      cluster_formation.node_cleanup.only_log_warning = true
      cluster_partition_handling = autoheal
      ## See http://www.rabbitmq.com/ha.html#master-migration-data-locality
      queue_master_locator = min-masters
      ## See http://www.rabbitmq.com/access-control.html#loopback-users
      loopback_users.guest = false
      ## Memory-based Flow Control threshold
      vm_memory_high_watermark.absolute = 256MB

and setting env vars:

RABBITMQ_USE_LONGNAME=true
RABBITMQ_NODENAME=rabbit@$(MY_POD_NAME).rabbitmq-discovery-prod.default.svc.cluster.local
K8S_SERVICE_NAME=rabbitmq-discovery-prod
K8S_HOSTNAME_SUFFIX=.rabbitmq-discovery-prod.default.svc.cluster.local
RABBITMQ_DEFAULT_USER=rabbit
RABBITMQ_DEFAULT_PASS=<secret>
RABBITMQ_ERLANG_COOKIE=<secret>
wglambert commented 5 years ago

This is the only related issue I could find https://github.com/rabbitmq/rabbitmq-website/issues/841

michaelklishin commented 5 years ago

As explained in https://github.com/rabbitmq/rabbitmq-website/issues/841, our guess is that this image produces an incorrect config file with certain combinations of environment variables. RABBITMQ_DEFAULT_USER and RABBITMQ_DEFAULT_PASS are variables used only by this image, not by RabbitMQ itself, and there is no reason to configure those values via environment variables. Use the config file.
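
For reference, a minimal sketch of the same defaults expressed in rabbitmq.conf (default_user and default_pass are standard rabbitmq.conf keys; the placeholder stands in for wherever you keep the real secret):

      ## Equivalent of the image's RABBITMQ_DEFAULT_USER / RABBITMQ_DEFAULT_PASS,
      ## expressed as plain RabbitMQ configuration.
      default_user = rabbit
      default_pass = <secret>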

michaelklishin commented 5 years ago

On an unrelated note, this snippet

loopback_users.guest = false

suggests this is a production environment, yet remote access for the guest user is enabled. This is a terrible idea. Yes, if you override the default user credentials it technically does not matter, but it suggests to me that an example configuration is being taken into production without review.

toddgardner commented 5 years ago

Hmm, I'm unable to replicate the incident. Rolling forward and repeatedly restarting workers on a staging environment, I can't get the problem to repeat. Yeesh.

@michaelklishin I have thoroughly reviewed my setup and I'm familiar with the security implications. As you said, it doesn't matter: adding a layer of user management to my rabbit setup just to make people who read the config feel better does not improve my security posture, and it creates management headaches. With my current setup I can wipe and recreate the cluster with no effect on my production systems other than delayed processing.

Similarly, using the config file over the environment variables sticks me with either embedding secrets where they don't belong or rewriting https://github.com/docker-library/rabbitmq/blob/master/docker-entrypoint.sh to get the same guarantees, and there's no reason to suspect my version of docker-entrypoint.sh would be free of the same problem.

roroettg commented 5 years ago

I ran into this issue today, and it seems RabbitMQ does not like the following lines of the generated config file.

management.listener.port = 15672
management.listener.ssl = false

After removing these lines everything was fine.
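
If the management port does need to be pinned explicitly, a sketch of the newer-style key (introduced in 3.7.x; verify against the docs for your RabbitMQ version) that may avoid the incompatible_listeners clash:

      ## Newer-style management listener configuration.
      management.tcp.port = 15672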

michaelklishin commented 4 years ago

@wglambert this issue likely has nothing to do with the image at this point. We introduced different management.* listener settings in 3.7.x (the original ones haven't been removed) and gradually updated the docs to use them. They match the regular TCP listener syntax more closely and allow for dual HTTP/HTTPS configurations. There is a good chance that some users try to use the newer settings with older images. This is not image-specific.
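
A hedged side-by-side of the two syntaxes (based on the published docs; check the documentation for your version before relying on this):

      ## Older management listener syntax (still recognized):
      # management.listener.port = 15672
      # management.listener.ssl  = false

      ## Newer 3.7.x-style syntax, matching the regular TCP listener
      ## style and allowing separate HTTP and HTTPS listeners:
      management.tcp.port = 15672
      # management.ssl.port = 15671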