aweber / rabbitmq-autocluster

This project is now maintained by the RabbitMQ team; visit the official repo at
https://github.com/rabbitmq/rabbitmq-autocluster

Cleanup Consul healthchecks #136

Closed adamlc closed 7 years ago

adamlc commented 7 years ago

Hey guys!

I've just been evaluating this plugin for our Nomad cluster. I've got everything up and running and all seems good!

I tried reducing the number of tasks in Nomad, which worked fine with the CLEANUP_WARN_ONLY feature. While the nodes get removed from RabbitMQ, the failing healthchecks stay in Consul, which triggers our alerting systems. Should these healthchecks be cleaned up automatically when the container stops? I couldn't see anything in the docs. If not, Consul 0.7+ has a feature called deregister_critical_service_after that will automatically remove a failed check after a period of time.
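For reference, here's roughly what I mean, assuming Consul 0.7+ and a local agent (the service name, port, and timeout are just placeholders):

```bash
# Register a service whose health check Consul will auto-deregister
# after it has been in the critical state for 30 minutes.
curl -X PUT http://localhost:8500/v1/agent/service/register -d '{
  "Name": "rabbitmq",
  "Port": 5672,
  "Check": {
    "TCP": "localhost:5672",
    "Interval": "10s",
    "DeregisterCriticalServiceAfter": "30m"
  }
}'
```

From what I can tell, the reaper that removes critical services only runs periodically, so the actual deregistration may lag the configured timeout slightly.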

I'd appreciate your thoughts, or let me know if I'm doing something wrong! With the dynamic nature of Nomad moving containers about, this may crop up more!

michaelklishin commented 7 years ago

The Consul backend does unregister; are you saying that it should do more than that?

Also, a node can fail without ever having a chance to unregister, and in this case it's impossible to know if that node is coming back. Resolving things like this is really an operator's responsibility and arguably somewhat out of scope for this plugin.

michaelklishin commented 7 years ago

The point above is in part why removing a node from a cluster requires a CLI command in RabbitMQ's traditional clustering, and some other systems (e.g. Cassandra) do the same. In RabbitMQ master much of this plugin (but not the Consul part) is in the core, so unregistration can be performed when a node is removed. This plugin in its current form doesn't have any chance of hooking into CLI commands.
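For context, this is the kind of explicit operator step traditional clustering requires (the node name here is illustrative):

```bash
# Removing a node from a RabbitMQ cluster is an explicit CLI action:
rabbitmqctl forget_cluster_node rabbit@old-node
```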

adamlc commented 7 years ago

Thanks for getting back to me so quickly, @michaelklishin :)

Fair enough, I suspected that might be the case! What are your thoughts on being able to specify deregister_critical_service_after in the Consul healthcheck? That would suffice in most cases, I think :)

michaelklishin commented 7 years ago

@adamlc I'm not a Consul expert, but according to the docs that setting is exactly what we want here.

@Gsantomaggio would you agree?

Gsantomaggio commented 7 years ago

Hi @adamlc, yes, I tested it: when Nomad scales down, the node is removed from the RabbitMQ cluster but not from Consul.

It seems that unregister is not called directly by the plugin in this case.

@michaelklishin yes, I agree, and I was thinking about these 3 options:

  1. Add a parameter to set deregister_critical_service_after
  2. Add the unregister code on stop (it should be safe)
  3. Try to find a way to trap the container stop and call something like rabbitmqctl eval 'autocluster:unregister().' (as a temporary workaround; a sketch follows below)
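A rough sketch of option 3 as an entrypoint wrapper (untested, and it assumes autocluster:unregister/0 exists as suggested above):

```sh
#!/bin/sh
# Temporary-workaround sketch: trap the container's stop signal and
# explicitly unregister from Consul before shutting the node down.
cleanup() {
  rabbitmqctl eval 'autocluster:unregister().'
  rabbitmqctl stop
}
trap cleanup TERM INT

rabbitmq-server &
wait "$!"
```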
michaelklishin commented 7 years ago

Options 1 and 2 seem to be no-brainers to me. I am on the fence about option 3.

Gsantomaggio commented 7 years ago

We can do both 1 and 2, maybe adding two params so we don't break the current behavior.
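For illustration, usage might end up looking something like this (CONSUL_DEREGISTER_AFTER is an assumed name for the new param, not an existing setting; AUTOCLUSTER_TYPE and CONSUL_HOST are the plugin's existing variables):

```bash
# Hypothetical node environment, with an assumed name for the new
# deregistration timeout parameter:
export AUTOCLUSTER_TYPE=consul
export CONSUL_HOST=consul.service.consul
export CONSUL_DEREGISTER_AFTER=60
rabbitmq-server
```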

michaelklishin commented 7 years ago

Fixed in https://github.com/rabbitmq/rabbitmq-autocluster/pull/3.