airbnb / nerve

A service registration daemon that performs health checks; companion to airbnb/synapse
MIT License
942 stars 151 forks

More than one reporter? #47

Closed alexanderjardim-zz closed 7 years ago

alexanderjardim-zz commented 10 years ago

Hi,

I am thinking of using SmartStack for service discovery. Currently, I use monit to keep my services alive and run health checks.

Since I would be using nerve to run the same health checks, I wonder whether it makes sense to configure the same health check in two places and run each one twice, or whether it would make more sense to have nerve report to both my alarm system and zk at the same time for the same failed health check.

So, does it make any sense to have more than one reporter registered at the same time?
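To make the question concrete, here is a minimal sketch of one health check whose result is handed to two reporters. This is not nerve's API: `run_check`, `discovery_reporter`, and `alert_reporter` are hypothetical names, and the in-memory `registered` set and `alerts` list only stand in for zk and an alarm system.

```python
from typing import Callable, List

# Hypothetical reporter type: gets the service name and whether the check passed.
Reporter = Callable[[str, bool], None]


def run_check(service: str, check: Callable[[], bool], reporters: List[Reporter]) -> bool:
    """Run one health check once and pass the same result to every reporter."""
    try:
        healthy = bool(check())
    except Exception:
        # A check that raises is treated as a failed check.
        healthy = False
    for report in reporters:
        report(service, healthy)
    return healthy


# In-memory stand-ins (assumptions) for a zk registration and an alarm system.
registered = set()
alerts = []


def discovery_reporter(service: str, healthy: bool) -> None:
    # Register healthy services, deregister unhealthy ones.
    if healthy:
        registered.add(service)
    else:
        registered.discard(service)


def alert_reporter(service: str, healthy: bool) -> None:
    # Only failures produce an alert.
    if not healthy:
        alerts.append(f"{service} failed its health check")


# One failing check, reported to both places.
run_check("api", lambda: False, [discovery_reporter, alert_reporter])
```

The check runs once and each reporter decides on its own what to do with the result, which is the behaviour the question is asking about.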

alexanderjardim-zz commented 10 years ago

Just marking my issue as a question

igor47 commented 10 years ago

Short answer: just use both nerve and monit side by side.

Long answer:

There are two things to consider here:

We have two separate components. Nerve only does health checks, and the only action it takes is publishing the results of those health checks. Synapse acts on the results of the health check to configure haproxy.

We have additional alerting capabilities at Airbnb which also consume the results of the health checks to generate alerts, like monit would do. However, we don't have any component that actively tries to restore health. This is because we're worried that actively trying to restore health would only cause more problems.

Ideally, if your code encounters bugs, it would fail fast. We use runit to run all of our services, so they would get automatically restarted. This is how we run nerve and synapse in prod as well.

However, if you are failing health checks because of failing upstream dependencies, restarting the service would not help and might cause harm as a starting service hammers your dependencies. This would argue against the use of monit for actively intervening in failing health checks.

alexanderjardim-zz commented 10 years ago

Ok, forget about monit restarting my services. The point is: both monit and nerve will do the same health checks. Monit will start my alarm routines and nerve will notify zk that one node is out. Does it make sense to put both alarm reporting and service discovery reporting on nerve?

igor47 commented 10 years ago

Like I said, I think it is best to have a separate tool do the alerting, one that is linked with the rest of your systems' alerting; we're planning on open-sourcing such a tool soon.

jolynch commented 7 years ago

I think there are lots of options here: either do what igor has mentioned, or do what we do at Yelp and monitor the other end of the equation in Synapse (check that enough instances are actually in HAProxy).
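As a rough illustration of the "enough instances in HAProxy" idea, the sketch below counts the servers that HAProxy reports as UP for one backend, given the CSV text of its stats output. It is not taken from the thread or from Yelp's tooling: `count_up_servers`, `MIN_INSTANCES`, and the trimmed `SAMPLE` data are assumptions for the example, and a real stats dump has many more columns than the three shown here.

```python
import csv
import io

# Hypothetical alert threshold; pick a value that suits the service.
MIN_INSTANCES = 2


def count_up_servers(stats_csv: str, backend: str) -> int:
    """Count servers in `backend` whose status starts with "UP"."""
    # The stats CSV header line starts with "# "; strip it so that
    # DictReader sees plain column names such as "pxname".
    text = stats_csv.lstrip("# ")
    rows = csv.DictReader(io.StringIO(text))
    return sum(
        1
        for row in rows
        if row["pxname"] == backend
        # Skip the aggregate FRONTEND/BACKEND rows; count real servers only.
        and row["svname"] not in ("FRONTEND", "BACKEND")
        and row["status"].startswith("UP")
    )


# Trimmed, made-up sample in the shape of HAProxy's stats CSV.
SAMPLE = (
    "# pxname,svname,status\n"
    "api,FRONTEND,OPEN\n"
    "api,host1,UP\n"
    "api,host2,DOWN\n"
    "api,host3,UP 1/2\n"
    "api,BACKEND,UP\n"
)

up = count_up_servers(SAMPLE, "api")
enough = up >= MIN_INSTANCES
```

An alerting system could run a check like this periodically and alert when `enough` is false, which keeps alerting separate from nerve itself.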

Since this hasn't had any action for a few years I'm going to close this. Feel free to re-open if these answers are insufficient :-)