airbnb / nerve

A service registration daemon that performs health checks; companion to airbnb/synapse
MIT License

Nerve fails to restart on watcher failure #59

Closed Jaykah closed 9 years ago

Jaykah commented 10 years ago

I saw a similar topic somewhere, but since that fix has apparently been merged, I decided to open a new issue.

I am using a simple mysql check to register the members of a Galera Cluster.

I, [2014-08-30T09:47:26.757807 #40790]  INFO -- Nerve::Reporter::Zookeeper: nerve: successfully created zk connection to x.example.com:2181,x2.example.com:2181,x3.example.com:2181/services/database
I, [2014-08-30T09:47:26.776437 #40790]  INFO -- Nerve::ServiceCheck::MySQLServiceCheck: nerve: service check user@10.1.1.1 initial check returned true
I, [2014-08-30T09:47:26.803240 #40790]  INFO -- Nerve::ServiceWatcher: nerve: service db is now up
I, [2014-08-30T13:58:51.491719 #40790]  INFO -- Nerve::ServiceCheck::MySQLServiceCheck: nerve: service check user@10.1.1.1 got error #<RuntimeError: failed to connect with mysql: ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
>
I, [2014-08-30T14:00:08.381207 #40790]  INFO -- Nerve::ServiceCheck::MySQLServiceCheck: nerve: service check user@10.1.1.1 got error #<RuntimeError: failed to connect with mysql: ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
>
I, [2014-08-30T17:22:00.684380 #40790]  INFO -- Nerve::ServiceCheck::MySQLServiceCheck: nerve: service check user@10.1.1.1 got error #<RuntimeError: failed to connect with mysql: ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
>

After this, the checks stop, and although the node has since been restored, it fails to register in ZooKeeper again.

marcuscavalcanti commented 10 years ago

Hi Guys,

I have a similar problem.

Nerve works very well at unregistering an instance that has problems (based on health/ping checks), but when the same instance comes back up, nerve doesn't register it in ZK again.

If I force a restart of nerve, everything works perfectly, but this is not an elegant way to fix the problem.

jolynch commented 9 years ago

This should be fixed with 86aa80409a and ab1388a253. With those changes, the nerve process watches the reporters and forces them to start again if they exit with an error. In addition, if we get a ZooKeeper session expiry, we recreate the ephemeral nodes as soon as we can re-establish the connection.
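The restart behavior described above can be illustrated with a small supervisor sketch. This is not nerve's actual code: `WatcherSupervisor`, `restart_failed`, and the `runners` hash are hypothetical names chosen for illustration, and the sketch assumes each watcher runs in its own Ruby thread.

```ruby
# Hypothetical sketch; class and method names are illustrative, not nerve's API.
class WatcherSupervisor
  attr_reader :restarts

  # runners: Hash of name => callable holding that watcher's (normally endless) loop.
  def initialize(runners)
    @runners = runners
    @threads = {}
    @restarts = Hash.new(0)
  end

  # Launch one thread per watcher.
  def start
    @runners.each_key { |name| launch(name) }
  end

  def alive?(name)
    thread = @threads[name]
    !thread.nil? && thread.alive?
  end

  # One supervision pass. Thread#status is nil only when the thread
  # terminated with an exception, so cleanly finished threads are left alone.
  # Returns the names of the watchers that were relaunched.
  def restart_failed
    failed = @threads.select { |_name, thread| thread.status.nil? }.keys
    failed.each do |name|
      @restarts[name] += 1
      launch(name)
    end
    failed
  end

  private

  def launch(name)
    runner = @runners.fetch(name)
    @threads[name] = Thread.new do
      # Keep the failure quiet here; the supervisor detects it via Thread#status.
      Thread.current.report_on_exception = false
      runner.call
    end
  end
end
```

A main loop would call `start` once and then `restart_failed` on a fixed interval, logging whatever names it returns. Restarting only on an exception mirrors the "exited with an error" wording above; a real daemon might also treat a clean exit of a long-running watcher as unexpected.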

jolynch commented 9 years ago

Please let me know if you are still seeing this issue, and we can re-open and dive into it more.