airbnb / nerve

A service registration daemon that performs health checks; companion to airbnb/synapse
MIT License
942 stars 151 forks source link

Dynamic config2 #16

Closed igor47 closed 9 years ago

igor47 commented 10 years ago

same as https://github.com/airbnb/nerve/pull/10 but rebased on current master

@pcarrier @brndnmtthws

igor47 commented 10 years ago

warning: DO NOT MERGE THIS!

brndnmtthws commented 10 years ago

I added a new thing recently:

igor47 commented 10 years ago

sorry for the delay; in general, i'm not opposed to the concept here. but i'm not a fan of the implementation. i would like to cleaner separation of concerns, with a cleaner interface for dynamically adding and removing services in the service watcher and a better-documented and separated out server. i'm also not sure if eventmachine is the right approach. i would not merge this as-is.

also, i think @pcarrier has additional concerns about the concept as a whole. pierre, can you elaborate?

brndnmtthws commented 10 years ago

I'm not overly attached to this implementation--it's just a minimum viable product. The intention is to be able to make progress on other things while you guys decide what the overall direction is for dynamic service discovery. We need functionality like this, but it doesn't necessarily need to be done this way.

That said, we've been using this for a few weeks now without any trouble.

igor47 commented 10 years ago

we talked in person about how to move forward on this. @brndnmtthws is going to (a) split out the server code (b) improve the interface for setting up additional watchers internally and (c) think about improving the TCP protocol for service registration to be more resilient to currently unknown effects.

this should be within the next week.

brndnmtthws commented 10 years ago

I'm made progress on this, but I'm stuck on an issue with JRuby right now. I can't seem to get the eventmachine jar to load correctly without forcing it to use the ruby version of eventmachine rather than the java one.

brndnmtthws commented 10 years ago

Bump. Added some documentation to the readme.

brndnmtthws commented 10 years ago

@igor47 @pcarrier

Bump!

Also, this is a bit of a duplicate of #17 since it resolves some JRuby issues.

brndnmtthws commented 10 years ago

Updated and rebased. Might be completely broken now, however.

Can we get this merged please? Thanks.

brndnmtthws commented 10 years ago

Still need this, fyi.

brndnmtthws commented 10 years ago

Yep, still important.

brndnmtthws commented 10 years ago

@airbnb/sre

Yep, still need this.

The original PR (https://github.com/airbnb/nerve/pull/10#issue-20776334) was opened 5 months ago. What's the problem?

igor47 commented 10 years ago

what does "might be completely broken now" mean?

brndnmtthws commented 10 years ago

You changed a bunch of things and ignored this PR, so for all I know this branch is broken now. I'll try and test it today, but it's rather frustrating.

brndnmtthws commented 10 years ago

I've fixed the breakage, everything works again.

pcarrier commented 10 years ago

I haven't ignored this, I've suggested running multiple nerve instances to achieve the same effect.

You've ignored my recommendation.

brndnmtthws commented 10 years ago

Your suggestion was never ignored. We discussed it, and our team decided this was the best path. On Mar 13, 2014 12:35 PM, "Pierre Carrier" notifications@github.com wrote:

I haven't ignored this, I've suggested running multiple nerve instances to achieve the same effect.

You've ignored my recommendation.

Reply to this email directly or view it on GitHubhttps://github.com/airbnb/nerve/pull/16#issuecomment-37577032 .

nelgau commented 10 years ago

I've observed that this version of Nerve fails to announce services and logs that a service is "up" when no mention of that host can be found in Zookeeper.

Consider host i-72a9b25c:

I, [2014-03-26T21:09:46.253000 #26106]  INFO -- Nerve::ServiceCheck::TcpServiceCheck: nerve: service check kafka-h1 tcp-xx.xx.xxx.xxx:31124 initial check returned true
I, [2014-03-26T21:09:46.268000 #26106]  INFO -- Nerve::ServiceWatcher: nerve: service kafka-h1 is now up

However, Zookeeper has no i-72a9b25c registered. Restarting Nerve corrected the issue. A second instance had the same pathology and it was also corrected by a restart.

I received a stack trace from @brndnmtthws on Tuesday that might be of some help to track down the cause of the issue: https://gist.github.com/nelgau/aad34287936abce3ca08

brndnmtthws commented 10 years ago

I tracked the problem @nelgau mentioned above in to a bug in how nerve handles ZK nodes. I've pushed a patch here:

https://github.com/airbnb/nerve/commit/4d5690af5664c35c64b8960678b9d55151475836

With ZooKeeper, in the event of a session timeout (or other interruption) you need to recreate ephemeral nodes. It now correctly recreates the nodes when they are lost.

jolynch commented 9 years ago

I'm going to close this because I don't think that dynamic nerve/synapse config reloads should be the goal going forward. We can reconsider if our current hitless system turns out not to scale (very possible given the design), in which case we'll work on it then.

Basically, you can always restart nerve without dropping traffic by starting two nerves and then cycling between them. I actually think the system is easier to understand this way because the config is the source of truth for what nerve/synapse are doing.