Closed igor47 closed 9 years ago
warning: DO NOT MERGE THIS!
I added a new thing recently:
sorry for the delay; in general, i'm not opposed to the concept here. but i'm not a fan of the implementation. i would like to cleaner separation of concerns, with a cleaner interface for dynamically adding and removing services in the service watcher and a better-documented and separated out server. i'm also not sure if eventmachine is the right approach. i would not merge this as-is.
also, i think @pcarrier has additional concerns about the concept as a whole. pierre, can you elaborate?
I'm not overly attached to this implementation--it's just a minimum viable product. The intention is to be able to make progress on other things while you guys decide what the overall direction is for dynamic service discovery. We need functionality like this, but it doesn't necessarily need to be done this way.
That said, we've been using this for a few weeks now without any trouble.
we talked in person about how to move forward on this. @brndnmtthws is going to (a) split out the server code (b) improve the interface for setting up additional watchers internally and (c) think about improving the TCP protocol for service registration to be more resilient to currently unknown effects.
this should be within the next week.
I'm made progress on this, but I'm stuck on an issue with JRuby right now. I can't seem to get the eventmachine jar to load correctly without forcing it to use the ruby version of eventmachine rather than the java one.
Bump. Added some documentation to the readme.
@igor47 @pcarrier
Bump!
Also, this is a bit of a duplicate of #17 since it resolves some JRuby issues.
Updated and rebased. Might be completely broken now, however.
Can we get this merged please? Thanks.
Still need this, fyi.
Yep, still important.
@airbnb/sre
Yep, still need this.
The original PR (https://github.com/airbnb/nerve/pull/10#issue-20776334) was opened 5 months ago. What's the problem?
what does "might be completely broken now" mean?
You changed a bunch of things and ignored this PR, so for all I know this branch is broken now. I'll try and test it today, but it's rather frustrating.
I've fixed the breakage, everything works again.
I haven't ignored this, I've suggested running multiple nerve instances to achieve the same effect.
You've ignored my recommendation.
Your suggestion was never ignored. We discussed it, and our team decided this was the best path. On Mar 13, 2014 12:35 PM, "Pierre Carrier" notifications@github.com wrote:
I haven't ignored this, I've suggested running multiple nerve instances to achieve the same effect.
You've ignored my recommendation.
Reply to this email directly or view it on GitHubhttps://github.com/airbnb/nerve/pull/16#issuecomment-37577032 .
I've observed that this version of Nerve fails to announce services and logs that a service is "up" when no mention of that host can be found in Zookeeper.
Consider host i-72a9b25c
:
I, [2014-03-26T21:09:46.253000 #26106] INFO -- Nerve::ServiceCheck::TcpServiceCheck: nerve: service check kafka-h1 tcp-xx.xx.xxx.xxx:31124 initial check returned true
I, [2014-03-26T21:09:46.268000 #26106] INFO -- Nerve::ServiceWatcher: nerve: service kafka-h1 is now up
However, Zookeeper has no i-72a9b25c
registered. Restarting Nerve corrected the issue.
A second instance had the same pathology and it was also corrected by a restart.
I received a stack trace from @brndnmtthws on Tuesday that might be of some help to track down the cause of the issue: https://gist.github.com/nelgau/aad34287936abce3ca08
I tracked the problem @nelgau mentioned above in to a bug in how nerve handles ZK nodes. I've pushed a patch here:
https://github.com/airbnb/nerve/commit/4d5690af5664c35c64b8960678b9d55151475836
With ZooKeeper, in the event of a session timeout (or other interruption) you need to recreate ephemeral nodes. It now correctly recreates the nodes when they are lost.
I'm going to close this because I don't think that dynamic nerve/synapse config reloads should be the goal going forward. We can reconsider if our current hitless system turns out not to scale (very possible given the design), in which case we'll work on it then.
Basically, you can always restart nerve without dropping traffic by starting two nerves and then cycling between them. I actually think the system is easier to understand this way because the config is the source of truth for what nerve/synapse are doing.
same as https://github.com/airbnb/nerve/pull/10 but rebased on current master
@pcarrier @brndnmtthws