jolynch closed this pull request 8 years ago.
@EvanKrall, @solarkennedy this may be relevant to your interests.
In particular this change should make paasta services appear in SmartStack a lot faster (since we don't have to wait 60s, should only take ~2s to get a new registration now).
Wow, this is pretty great. I agree this would significantly improve the "response time" of nerve, and eliminate our dual-nerve upstart stuff (right?)
While this PR pushes that complexity into the nerve daemon, I think it is the "right" place to put it and allows it to benefit everyone who uses smartstack. It also puts nerve in line with how many production-level daemons operate (the reload-on-HUP behavior), which is nice.
I can't really comment on the Ruby. All this thread management looks like messy stuff and I have no experience with such things. :(
i feel like i've already suggested quite a refactor. does it make sense so far? i think given that we're doing much more complicated configuration management now, it makes sense to keep it all together in a single class.
do you want to take a stab at a re-architecture, and then i'll do another review?
Sure thing, that makes sense to me, I'll have a refactor with some specs asap :-)
@igor47 I think that I've refactored as you suggested. I've done some minimal testing locally and I'm going to start writing specs now ...
Hrm, I also need to think about how this is going to interact with process supervisors. Basically if "restart" becomes "send SIGHUP" then we're not sure if the config is going to blow up nerve as it is right now. Working on this issue.
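That supervisor concern can be sketched roughly like this (a hedged sketch; `load_config_if_valid` and the flag name are illustrative, not Nerve's actual internals): the SIGHUP handler only sets a flag, and the main loop validates the candidate config before swapping it in, so a bad file produces a warning instead of blowing up the daemon.

```ruby
require 'json'
require 'tempfile'

# Signal handlers should do minimal work: just record that a reload was asked for.
$reload_requested = false
Signal.trap('HUP') { $reload_requested = true }

# Parse the candidate config; return nil (and keep the old config) if it's invalid.
def load_config_if_valid(path)
  JSON.parse(File.read(path))
rescue StandardError => e
  warn "ignoring invalid config: #{e.message}"
  nil
end

# Simulate a supervisor sending us SIGHUP.
Process.kill('HUP', Process.pid)
sleep 0.1
puts $reload_requested  # true: the main loop would now re-read the config

good = Tempfile.new('nerve')
good.write('{"services": {}}')
good.flush
cfg = load_config_if_valid(good.path)
puts cfg['services'].empty?  # true
```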
Alright, I think I've incorporated your suggestions. Time for specs.
Is there any way I can help get this to a point where it can be merged? It would be very valuable. We have quite a bit of code in a supervisor that maintains two nerve instances to enable us to add services without taking existing services offline.
@ubiquitousthey This isn't too far off; I just have to write some proper specs and work through the feedback igor47 just gave me.
I'll try to finish this up this weekend :-)
@igor47 I believe I've refactored to make it cleaner. Along the way, while testing, I noticed a bug where I didn't properly handle service config changes due to how Ruby does hash copies (really, the only way to do a deep copy is to use Marshal...?).
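For reference, the hash-copy pitfall in question: `Hash#dup` and `#clone` are shallow, so nested service-config hashes end up shared between "copies", and a `Marshal` round-trip is the usual stdlib way to get a true deep copy:

```ruby
# Shallow copy: the nested hash is shared between original and "copy".
original = { 'service' => { 'port' => 8080 } }
shallow  = original.dup
shallow['service']['port'] = 9090
puts original['service']['port']  # 9090 -- the mutation leaked through!

# Deep copy via Marshal round-trip: the copy is fully independent.
original = { 'service' => { 'port' => 8080 } }
deep = Marshal.load(Marshal.dump(original))
deep['service']['port'] = 9090
puts original['service']['port']  # 8080 -- untouched
```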
Current status of this PR:
Functionality: I think it mostly works. I can manually change the configuration file in various ways, send SIGHUP, and see the proper registrations be removed, added, changed, etc. I can use the check mode to ensure that the config is valid before making Nerve load it (this is technically a race condition, but apparently Apache has this race condition as well and it works out ok).
Tests: I've written a basic test of the config_manager, but I still need to figure out how to properly test the SIGHUP changes. I'm working on those tests now.
My current concerns:
The launch/reap in the "update" section is hella racy. There is no guarantee that the launching thread has time to register the service before the old watcher is reaped. I'm considering using a thread-local variable to signal to the main thread that a watcher has actually started, which we can optionally wait for in launch_watcher. Also, tests of all these shenanigans.
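One possible shape for that signaling (a rough sketch; `launch_watcher` and the `:started` key are illustrative, not Nerve's actual code): the watcher thread flips a thread-local once it has registered, and the caller can optionally block on that before reaping the old watcher.

```ruby
def launch_watcher(wait_for_start: false)
  thread = Thread.new do
    Thread.current[:started] = false
    # ... connect to ZK and create the ephemeral node here ...
    Thread.current[:started] = true
    # ... the main check_and_report loop would run here ...
  end

  if wait_for_start
    # Poll the thread-local until the watcher reports it has registered.
    sleep 0.01 until thread[:started]
  end
  thread
end

t = launch_watcher(wait_for_start: true)
puts t[:started]  # true: safe to reap the old watcher now
t.join
```

On startup you would skip the wait entirely (fire up all watchers concurrently); the wait only matters when replacing an existing watcher.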
i guess if we're concerned about the race conditions with registration and want to make sure that the service doesn't flap during reload, then now might be a good time to implement a hand-off? that is more correct and probably just as complex to implement as some sort of thread synchronization.
I'm concerned about race conditions on the launch/reap cycle because we've been bitten by that in production (where the "new" nerve didn't have time to register all services before the "old" nerve went away). The first priority of Nerve has to be to keep services registered otherwise Synapse can ruin your day.
I guess the way I look at it is that this PR is strictly better than the status quo at preventing churn (where if you change any registration all registrations on the box have to bounce), and it introduces a lot of useful machinery for moving towards a truly hitless world.
I agree that some kind of reporter-owned hand-off would be superior, but I would like to do it in a follow-up PR if possible, as this is already a fairly risky change. Presumably the refactor for a handoff strategy would be something like exposing a handoff method from ServiceWatchers, and instead of launch/reap we'd call handoff.
that seems reasonable, carry on!
Just a thought. What if you setup zk watchers and waited until you got matching ZK nodes or a timeout?
@ubiquitousthey I think that's exactly the right direction, namely we need to wait until the first check_and_report finishes in the watcher. The slightly tricky part is that by default we don't want to wait until that happens (e.g. when Nerve starts up we want to fire up 100 threads that all connect to ZK asap and setup ephem nodes), we only want to wait when we're replacing an existing service_watcher.
This isn't that hard with thread local variables (a la :finish), I just have to do it heh.
In Python or Java, I would use an Event for this kind of thread synchronization. A little googling did not reveal such a thing in Ruby. I did find a small implementation at https://emptysqua.re/blog/an-event-synchronization-primitive-for-ruby/ based on the Python implementation using Ruby's Mutex.
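For reference, a minimal Event in the spirit of that post, built from Ruby's Mutex and ConditionVariable (a sketch, not production code; spurious wakeups are not re-checked in a loop here):

```ruby
# A minimal Event primitive modeled on Python's threading.Event.
class Event
  def initialize
    @mutex = Mutex.new
    @cond  = ConditionVariable.new
    @flag  = false
  end

  # Wake everyone waiting on the event.
  def set
    @mutex.synchronize do
      @flag = true
      @cond.broadcast
    end
  end

  def set?
    @mutex.synchronize { @flag }
  end

  # Block until the event is set, or until +timeout+ seconds elapse.
  # Returns the flag's value at wakeup.
  def wait(timeout = nil)
    @mutex.synchronize do
      @cond.wait(@mutex, timeout) unless @flag
      @flag
    end
  end
end

event = Event.new
waiter = Thread.new { event.wait(5) }
event.set
waiter.join
puts event.set?  # true
```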
@igor47 I've added a spec for the main nerve application that I think exercises at least some of the interesting edge cases. I was struggling to actually test the threading interactions with standard rspec so I added the rspec_wait gem which helped a little, but now the tests could flake. Also, I still can't do the normal "expect this got called with this" kind of spec tests because those don't seem to work across threads. If you know of a better way that doesn't involve mocking out all the threads please let me know (which sorta defeats the point of the test IMO since all the logic is asserting that the threads get started and reaped appropriately).
If you don't see anything crazy in this I'd like to potentially merge it and start deploying it at Yelp in dev/staging to get some real-world experience with it. If that goes well and we make it to prod I'll circle back and cut a new release.
@ubiquitousthey So Ruby does have ConditionVariables, but I find them super awkward because you have to share a mutex. I've just signaled back to the main thread with another thread-local variable, which seems to work reasonably well.
Ugh, crazy threading tests are introducing flakes ... I'll figure that out and loop back.
Edit: It turns out that subject is cached across examples ... that's not surprising at all. Should be fixed now :-)
i'm surprised that subject is cached across examples. maybe something else was the problem there?
also, i would avoid actually creating threads during the spec run. one way to do that is like so:
```ruby
before do
  allow(Thread).to receive(:new).and_yield
end
```
if you're testing stuff like "do we properly join the threads we start", i would do something like:
```ruby
let(:thread_obj) { instance_double(Thread) }

before do
  # run the thread body inline, then hand back the double
  allow(Thread).to receive(:new) do |&block|
    block.call
    thread_obj
  end
end

it 'properly joins the thread' do
  expect(thread_obj).to receive(:join)
  my_thing_that_starts_a_thread()
end
```
in general, to create more determinism, you should just run everything sequentially. remember, your goal is not to test whether Threads work -- the proper place to test that is in Ruby's own test suite, and we can assume that as a given. in your own specs, you just want to make sure you invoke the correct methods.
Yea, I'm conflicted about this.
To give you some context, I originally started with mocking out the Threads (this is why I took so long writing specs, I wrote them 3 times) but then I realized that I wasn't testing much of the actual code, especially the coordination over thread local variables.
If I want confidence that components like the responsive sleep and thread-local variables are working (e.g. writable by the main thread), the only way I can build that confidence is by launching bonafide threads. The tolerances are sufficiently large (30s vs ~1s responsive sleeps) that I believe the benefit of having "real" tests outweighs the likelihood of flakes.
For what it's worth, we've found real integration tests to be much more useful for testing nerve and synapse than traditional "pure" unit tests. That's why I ended up leaning towards more of an integration test than a unit test: as much as philosophy says one thing, the point of the tests is to help prevent bugs and assert contracts about how the code works, and because Nerve is so intimately tied to threading behaviour (which changes in different Ruby versions, fwiw), I lean towards using real Threads in the overall test as well.
completely agree with you on the integration tests -- i wrote such a suite too, in https://github.com/airbnb/smartstack-cookbook/blob/master/files/default/tests/minitest/test_test.rb, but i haven't maintained it at all :(
i disagree that you need to run actual threads to write unit tests for specific functions. the argument you made basically indicts the entire idea of unit tests (they're not "real" tests). for example, you list responsive_sleep as a component which requires actual threads to test, but i think this is a prime example of a component which can be unit tested with high confidence.
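For example, assuming a responsive sleep with a signature roughly like `responsive_sleep(seconds) { stop? }` (an assumption for illustration, not necessarily Nerve's exact helper), it can be exercised deterministically with a canned stop condition and a tiny step, no real watcher threads required:

```ruby
# Sleeps up to +seconds+ total in +step+ increments, returning early as soon
# as the block reports the stop condition. Returns the elapsed time.
def responsive_sleep(seconds, step: 1)
  elapsed = 0
  while elapsed < seconds
    return elapsed if block_given? && yield
    sleep(step)
    elapsed += step
  end
  elapsed
end

# Deterministic "unit test": the stop condition flips after a few checks,
# and a tiny step keeps the test fast.
checks = 0
elapsed = responsive_sleep(30, step: 0.01) do
  checks += 1
  checks > 3
end
puts elapsed < 30  # true: returned as soon as the condition flipped
```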
either way, the rspec suite is intended as a unit test suite, and i think it should be kept that way. if we want to integration-test the overall behavior, we should leave that to other test suites.
I definitely see where you're coming from; I'm not trying to argue against unit tests for small pieces of functionality. For example, I agree that responsive_sleep is eminently unit testable, and I can add a unit test for that if you want.
However, I do believe a test that uses real Threads is the more useful test of the Nerve Application than testing the interaction on mocks. The reality is that we do care whether Ruby thread local variables are writable from the main thread in Ruby 2.2 vs 2.1 vs 1.9, and we do care about backwards incompatible changes to the Thread class. The whole purpose of the Nerve Application is to launch, reap, and signal with the ServiceWatcher threads, so let's make sure that actually works. I have isolated the Application from e.g. signal handling semantics and the realities of what ServiceWatcher threads actually do, but I feel I left enough reality in the suite to give me confidence when making changes to this behavior.
I guess what I'm trying to say is that I'm not arguing against unit tests, I'm claiming the unit test of the Nerve application is more useful if it uses real Threads. I agree that makes it more like an integration test than a pure unit test, but it's still testing an isolated piece of functionality and I don't believe that we'll see that many flakes as a result. Are you ok with us going with this and seeing what happens or do you strongly believe that for it to be a unit test it can't spawn threads?
Alright, I've got some train hacking to do. @igor47 I'll try to have the next revision ready for you to review tomorrow. Ignore my latest comment; if we push the threading down a level then the app doesn't have to care.
@ubiquitousthey I'm aiming to get this into a state it can be merged in tomorrow or maybe the next day, then it will take me a day to roll it out at Yelp, and if everything goes well I may be able to circle back by Friday to get a new version cut.
@igor47 I went ahead and did the refactor you suggested to push the threading into the ServiceWatcher. I believe that I've managed to do that without breaking compatibility with existing ServiceWatchers, and it made writing tests a bit easier as well.
The nested expect that closes over an iteration counter is a tad strange, but I wasn't quite sure how else to test that subsequent iterations of the main loop handle the state changes correctly. I sorta like the resulting tests, though.
👍 ?
👍 from me.
the tests look much better now. thanks for doing the refactor! i left a few minor comments. my sense is the code is ready, although the tests could still be improved a bit (i'm particularly concerned about what seems to be a lack of verifying doubles, even though they seem to be turned on here: https://github.com/airbnb/nerve/blob/master/spec/spec_helper.rb#L18-L22).
@igor47 I feel I have integrated your suggestions and commented on the ones I didn't think needed to change. I believe this is pretty ready for merge. The ServiceWatcher refactor has made it clean and by switching to instance_doubles I checked that I did indeed get verified doubles. Also I've been doing fairly significant manual local testing. I kind of want to get this merged and deployed on real hardware.
Once it's deployed and I know it works on production I can loop back and cut a new release (if I find bugs I'll write regression tests on them and fix before cutting the release).
this PR is epic. nice work!
@igor47 is that a 👍 ?
Sweet, thanks @igor47 for all the review! I'm going to roll this out at Yelp and see what happens. If good things happen, I'll circle back and cut a release. I don't think anything is backwards-incompatible on either the watcher or the config side.
@ubiquitousthey this should be available in 0.8.0, let me know if you run into any issues.
Also this should close out #62 ... only a year late
Thanks @jolynch. I will give it a try.
@igor47 @jnb This PR should enable Nerve to dynamically add and remove registrations in response to a SIGHUP
Why? The zero-downtime Nerve restart method, which starts duplicate nerves and relies on Synapse to de-duplicate registrations, works until you get to hundreds of services per box. At that point, for every new service you schedule, you have to send hundreds of ZK push updates across the entire infrastructure just to have Synapse go "oh yup, it's the same". For us at Yelp this manifests as lots of Zookeeper "churn" (causing lots of snapshots), which has caused some instability.
How? This patch changes the main loop to listen for a SIGHUP, and when it receives one it diffs the new configuration against the in memory state, launching and reaping watchers as needed.
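The diffing step can be sketched like this (method and variable names are illustrative, not Nerve's exact internals): any service whose config changed or disappeared gets its watcher reaped, and any new or changed service gets a watcher launched.

```ruby
# Compare the freshly parsed config against the watchers currently running.
# Returns [names_to_reap, names_to_launch]; a changed service appears in both.
def diff_configs(running, desired)
  changed   = (running.keys | desired.keys).select { |n| running[n] != desired[n] }
  to_reap   = changed & running.keys   # removed or changed: shut the watcher down
  to_launch = changed & desired.keys   # new or changed: start a fresh watcher
  [to_reap, to_launch]
end

running = { 'svc_a' => { 'port' => 1 }, 'svc_b' => { 'port' => 2 } }
desired = { 'svc_a' => { 'port' => 1 },   # unchanged: left alone
            'svc_b' => { 'port' => 3 },   # changed: reap then relaunch
            'svc_c' => { 'port' => 4 } }  # new: launch

reap, launch = diff_configs(running, desired)
puts reap.inspect    # ["svc_b"]
puts launch.inspect  # ["svc_b", "svc_c"]
```

A real reload would also need the launch-before-reap ordering discussed earlier to avoid a registration gap for changed services.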
Amazingly all the specs still pass without me changing them, which makes me think that I've got some tests to write. This diff is full of some subtle ordering constraints (signals + threads + methods with side effects galore), so I'd like to get some eyes on it as I'm writing specs.
I also think I may have fixed a bug with shutdown where the main thread used to wait forever (.join without a timeout) if it couldn't reap a watcher... now we give watchers 10s to go away before we Thread.kill them.
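That bounded-join shutdown can be sketched like so (`reap_watcher` is an illustrative name, not Nerve's exact method): `Thread#join` with a timeout returns nil if the thread is still alive, at which point we kill it instead of waiting forever.

```ruby
# Give the watcher thread a bounded grace period to exit; if it doesn't,
# Thread.kill it so shutdown can never hang on one stuck watcher.
def reap_watcher(thread, grace: 10)
  if thread.join(grace).nil?   # Thread#join(timeout) returns nil on timeout
    Thread.kill(thread)
    false  # had to be killed
  else
    true   # exited cleanly within the grace period
  end
end

stuck = Thread.new { sleep }           # never exits on its own
puts reap_watcher(stuck, grace: 0.1)   # false

quick = Thread.new { }                 # exits immediately
puts reap_watcher(quick, grace: 1)     # true
```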