flupke / rainbow-saddle

A wrapper around gunicorn to handle graceful restarts correctly

Sometimes HUP puts rainbow-saddle in state where gunicorn is stopped #1

Closed · asavoy closed this 9 years ago

asavoy commented 9 years ago

Firstly, I'm not sure if rainbow-saddle is being used/maintained, and whether feedback is desired :)

Presuming an affirmative to the above: on my project, I'm seeing that roughly 1 in 4 deployments involving a HUP restart leaves rainbow-saddle in a state where gunicorn is no longer running.

The sequence of events looks a bit like this (a rough code sketch follows the list):

  1. The rainbow-saddle/gunicorn process for the site is already running under /srv/myapp/current, which is a symlink. The command is:

    /srv/myapp/shared/env/bin/rainbow-saddle /srv/myapp/shared/env/bin/gunicorn wsgi:application -c /srv/myapp/shared/gunicorn_config.py --chdir /srv/myapp/current
  2. Write the new version of the site into a new directory.
  3. Update symlink /srv/myapp/current to point to the new version of the site.
  4. Send HUP to the rainbow-saddle process, using something like supervisorctl pid myapp | xargs kill -s HUP.
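For concreteness, here is steps 2-4 sketched in Python; the release path, the symlink swap, and the supervisor PID lookup are illustrative stand-ins, not code from our actual deploy scripts:

    import os
    import signal
    import subprocess

    # Hypothetical new release directory (step 2 writes the new code there)
    release_dir = '/srv/myapp/releases/20160101120000'
    current = '/srv/myapp/current'

    # Step 3: repoint the symlink atomically (create-then-rename)
    os.symlink(release_dir, current + '.tmp')
    os.rename(current + '.tmp', current)

    # Step 4: HUP rainbow-saddle so it swaps in a fresh gunicorn
    pid = int(subprocess.check_output(['supervisorctl', 'pid', 'myapp']))
    os.kill(pid, signal.SIGHUP)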

Usually this works, but sometimes the gunicorn process turns out to have stopped (I don't see any subprocess of rainbow-saddle in htop), and this appears in STDERR:

------------------------------------------------------------------------------
Starting new arbiter
------------------------------------------------------------------------------
Uncaught exception in signal handler <function restart_arbiter at 0x1c09cf8>
Traceback (most recent call last):
  File "/srv/myapp/shared/env/local/lib/python2.7/site-packages/rainbowsaddle/__init__.py", line 22, in wrapper
    return func(*args, **kwargs)
  File "/srv/myapp/shared/env/local/lib/python2.7/site-packages/rainbowsaddle/__init__.py", line 56, in restart_arbiter
    os.kill(self.arbiter_pid, signal.SIGUSR2)
OSError: [Errno 3] No such process
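
As far as I can tell, errno 3 here is ESRCH, which os.kill() raises when the target PID no longer exists. A minimal illustration (not rainbow-saddle code):

    import errno
    import os
    import signal

    try:
        # a PID that is almost certainly not in use on this machine
        os.kill(2 ** 22 + 1, signal.SIGUSR2)
    except OSError as e:
        assert e.errno == errno.ESRCH  # "No such process"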

If I try to send HUP again, I'll get that same output again.

Some other things to help understand the problem:

Wondering if there are any ideas for what I should do that might help unearth the problem?

flupke commented 9 years ago

Sorry for the late reply, I missed the notification :(

That means gunicorn's main process (what they call the arbiter) is not running. Could something in your deploy script have stopped it before the HUP signal?

And yes, we use rainbow-saddle in production on a fairly busy app, but we had so many problems with gunicorn 0.19+ that we stick to gunicorn 0.16.1 (my fork, which fixes an issue with unix sockets: https://github.com/flupke/gunicorn/tree/0.16.1_keep_unix_sockets) and rainbow-saddle 0.1.1.

asavoy commented 9 years ago

Thanks for the reply!

I can't see any obvious reason for the arbiter to stop. I've tried to reproduce it with reruns of deploys and restarts, with no luck. So I'll add logging and hope to get enough information the next time it happens. If anything interesting comes up, I'll report back here.

asavoy commented 9 years ago

I think I've gotten to the bottom of this. When rainbow-saddle signals USR2 to gunicorn, gunicorn tries to os.chdir() into the path it was originally started from (despite being given the --chdir argument). That raises an OSError when the path has been removed since startup, which is exactly what happens in our case: we do Capistrano-style deploys, where each deployment is pushed into a new directory and old directories get cleaned up. The new gunicorn process dies, rainbow-saddle terminates the old one, and the server is left in an inaccessible state.
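
Here is a standalone snippet that reproduces the underlying failure (it uses temp dirs, not our deploy layout):

    import os
    import shutil
    import tempfile

    start_dir = tempfile.mkdtemp()
    os.chdir(start_dir)       # pretend gunicorn was launched from here
    os.chdir('/')             # --chdir moves the process elsewhere...
    shutil.rmtree(start_dir)  # ...then deploy cleanup removes the old dir

    try:
        os.chdir(start_dir)   # what gunicorn attempts on USR2 re-exec
    except OSError as e:
        print('re-exec dies here: %s' % e)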

The workaround is to ensure the rainbow-saddle command is executed from a path that won't get removed, and to keep using gunicorn's --chdir argument to point at the current release.
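
One way to do that is a small launcher along these lines (paths are illustrative): pin the working directory to a stable location before exec'ing rainbow-saddle, so the re-exec'd gunicorn never depends on a doomed release dir.

    #!/usr/bin/env python
    import os

    # A directory that survives deploys (illustrative path)
    os.chdir('/srv/myapp/shared')

    # Replace this process with rainbow-saddle; gunicorn's --chdir
    # still points the workers at the current release symlink
    os.execv('/srv/myapp/shared/env/bin/rainbow-saddle', [
        'rainbow-saddle',
        '/srv/myapp/shared/env/bin/gunicorn',
        'wsgi:application',
        '-c', '/srv/myapp/shared/gunicorn_config.py',
        '--chdir', '/srv/myapp/current',
    ])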

flupke commented 9 years ago

Thank you, great detective work. I'll add notes about this to the readme.