lkraav closed this issue 10 years ago.
I think I found the bug; the culprit is the shutdown of Rocket. Basically, Rocket never stops, which is also why the unit tests are failing at the moment. I made a minimal reproduction case here:
```python
from wsgiref.util import setup_testing_defaults
from rocket import Rocket
from time import sleep
import socket


def webserver_on(host, port):
    """Return True if something accepts TCP connections on (host, port)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.connect((host, port))
        s.shutdown(socket.SHUT_RDWR)
        s.close()
        return True
    except socket.error:
        return False


def simple_app(environ, start_response):
    setup_testing_defaults(environ)
    status = '200 OK'
    headers = [('Content-type', 'text/plain')]
    start_response(status, headers)
    ret = ["%s: %s\n" % (key, value)
           for key, value in environ.items()]
    return ret


def test_startup_shutdown_leak_rocket():
    p = 3333
    r = Rocket(('localhost', p), app_info={'wsgi_app': simple_app})
    r.start(True)

    maxattempt = 5
    while not webserver_on('localhost', p):
        print("Waiting for the server to start %i" % maxattempt)
        sleep(0.5)
        maxattempt -= 1
        if not maxattempt:
            raise Exception("Rocket did not start")

    r.stop()

    maxattempt = 5
    while webserver_on('localhost', p):
        print("Waiting for the server to stop %i" % maxattempt)
        sleep(0.5)
        maxattempt -= 1
        if not maxattempt:
            raise Exception("Rocket never stopped!")


if __name__ == '__main__':
    test_startup_shutdown_leak_rocket()
```
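A side note on the helper: `connect()` in `webserver_on` has no timeout, so the probe itself can hang against a host that silently drops packets. A bounded Python 3 variant using `socket.create_connection` (a sketch, not part of the original repro):

```python
import socket


def webserver_on(host, port, timeout=0.5):
    # Probe (host, port) for a listening TCP server without risking a
    # hang: create_connection() applies the timeout to the connect call,
    # and only socket-level OSError is swallowed.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```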
I believe the issue originates in `listener.py` of Rocket. In `listen` we find the following piece of code:
```python
            except socket.timeout:
                # socket.timeout will be raised every THREAD_STOP_CHECK_INTERVAL
                # seconds. When that happens, we check if it's time to die.
                if not self.ready:
                    if __debug__:
                        self.err_log.debug('Listener exiting.')
                    return
                else:
                    continue
            except:
                self.err_log.error(str(traceback.format_exc()))
```
However, this `socket.timeout` is never raised when running the code above, which reproduces the issue.
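To make the intended mechanism concrete, here is a standalone model of that timeout-driven accept loop (`listen_until_stopped` and the `state` dict are illustrative names, not Rocket's API): the listening socket gets a timeout, so a blocked `accept()` raises `socket.timeout` every interval, giving the loop a chance to check a stop flag.

```python
import socket
import threading


def listen_until_stopped(sock, state, check_interval=0.1):
    # Model of Rocket's Listener.listen() loop shape: settimeout() makes
    # accept() raise socket.timeout every check_interval seconds, which
    # is the loop's chance to notice that the stop flag went False.
    sock.settimeout(check_interval)
    while True:
        try:
            conn, _addr = sock.accept()
            conn.close()
        except socket.timeout:
            if not state['ready']:
                return  # time to die
            # otherwise fall through and keep waiting for connections
```

Flipping `state['ready']` to `False` makes the thread exit within roughly one `check_interval`, which is what Rocket's `l.ready = False` relies on.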
@zoni, this patch on Rocket makes my little test pass (it forcibly closes the listening socket). Do you think it is the correct way to do that?
```diff
diff --git a/rocket/listener.py b/rocket/listener.py
index 7b54bea..157663b 100644
--- a/rocket/listener.py
+++ b/rocket/listener.py
@@ -113,40 +113,44 @@ class Listener(Thread):
             # secure socket. We don't do anything because it will be detected
             # by Worker and dealt with appropriately.
             pass
-
+
         return sock
 
     def start(self):
         if not self.ready:
             self.err_log.warning('Listener started when not ready.')
             return
-
+
         if self.thread is not None and self.thread.isAlive():
             self.err_log.warning('Listener already running.')
             return
-
+
         self.thread = Thread(target=self.listen, name="Port" + str(self.port))
-
+
         self.thread.start()
-
+
+    def stop(self):
+        self.ready = False
+        self.listener.shutdown(socket.SHUT_RDWR)
+
     def isAlive(self):
         if self.thread is None:
             return False
-
+
         return self.thread.isAlive()
 
     def join(self):
         if self.thread is None:
             return
-
+
         self.ready = False
-
+
         self.thread.join()
-
+
         del self.thread
         self.thread = None
         self.ready = True
-
+
     def listen(self):
         if __debug__:
             self.err_log.debug('Entering main loop.')
@@ -161,15 +165,13 @@
                                                    self.interface[1],
                                                    self.secure))
 
-            except socket.timeout:
+            except:
                 # socket.timeout will be raised every THREAD_STOP_CHECK_INTERVAL
                 # seconds. When that happens, we check if it's time to die.
-
                 if not self.ready:
                     if __debug__:
                         self.err_log.debug('Listener exiting.')
                     return
                 else:
+                    self.err_log.error(str(traceback.format_exc()))
                     continue
-            except:
-                self.err_log.error(str(traceback.format_exc()))
diff --git a/rocket/main.py b/rocket/main.py
index b8d7a01..baf4499 100644
--- a/rocket/main.py
+++ b/rocket/main.py
@@ -159,7 +159,7 @@ class Rocket(object):
         try:
             # Stop listeners
             for l in self.listeners:
-                l.ready = False
+                l.stop()
 
             # Encourage a context switch
             time.sleep(0.01)
```
With that patch applied, I got an AttributeError about `stop()` not being defined. However, while playing around with it, I noticed that my earlier observation was wrong; it appears `socket.timeout` is being raised at the right times after all. This gave me the idea, however, to forcibly close the socket at that point, like in your patch, as so:
```diff
diff --git a/rocket/listener.py b/rocket/listener.py
index e1a522b..3d2e9b1 100644
--- a/rocket/listener.py
+++ b/rocket/listener.py
@@ -165,6 +165,7 @@ class Listener(Thread):
                 if not self.ready:
                     if __debug__:
                         self.err_log.debug('Listener exiting.')
+                    self.listener.shutdown(socket.SHUT_RDWR)
                     return
                 else:
                     continue
```
This also makes the reproduction case stop failing! I then ran some more tests and, as far as I could see, it doesn't appear that this causes any bad side effects. I made some long-running requests and stopped Rocket while these requests were still being processed; the server nicely waited for the requests to complete before shutting down entirely, and didn't cut off connections in the middle. So I think this is the correct way to do it.
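One portability note: `shutdown()` on a socket that is listening (rather than connected) is platform-dependent. On Linux it succeeds and takes the socket out of the listening state, while some platforms raise `OSError` (ENOTCONN). A defensive sketch of the same idea (`stop_listener` is a hypothetical helper, not Rocket code):

```python
import socket


def stop_listener(sock):
    # Force a listening socket down, as the patch above does. On Linux,
    # shutdown() succeeds on a socket in the LISTEN state; on platforms
    # where it raises OSError (ENOTCONN), close() alone does the job.
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass
    sock.close()
```

After `stop_listener()` runs, a probe like `webserver_on()` from the reproduction case sees the port as closed immediately.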
I'm wondering what we should do about this with regard to waiting for upstream. Seeing that I reported a bug (with a patch included) literally at the start of this year and it hasn't had a single comment, we might need to fork and patch Rocket ourselves so that we don't end up delaying our own release for who knows how long.
I'd also rather not give up on Rocket per se, seeing that it gives us two huge benefits compared to many other Python WSGI servers, among them that it doesn't force us through a blocking `serve()`/`run()` entry point, so it's a lot easier to integrate cleanly as a plugin. Thoughts?
Could we embed a patched Rocket?
Yeah, I suppose that would work. It might be slightly nicer to put it in a separate repository though, just so other people may find it more easily and benefit from the work as well?
Okay, I threw up a patched Rocket for now. Anyone wanting to get rid of this bug right now, without waiting for the next release of Err, can just `pip install https://github.com/zoni/Rocket/archive/master.zip`.
I can confirm it fixes the webhooks hanging after a restart. Oddly, it doesn't fix our tests, though; those are still giving odd errors, so we'll need to investigate that further as well.
Yes, I was also debugging that on my side yesterday. The patched Rocket is only half of the problem: during reconfiguration it looks like the WebServer is not being deactivated and activated again.
Well, I traced it down, and the webserver plugin is getting deactivated and activated again correctly.
For some bizarre reason, the tests also pass fine when run one file at a time. At this point, the only conclusion I have is that nose is doing something odd, but I don't know. I worked around it by altering our test script, which is ugly, but at least it works. See #180.
I'm also convinced that patching Rocket fixed the original issue, regardless of what the tests were saying, so I say we close this issue as soon as the next version ships.
Thanks for the investigation!
This is on the current master HEAD.