errbotio / errbot

Errbot is a chatbot, a daemon that connects to your favorite chat service and bring your tools and some fun into the conversation.
http://errbot.io
GNU General Public License v3.0
3.13k stars 615 forks source link

All webhooks hang after doing !restart #175

Closed lkraav closed 10 years ago

lkraav commented 11 years ago

This is on current master HEAD

gbin commented 11 years ago

I think I found the bug, the culprit is the shutdown of Rocket, I made a minimal reproduction case here:

Basically Rocket never stops, this is also why the unit tests are failing at the moment.

from wsgiref.util import setup_testing_defaults
from rocket import Rocket
from time import sleep
import socket

def webserver_on(host, port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.connect((host, port))
        s.shutdown(socket.SHUT_RDWR)
        s.close()
        return True
    except:
        return False

def simple_app(environ, start_response):
    setup_testing_defaults(environ)

    status = '200 OK'
    headers = [('Content-type', 'text/plain')]

    start_response(status, headers)

    ret = ["%s: %s\n" % (key, value)
           for key, value in environ.iteritems()]
    return ret

def test_startup_shutdown_leak_rocket():
    p = 3333
    r = Rocket(('localhost', p),  app_info={'wsgi_app': simple_app})
    r.start(True)
    maxattempt = 5
    while not webserver_on('localhost', p):
        print("Waiting for the server to start %i" % maxattempt)
        sleep(0.5)
        maxattempt -= 1
        if not maxattempt:
            raise Exception("Rocket did not start")
    r.stop()
    maxattempt = 5
    while webserver_on('localhost', p):
        print("Waiting for the server to stop %i" % maxattempt)
        sleep(0.5)
        maxattempt -= 1
        if not maxattempt:
            raise Exception("Rocket never stopped !")

if __name__ == '__main__':
    test_startup_shutdown_leak_rocket()
zoni commented 11 years ago

I believe the issue originates in listener.py of Rocket. In listen we find the following piece of code:

            except socket.timeout:
                # socket.timeout will be raised every THREAD_STOP_CHECK_INTERVAL
                # seconds.  When that happens, we check if it's time to die.

                if not self.ready:
                    if __debug__:
                        self.err_log.debug('Listener exiting.')
                    return
                else:
                    continue
            except:
                self.err_log.error(str(traceback.format_exc()))

However, this socket.timeout is never raised when running the above code which reproduces the issue.

gbin commented 11 years ago

@zoni

This patch on rocket makes my little test pass (forcibly close the listening socket). Do you think it is the correct way to do that ?

diff --git a/rocket/listener.py b/rocket/listener.py
index 7b54bea..157663b 100644
--- a/rocket/listener.py
+++ b/rocket/listener.py
@@ -113,40 +113,44 @@ class Listener(Thread):
             # secure socket. We don't do anything because it will be detected
             # by Worker and dealt with appropriately.
             pass
-        
+
         return sock

     def start(self):
         if not self.ready:
             self.err_log.warning('Listener started when not ready.')
             return
-            
+
         if self.thread is not None and self.thread.isAlive():
             self.err_log.warning('Listener already running.')
             return
-            
+
         self.thread = Thread(target=self.listen, name="Port" + str(self.port))
-        
+
         self.thread.start()
-    
+
+    def stop(self):
+        self.ready = False
+        self.listener.shutdown(socket.SHUT_RDWR)
+
     def isAlive(self):
         if self.thread is None:
             return False
-        
+
         return self.thread.isAlive()

     def join(self):
         if self.thread is None:
             return
-            
+
         self.ready = False
-        
+
         self.thread.join()
-        
+
         del self.thread
         self.thread = None
         self.ready = True
-    
+
     def listen(self):
         if __debug__:
             self.err_log.debug('Entering main loop.')
@@ -161,15 +165,13 @@ class Listener(Thread):
                                        self.interface[1],
                                        self.secure))

-            except socket.timeout:
+            except:
                 # socket.timeout will be raised every THREAD_STOP_CHECK_INTERVAL
                 # seconds.  When that happens, we check if it's time to die.
-
                 if not self.ready:
                     if __debug__:
                         self.err_log.debug('Listener exiting.')
                     return
                 else:
+                    self.err_log.error(str(traceback.format_exc()))
                     continue
-            except:
-                self.err_log.error(str(traceback.format_exc()))
diff --git a/rocket/main.py b/rocket/main.py
index b8d7a01..baf4499 100644
--- a/rocket/main.py
+++ b/rocket/main.py
@@ -159,7 +159,7 @@ class Rocket(object):
         try:
             # Stop listeners
             for l in self.listeners:
-                l.ready = False
+                l.stop()

             # Encourage a context switch
             time.sleep(0.01)
zoni commented 11 years ago

With that patch applied, I got an AttributeError about stop() not being defined.

However, while playing around with it, I noticed that my earlier observation was wrong, and it appears socket.timeout is being raised at the right times after all.

This gave me the idea however to forcibly close the socket at that point, like in your patch, as in so:

diff --git a/rocket/listener.py b/rocket/listener.py
index e1a522b..3d2e9b1 100644
--- a/rocket/listener.py
+++ b/rocket/listener.py
@@ -165,6 +165,7 @@ class Listener(Thread):
                 if not self.ready:
                     if __debug__:
                         self.err_log.debug('Listener exiting.')
+                    self.listener.shutdown(socket.SHUT_RDWR)
                     return
                 else:
                     continue

This also still makes the reproduction case stop failing! I then ran some more testing, and as far as I could see, it doesn't appear like this causes any bad side effects.

I made some long running requests, and had Rocket stopping while these requests were still being processed. The server nicely waited to complete the requests before completely shutting down, and didn't cause connections to be cut off in the middle.

So I think this is the correct way to do it.

zoni commented 11 years ago

I'm wondering what we should do about this with regards to waiting for upstream.. Seeing I reported a bug as well literally at the start of this year with a patch included, and it hasn't had a single comment, we might need to fork and patch Rocket ourselves so that we don't end up delaying our own release for who knows how long.

I'd also rather not give up on Rocket per se, seeing it gives us two huge benefits compared to many other Python WSGI servers:

Thoughts?

gbin commented 11 years ago

We can embed a patched rocket ?

zoni commented 11 years ago

Yeah, I suppose that would work. It might be slightly nicer to put it in a separate repository though, just so other people may find it more easily and benefit from the work as well?

zoni commented 11 years ago

Okay, I threw up a patched Rocket for now. Anyone wanting to get rid of this bug for right now, without waiting for the next release of Err, could just pip install https://github.com/zoni/Rocket/archive/master.zip.

I can confirm it fixes the webhooks hanging after a restart. Oddly, it doesn't fix our tests though, those are still giving odd errors. So we'll need to investigate that further as well.

gbin commented 11 years ago

Yes I was debugging that also on my side yesterday, the patched rocket is half of the problem : during reconfiguration it looks like the WebServer is not deactivated and activated again.

zoni commented 11 years ago

Well, I traced it down, and the webserver plugin is getting deactivated and activated again correctly.

For some bizarre reason, the tests also pass fine when run one file at a time. At this point, the only conclusion I have is that nose is doing something odd, but I dunno.. I worked around it by altering our test script, which is ugly, but at least it works. See #180.

I'm also convinced that patching rocket fixed the original issue, irregardless of what the tests were saying, so I say we close this issue as soon as the next version ships.

gbin commented 11 years ago

Thx for the investigation !