buildbot / buildbot

Python-based continuous integration testing framework; your pull requests are more than welcome!
https://www.buildbot.net
GNU General Public License v2.0

Buildbot master crashed abruptly #3614

Open aj062 opened 7 years ago

aj062 commented 7 years ago

I am trying a simple multi-master configuration with two masters: one running the web server, and the other handling the rest.

The Buildbot web server stopped abruptly (with the message below in the logs). It seemed to happen just after an API request with an invalid builder name.

http.log in buildbot directory:

4676 "127.0.0.1" - - [13/Sep/2017:13:04:02 +0000] "GET /api/v2/builders/MyBuilder%20Release%20Builder HTTP/1.0" 404 60 "https://build.domainname.com/"

Note: Multiple similar API requests with invalid builder names are present in the logs, but those didn't cause buildbot to crash.

twistd.log:

2017-09-13 13:02:51-0700 [-] added buildset 123 to database
2017-09-13 13:04:03-0700 [WampWebSocketClientProtocol,client] Guru meditation! We have been disconnected from wamp server
2017-09-13 13:04:03-0700 [WampWebSocketClientProtocol,client] We don't know how to recover this without restarting the whole system
2017-09-13 13:04:03-0700 [WampWebSocketClientProtocol,client] CloseDetails(reason=<wamp.close.transport_lost>, message='WAMP transport was lost without closing the session before')
2017-09-13 13:04:03-0700 [-] Cancelling 1 outstanding requests
2017-09-13 13:04:03-0700 [-] Unhandled error in Deferred:
2017-09-13 13:04:03-0700 [-] Unhandled Error 
        Traceback (most recent call last):
        Failure: autobahn.wamp.exception.TransportLost: 

2017-09-13 13:04:03-0700 [-] Stopping factory <autobahn.twisted.websocket.WampWebSocketClientFactory object at 0x7fe5b8c23910>
2017-09-13 13:04:03-0700 [-] doing housekeeping for master 1 buildbot.domainname.com:/var/buildbot/Internal/Tools/BuildAutomation/buildbot-webserver
2017-09-13 13:04:03-0700 [-] while publishing event org.buildbot.mq.masters.1.stopped
        Traceback (most recent call last):
          File "/usr/lib/python2.7/site-packages/buildbot/mq/wamp.py", line 41, in produce
            d = self._produce(routingKey, data) 
          File "/usr/lib/python2.7/site-packages/buildbot/mq/wamp.py", line 62, in _produce
            return self.master.wamp.publish(self.messageTopic(routingKey), _data, options=options)
          File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 1532, in unwindGenerator
            return _inlineCallbacks(None, gen, Deferred())
          File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
            result = g.send(result)
        --- <exception caught here> ---
          File "/usr/lib/python2.7/site-packages/buildbot/wamp/connector.py", line 114, in publish
            ret = yield service.publish(topic, data, options=options)
          File "/usr/lib/python2.7/site-packages/autobahn/wamp/protocol.py", line 1228, in publish
            raise exception.TransportLost()
        autobahn.wamp.exception.TransportLost: 

2017-09-13 13:04:03-0700 [-] Initiating clean shutdown
2017-09-13 13:04:03-0700 [-] No running jobs, starting shutdown immediately
2017-09-13 13:04:03-0700 [-] (TCP Port 8010 Closed)
2017-09-13 13:04:03-0700 [-] Stopping factory <buildbot.www.service.BuildbotSite instance at 0x4e51290>
2017-09-13 13:04:04-0700 [-] BuildMaster is stopped

Crossbar also logged the error below (in /var/log/messages) at the exact same time, but crossbar seems to have continued working.

/var/log/messages :

Sep 13 13:04:03 crossbar: 2017-09-13T13:04:02-0700 [Router       2631] Traceback (most recent call last):
Sep 13 13:04:03 crossbar: File "/usr/lib/python2.7/site-packages/autobahn/wamp/websocket.py", line 88, in onMessage
Sep 13 13:04:03 crossbar: for msg in self._serializer.unserialize(payload, isBinary):
Sep 13 13:04:03 crossbar: File "/usr/lib/python2.7/site-packages/autobahn/wamp/serializer.py", line 131, in unserialize
Sep 13 13:04:03 crossbar: msg = Klass.parse(raw_msg)
Sep 13 13:04:03 crossbar: File "/usr/lib/python2.7/site-packages/autobahn/wamp/message.py", line 1972, in parse
Sep 13 13:04:03 crossbar: topic = check_or_raise_uri(wmsg[3], u"'topic' in SUBSCRIBE", allow_empty_components=True)
Sep 13 13:04:03 crossbar: File "/usr/lib/python2.7/site-packages/autobahn/wamp/message.py", line 223, in check_or_raise_uri
Sep 13 13:04:03 crossbar: raise ProtocolError(u"{0}: invalid value '{1}' for URI (did not match pattern {2}, strict={3}, allow_empty_components={4}, allow_last_empty={5}, allow_none={6})".f
Sep 13 13:04:03 crossbar: ProtocolError: 'topic' in SUBSCRIBE: invalid value 'org.buildbot.mq.builders.MyBuilder Release Builder.' for URI (did not match pattern ^(([^\s\.#]+\.)|\.)*([^\s\.#

Buildbot should be more reliable and shouldn't crash.

tardyp commented 7 years ago

It looks like crossbar closes the connection when it hits a protocol error. I would say this is a crossbar/autobahn issue. It is very difficult to recover properly from a lost connection: there are a lot of race conditions, and any number of messages can be lost during the reconnection. This is why buildbot just shuts down the master when it is disconnected. This is not an impossible task, though. Any help is appreciated, as well as proper testing.
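For illustration, here is a rough sketch of why the topic in the crossbar log gets rejected. It assumes a simplified rule taken from the visible part of the truncated pattern (a topic component may not contain whitespace, `.`, or `#`); it is not autobahn's actual check, and `invalid_components` is a hypothetical helper name.

```python
import re

# Assumption: simplified stand-in for the URI check, built from the visible
# part of the truncated pattern in the crossbar log. A component may not
# contain whitespace, '.', or '#'. This is not autobahn's actual code.
_COMPONENT = re.compile(r"^[^\s.#]+\Z")


def invalid_components(topic):
    """Return the non-empty dotted components of `topic` that break the rule."""
    return [
        part
        for part in topic.split(".")
        if part and not _COMPONENT.match(part)  # empty components are skipped
    ]


topic = "org.buildbot.mq.builders.MyBuilder Release Builder."
print(invalid_components(topic))  # ['MyBuilder Release Builder']
```

The spaces in the builder name end up inside a single topic component, which is what the SUBSCRIBE message in the log trips over.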

I think that, to solve your issue quickly, you can just change your builder names to be identifiers: change "MyBuilder Release Builder" to "MyBuilder_Release_Builder".
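A minimal sketch of that rename, assuming builder names are plain strings in your master configuration. `to_identifier` and `find_unsafe_names` are hypothetical helpers written for this example, not Buildbot APIs.

```python
import re

# Hypothetical helpers, not part of Buildbot.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*\Z")
_INVALID_RUN = re.compile(r"[^A-Za-z0-9_]+")


def to_identifier(name):
    """Replace each run of non-identifier characters with one underscore."""
    cleaned = _INVALID_RUN.sub("_", name)
    if not cleaned:
        raise ValueError("builder name %r is empty" % (name,))
    if cleaned[0].isdigit():
        # Identifiers cannot start with a digit, so prefix an underscore.
        cleaned = "_" + cleaned
    return cleaned


def find_unsafe_names(names):
    """Return the names that are not already identifier-like."""
    return [n for n in names if not _IDENTIFIER.match(n)]


names = ["MyBuilder Release Builder", "docs_builder"]
for name in find_unsafe_names(names):
    print("%r -> %r" % (name, to_identifier(name)))
```

Running a check like this over the builder names before starting the master would flag names with spaces early, before they reach the WAMP router as topic components.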