compose / governor

Runners to orchestrate a high-availability PostgreSQL
MIT License

not catching ssl timeout exception #37

Closed cakester closed 8 years ago

cakester commented 8 years ago

Governor is not catching the ssl.SSLError raised when the etcd request in touch_member times out:

    return self.etcd.touch_member(self.state_handler.ip)
  File "/mnt/bludata0/blumeta0/home/db2inst1/governor/helpers/etcd.py", line 130, in touch_member
    self.put_client_path("/members/%s" % value, {"value": value, "ttl": self.ttl})
  File "/mnt/bludata0/blumeta0/home/db2inst1/governor/helpers/etcd.py", line 67, in put_client_path
    urllib2.urlopen(request, timeout=self.timeout).read()
  File "/usr/local/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/lib/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/usr/local/lib/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/usr/local/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python2.7/urllib2.py", line 1240, in https_open
    context=self._context)
  File "/usr/local/lib/python2.7/urllib2.py", line 1200, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/local/lib/python2.7/httplib.py", line 1073, in getresponse
    response.begin()
  File "/usr/local/lib/python2.7/httplib.py", line 415, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python2.7/httplib.py", line 371, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/local/lib/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/local/lib/python2.7/ssl.py", line 714, in recv
    return self.read(buflen)
  File "/usr/local/lib/python2.7/ssl.py", line 608, in read
    v = self._sslobj.read(len or 1024)
ssl.SSLError: ('The read operation timed out',)

Winslett commented 8 years ago

Specifically regarding ssl.SSLError raised from the touch_member method:

These errors should be handled differently depending on where they occur. If the error happens at initialization, we should rescue it, but we should also report that the SSL attempt was flaky.
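A minimal sketch of what the initialization case might look like, assuming the etcd helper exposes touch_member as in the traceback. The wrapper name touch_member_at_init and the logger name are hypothetical and not part of Governor:

```python
import logging
import ssl

logger = logging.getLogger("governor")  # hypothetical logger name


def touch_member_at_init(etcd, ip):
    """Register this member at startup while tolerating a flaky SSL call.

    Returns True if the write went through, False if ssl.SSLError was raised.
    """
    try:
        etcd.touch_member(ip)
        return True
    except ssl.SSLError as exc:
        # Rescue the error so startup can continue, but make the
        # flakiness visible to whoever is operating the node.
        logger.warning("etcd SSL request failed during initialization: %s", exc)
        return False
```

Returning a boolean rather than re-raising leaves the decision about what to do next with the caller; whether startup should proceed after a failure is a policy choice the thread does not settle.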

During the HA loop, we should probably retry this command; if the retry also fails, we should gracefully degrade Governor's capabilities until etcd access has been restored.
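The retry-then-degrade idea for the HA loop could be sketched roughly as follows. Everything here except touch_member and ssl.SSLError is an assumption made for illustration: the function names, the retry count, the delay, and the "degraded" flag are hypothetical, and the thread does not say which capabilities degraded mode should switch off.

```python
import logging
import ssl
import time

logger = logging.getLogger("governor")  # hypothetical logger name


def touch_member_with_retry(etcd, ip, attempts=2, delay=0.5, sleep=time.sleep):
    """Call etcd.touch_member, retrying on ssl.SSLError.

    Returns True on success, False once every attempt has failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            etcd.touch_member(ip)
            return True
        except ssl.SSLError as exc:
            logger.warning("touch_member attempt %d/%d failed: %s",
                           attempt, attempts, exc)
            if attempt < attempts:
                sleep(delay)  # brief pause before the next attempt
    return False


def ha_cycle(etcd, ip, state, **retry_options):
    """One pass of a hypothetical HA loop; `state` is a plain dict.

    Sets state["degraded"] when etcd cannot be reached and clears it
    again as soon as a later pass succeeds.
    """
    was_degraded = state.get("degraded", False)
    state["degraded"] = not touch_member_with_retry(etcd, ip, **retry_options)
    if state["degraded"] and not was_degraded:
        logger.error("etcd unreachable; entering degraded mode")
    elif was_degraded and not state["degraded"]:
        logger.info("etcd access restored; leaving degraded mode")
    return state
```

Because the flag is recomputed on every pass, the node recovers on its own once etcd answers again, which matches the "until etcd access has been restored" requirement without a separate recovery path.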

Long story short, it's not just rescuing this additional error, but doing it properly in context.