After increasing the size of the cluster config, coco started crashing because its Redis connection timed out:
May 29 01:26:01 csBfs cocod[25477]: aioredis.errors.ConnectionClosedError: Reader at end of file
May 29 01:26:01 csBfs cocod[25477]: raise ConnectionClosedError(msg)
May 29 01:26:01 csBfs cocod[25477]: File "/usr/local/lib/python3.7/site-packages/aioredis/connection.py", line 322, in execute
May 29 01:26:01 csBfs cocod[25477]: await conn.execute("rpush", f"{name}:res", json.dumps(result))
May 29 01:26:01 csBfs cocod[25477]: File "/usr/local/lib/python3.7/site-packages/coco/worker.py", line 168, in go
May 29 01:26:01 csBfs cocod[25477]: File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
May 29 01:26:01 csBfs cocod[25477]: loop.run_until_complete(asyncio.gather(go(), scheduler.start()))
May 29 01:26:01 csBfs cocod[25477]: File "/usr/local/lib/python3.7/site-packages/coco/worker.py", line 187, in main_loop
May 29 01:26:01 csBfs cocod[25477]: self._target(*self._args, **self._kwargs)
May 29 01:26:01 csBfs cocod[25477]: File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run
May 29 01:26:01 csBfs cocod[25477]: self.run()
May 29 01:26:01 csBfs cocod[25477]: File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
May 29 01:26:01 csBfs cocod[25477]: Traceback (most recent call last):
May 29 01:26:01 csBfs cocod[25477]: Process Process-1:
@jrs65 proposes to open new connections instead of keeping old connections open:
"close the connection after the del on worker.py:L97
And open a new one before the push on worker.py:L168"
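The proposal above could be sketched roughly as follows. This is only an illustration, assuming aioredis 1.x style connections (`execute`, `close`, `wait_closed`); `push_result` and `open_conn` are hypothetical names, not existing coco code.

```python
import json


async def push_result(open_conn, name, result):
    """Push one result over a freshly opened connection, then close it.

    open_conn is a hypothetical zero-argument coroutine function that
    returns an aioredis 1.x style connection.
    """
    conn = await open_conn()
    try:
        # Same command as the failing call in the traceback.
        return await conn.execute("rpush", f"{name}:res", json.dumps(result))
    finally:
        # Close right away so the connection never sits idle long enough
        # for the server to drop it.
        conn.close()
        await conn.wait_closed()


# Assumed wiring with aioredis 1.x (the address is a placeholder):
#   import aioredis
#   await push_result(
#       lambda: aioredis.create_connection("redis://localhost"), name, result
#   )
```

Passing the connection factory in, rather than hard-coding it, keeps the helper testable without a running Redis server.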
A possible workaround is raising the timeout in /etc/redis.conf, perhaps to 0, which disables the idle-client timeout entirely.
Possibly related: the metrics endpoint now also fails often:
May 28 21:14:37 csBfs cocod[25373]: ----------------------------------------
May 28 21:14:37 csBfs cocod[25373]: BrokenPipeError: [Errno 32] Broken pipe
May 28 21:14:37 csBfs cocod[25373]: self._sock.sendall(b)
May 28 21:14:37 csBfs cocod[25373]: File "/usr/local/lib/python3.7/socketserver.py", line 799, in write
May 28 21:14:37 csBfs cocod[25373]: self.wfile.write(output)
May 28 21:14:37 csBfs cocod[25373]: File "/usr/local/lib/python3.7/site-packages/coco/metric.py", line 43, in do_GET
May 28 21:14:37 csBfs cocod[25373]: method()
May 28 21:14:37 csBfs cocod[25373]: File "/usr/local/lib/python3.7/http/server.py", line 414, in handle_one_request
May 28 21:14:37 csBfs cocod[25373]: self.handle_one_request()
May 28 21:14:37 csBfs cocod[25373]: File "/usr/local/lib/python3.7/http/server.py", line 426, in handle
May 28 21:14:37 csBfs cocod[25373]: self.handle()
May 28 21:14:37 csBfs cocod[25373]: File "/usr/local/lib/python3.7/socketserver.py", line 720, in __init__
May 28 21:14:37 csBfs cocod[25373]: self.RequestHandlerClass(request, client_address, self)
May 28 21:14:37 csBfs cocod[25373]: File "/usr/local/lib/python3.7/socketserver.py", line 360, in finish_request
May 28 21:14:37 csBfs cocod[25373]: self.finish_request(request, client_address)
May 28 21:14:37 csBfs cocod[25373]: File "/usr/local/lib/python3.7/socketserver.py", line 650, in process_request_thread
May 28 21:14:37 csBfs cocod[25373]: Traceback (most recent call last):
May 28 21:14:37 csBfs cocod[25373]: Exception happened during processing of request from ('10.1.111.8', 52190)
May 28 21:14:37 csBfs cocod[25373]: ----------------------------------------
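The BrokenPipeError above means the client closed the socket before do_GET finished writing the response. If those disconnects turn out to be harmless (for example a scraper giving up on a slow response), one way to tolerate them is sketched below. `write_ignoring_disconnect` is a hypothetical helper, not part of coco, and it only hides the symptom; it does not explain why the endpoint became slow.

```python
def write_ignoring_disconnect(wfile, payload):
    """Write payload to wfile; return False if the client already hung up."""
    try:
        wfile.write(payload)
        return True
    except (BrokenPipeError, ConnectionResetError):
        # The peer closed the socket before the response was sent.
        return False
```

Inside a handler this would replace the direct `self.wfile.write(output)` call, with the return value used to log or count dropped requests.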