codypiersall / pynng

Python bindings for Nanomsg Next Generation.
https://pynng.readthedocs.io
MIT License

Dial fails with NNGException: Connection shutdown in kubernetes #116

Open twisteroidambassador opened 1 year ago

twisteroidambassador commented 1 year ago

I started seeing this problem after migrating some apps to Kubernetes.

My original setup on bare metal is like this: There's one instance of the "responder" app, listening on a Rep0 socket, and several instances of the "requester" app, dialing the responder with Req0 sockets. All these instances run on the same host machine. Every day on a timer, the requester instances start up first, and after a few minutes the responder starts. The requester's code is like this:

async with contextlib.AsyncExitStack() as stack:
    req = stack.enter_context(pynng.Req0(dial='tcp://responder:7470'))
    while True:
        await req.asend(b'Hello')
        resp = await req.arecv()
        # do stuff

There was never a problem with requesters starting to dial before the responder starts listening. The Req0 socket simply fails the initial synchronous dial, falls back to asynchronous dialing, and eventually connects.
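For illustration, the fallback described above can be sketched without pynng at all. This is a minimal sketch under an assumption that is not confirmed in this thread: only a "connection refused" error triggers the switch to a non-blocking dial, and any other dial error propagates to the caller. `Refused`, `OtherDialError`, and `dial_with_fallback` are hypothetical names, not pynng's API.

```python
class Refused(Exception):
    """Hypothetical stand-in for a 'connection refused' dial error."""


class OtherDialError(Exception):
    """Hypothetical stand-in for any other dial error."""


def dial_with_fallback(dial_blocking, dial_nonblocking):
    """Try a blocking dial first; fall back only when it is refused.

    Assumption: errors other than Refused are not caught here, so they
    reach the caller unchanged.
    """
    try:
        return dial_blocking()
    except Refused:
        # Nobody is listening yet: let the dialer keep retrying in the background.
        return dial_nonblocking()


def refused():
    raise Refused("nobody listening yet")


def other_error():
    raise OtherDialError("some other failure")


# A refused blocking dial quietly becomes a background dial.
result = dial_with_fallback(refused, lambda: "dialing in background")

# Any other error is not handled by the fallback and propagates.
try:
    dial_with_fallback(other_error, lambda: "dialing in background")
    propagated = False
except OtherDialError:
    propagated = True
```

If the real fallback works this way, it would explain why a refused connection on bare metal is harmless while a different error is fatal.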

Then I had to migrate this setup to Kubernetes, so I made a responder deployment with one pod, a responder service pointing to the Rep0 port of the responder pod, and a requester deployment with several pods. The requesters dial the service address of the responder.

In this setup, there's a chance that the requesters' dialing attempts fail outright:

File "/app/requester.py", line 54, in do_work
  req = stack.enter_context(pynng.Req0(dial=f'tcp://{responder_host}:{responder_port}'))
File "/app/venv/lib/python3.9/site-packages/pynng/nng.py", line 938, in __init__
  super().__init__(**kwargs)
File "/app/venv/lib/python3.9/site-packages/pynng/nng.py", line 349, in __init__
  self.dial(dial, block=block_on_dial)
File "/app/venv/lib/python3.9/site-packages/pynng/nng.py", line 374, in dial
  return self.dial(address, block=True)
File "/app/venv/lib/python3.9/site-packages/pynng/nng.py", line 371, in dial
  return self._dial(address, flags=0)
File "/app/venv/lib/python3.9/site-packages/pynng/nng.py", line 390, in _dial
  check_err(ret)
File "/app/venv/lib/python3.9/site-packages/pynng/exceptions.py", line 201, in check_err
  raise exc(string, err)
pynng.exceptions.NNGException: Connection shutdown

This is an uncaught exception, and the requester basically dies without retrying. Why does this happen?
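Until the cause is understood, one possible workaround is to retry the socket construction when the dial raises. The following is a minimal sketch, assuming that retrying is acceptable for the requester. `retry_call` is a hypothetical helper, not part of pynng; it takes the operation as a callable so it can be exercised without a network.

```python
import time


def retry_call(fn, *, exceptions=(Exception,), attempts=5, delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or the attempts run out.

    Only the listed exception types are retried; the last failure is
    re-raised unchanged so the caller still sees the original error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions:
            if attempt == attempts:
                raise
            sleep(delay)


# Simulated dial that fails twice before succeeding (no network involved).
calls = []


def flaky_dial():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("Connection shutdown")
    return "connected"


outcome = retry_call(flaky_dial, exceptions=(ConnectionError,), delay=0)
```

In the requester this might look like `retry_call(lambda: pynng.Req0(dial='tcp://responder:7470'), exceptions=(pynng.exceptions.NNGException,))`. Another option that may avoid the blocking dial entirely is passing `block_on_dial=False` to the constructor; the traceback shows that value being forwarded to `dial` as `block`, though whether it sidesteps this particular error is untested here.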