aio-libs / aiopg

aiopg is a library for accessing a PostgreSQL database from the asyncio
http://aiopg.readthedocs.io
BSD 2-Clause "Simplified" License
1.4k stars 160 forks source link

aiopg times out while attempting to connect if falling back to an IPv4 address #907

Open thanatos opened 7 months ago

thanatos commented 7 months ago

Describe the bug

While attempting to connect to CockroachDB, aiopg raises TimeoutError.

To Reproduce

  1. Run CockroachDB on localhost. I am running v22.2.19 on Linux.
  2. Execute the minimal example below.
import argparse
import asyncio
import logging
from pathlib import Path

import aiopg

async def test_db_pool(dbname: str, username: str, certs: Path):
    logging.info('Creating test connection.')
    kwargs = {
        'host': 'localhost',
        'port': 26257,
        'dbname': dbname,
        'user': username,
        'sslmode': 'verify-full',
        'sslcert': certs / f'client.{username}.crt',
        'sslkey': certs / f'client.{username}.key',
        'sslrootcert': certs / 'ca.crt',
    }
    async with aiopg.connect(**kwargs) as pool:
        logging.info('Connection ready.')

def main():
    logging.basicConfig(level=logging.DEBUG)
    parser = argparse.ArgumentParser()
    parser.add_argument('--certs-dir', action='store')
    args = parser.parse_args()

    asyncio.run(test_db_pool('postgres', 'root', Path(args.certs_dir)))

if __name__ == '__main__':
    main()

Expected behavior

The example exits successfully.

Logs/tracebacks

DEBUG:asyncio:Using selector: EpollSelector
INFO:root:Creating test connection.
Traceback (most recent call last):
  File "/usr/lib/python3.11/asyncio/tasks.py", line 490, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/code/example/venv/lib/python3.11/site-packages/aiopg/connection.py", line 881, in _poll
    await asyncio.wait_for(self._waiter, timeout)
  File "/usr/lib/python3.11/asyncio/tasks.py", line 492, in wait_for
    raise exceptions.TimeoutError() from exc
TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/code/example/./aio_fail.py", line 35, in <module>
    main()
  File "/code/example/./aio_fail.py", line 31, in main
    asyncio.run(test_db_pool('postgres', 'root', Path(args.certs_dir)))
  File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/code/example/./aio_fail.py", line 21, in test_db_pool
    async with aiopg.connect(**kwargs) as pool:
  File "/code/example/venv/lib/python3.11/site-packages/aiopg/utils.py", line 82, in __aenter__
    self._obj = await self._coro
                ^^^^^^^^^^^^^^^^
  File "/code/example/venv/lib/python3.11/site-packages/aiopg/connection.py", line 1225, in _connect
    await self._poll(self._waiter, self._timeout)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/example/venv/lib/python3.11/site-packages/aiopg/connection.py", line 883, in _poll
    await asyncio.shield(self.close())
                         ^^^^^^^^^^^^
  File "/code/example/venv/lib/python3.11/site-packages/aiopg/connection.py", line 995, in close
    self._close()
  File "/code/example/venv/lib/python3.11/site-packages/aiopg/connection.py", line 977, in _close
    self._loop.remove_reader(self._fileno)
  File "/usr/lib/python3.11/asyncio/selector_events.py", line 345, in remove_reader
    return self._remove_reader(fd)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/asyncio/selector_events.py", line 290, in _remove_reader
    self._selector.modify(fd, mask, (None, writer))
  File "/usr/lib/python3.11/selectors.py", line 389, in modify
    self._selector.modify(key.fd, selector_events)
FileNotFoundError: [Errno 2] No such file or directory

Python Version

$ python --version
Python 3.11.5

aiopg Version

$ python -m pip show aiopg
Name: aiopg
Version: 1.4.0
Summary: Postgres integration with asyncio.
Home-page: https://aiopg.readthedocs.io
Author: Andrew Svetlov
Author-email: andrew.svetlov@gmail.com
License: BSD
Location: /home/royiv/code/groceries/venv/lib/python3.11/site-packages
Requires: async-timeout, psycopg2-binary
Required-by: 

OS

Arch Linux

Additional context

Cockroach is a distributed database that uses the PostgreSQL wire protocol for queries. psql is capable of connecting to it, and executing queries. (So is psycopg2.)

When running the example, there is a long pause after Creating test connection., approximately 60s.

aiopg worked a few CockroachDB versions prior to this one, but I am sorry, I did not record the working version's number.

A packet captures shows aiopg makes the connection (the 3-way handshake happens) but no data is ever transmitted. This is a bit surprising, since at this point I am not sure what Cockroach could have done to have acted differently from vanilla PG.

Code of Conduct

thanatos commented 7 months ago

since at this point I am not sure what Cockroach could have done to have acted differently from vanilla PG.

I can replicate this with vanilla Postgres & aiopg, so Cockroach can be removed from play here.

The difference is that PG listens, by default, to "localhost", and will bind to ::1 on that interface.

::1 is the first address aiopg will attempt, and the connection is made.

My CockroachDB instance, OTOH, is bound to 127.0.0.1 specifically. aiopg appears to try ::1 first (same as above with PG); this connection fails, as one would expect. aiopg correctly goes on to try 127.0.0.1, and that connection is made successfully, but aiopg hangs for some reason.

We can replicate this with vanilla PG by binding to 127.0.0.1 explicitly, like Cockroach was:

(in postgresql.conf)

listen_addresses = '127.0.0.1'

And then some slight tweaks to the example:

A slightly different example ```python3 import argparse import asyncio import logging from pathlib import Path import aiopg async def test_db_pool(dbname: str, username: str, certs: Path): logging.info('Creating test connection.') kwargs = { 'host': 'localhost', #'port': 26257, 'port': 5432, 'dbname': dbname, #'user': username, 'user': 'my-username', #'sslmode': 'verify-full', #'sslcert': certs / f'client.{username}.crt', #'sslkey': certs / f'client.{username}.key', #'sslrootcert': certs / 'ca.crt', } async with aiopg.connect(**kwargs) as pool: logging.info('Connection ready.') def main(): logging.basicConfig(level=logging.DEBUG) parser = argparse.ArgumentParser() parser.add_argument('--certs-dir', action='store') args = parser.parse_args() asyncio.run(test_db_pool('postgres', 'root', Path(args.certs_dir))) if __name__ == '__main__': main() ```

Now knowing this, we can get back to Cockroach: I had been passing --listen-addr localhost; this (apparently; this seems to be a bunch in Cockroach) causes it to bind (oddly) to 127.0.0.1. If I explicitly tell it to bind to --listen-addr '[::1]', then it binds to ::1, and aiopg doesn't have to try to fall back, so we avoid the bug.

But it's "falling back to attempting to connect over IPv4" that is the critical bit, not Cockroach. And we hit the same issue, now with PG.

thanatos commented 7 months ago

My working theory on this issue is that aiopg isn't managing the async events it is waiting on properly.

We need to call <pg conn>.poll(), during which psycopg2 will evaluate the state of the socket; it returns a value indicating what event we need to wait on.

https://github.com/aio-libs/aiopg/blob/7b01833100b83f65ba64a5ba877eabc369242b8b/aiopg/connection.py#L835-L840

I think the "problem" is that if not self._writing. Normally, we're attempting to make sure that we've not already got a loop.add_writer call pending on this socket. In that regard, the if is sensible. What I think is happening is that during poll(), PG is going to close & reopen the socket — to switch to an IPv4 connection — and when it does this, we lose our add_writer callback. (I assume b/c underneath, it's just configuring an epollfd, and that epollfd knows the socket is closed and drops the events.) However, the if prevents us from re-establishing the handler.

I added some debugging to see when _ready is invoked, and the actual state of the socket before/after poll, and what we decide to then do in this iteration of _ready:

ready(); from _poll
  sock (  me) = ('::1', 48080, 0, 0)
ready(): poll ok
  fileno, us = 6, inner = 6
  sock (  me) = ('::1', 48080, 0, 0)
ready(): POLL_WRITE
ready(): POLL_WRITE -- doing add_writer

ready(); add_reader
  sock (  me) = ('::1', 48080, 0, 0)
ready(): poll ok
  fileno, us = 6, inner = 6
  sock (peer) = ('127.0.0.1', 5432)
  sock (  me) = ('127.0.0.1', 48760)
ready(): POLL_WRITE

ready(); add_writer … ('::1', 48080, 0, 0)
  sock (  me) = ('127.0.0.1', 48760)
ready(): poll ok
  fileno, us = 6, inner = 6
  sock (peer) = ('127.0.0.1', 5432)
  sock (  me) = ('127.0.0.1', 48760)
ready(): POLL_WRITE

I'm not 100% convinced of this yet. Even if it's correct … boy do I not know how we're to fix it! It seems like the interface w/ PG doesn't give us the info we need to know that the FD has been swapped out. It's all good and fine if you're using select(2), like psycopg2's docs use in their examples, but if you're using epoll(2) … I worry you need to know that the socket has been recreated & your epoll triggers need to be recreated. And asyncio is going to be using epoll(2).