Timeout configuration for multi-host config

asyncpg version: 0.29.0
PostgreSQL version: ANY
Do you use a PostgreSQL SaaS? If so, which? Can you reproduce the issue with a local PostgreSQL install?: Can reproduce locally
Python version: 3.11, but should not matter much
Platform: macOS, Linux
Do you use pgbouncer?: No
Can the issue be reproduced under both asyncio and uvloop?: Yes

I think the same or similar issue was already raised before, but it was closed so I decided to create a new one and provide a bit more details and reproducers.

I opened a discussion about this almost a year ago which has some details as well, but I never followed up. I would really love to get some attention to this now as we'll go through the same database upgrade steps again soon and I would like to avoid patching the library locally.

In a nutshell, the issue is that the existing timeout does not work well in certain conditions with multi-host setup: when the first host can be resolved, as in given it's DNS name, one can get IP address, but the host is nevertheless unreachable and connection cannot be established. In this case the whole timeout might be used up waiting for a connection to the first host and there would never be an attempt to connect to the second host.

In the linked ticket I described one case when this actually works by configuring a fairly large timeout of 80 seconds. In this case the first connection attempt times out after 75 seconds and there's still 5 more seconds to attempt to connect to the second host. After some googling I found that these 75 seconds seem to be TCP SYN wait-timer default.

I have made a reproducer that tests both cases and does not require actual database:

import asyncio
import time

import asyncpg
from asyncpg import Connection

async def connect(timeout: int) -> Connection:
    return await asyncpg.connect(
        "postgresql://",
        user="postgres",
        password="",
        host=[
            # Non-routable IP address to simulate connect timeout
            "10.255.255.1",
            # It does not really matter if we have working database
            # for the same of the test as long as we see that there was an attempt to connect
            "unknown-host-name",
        ],
        port=[5432, 45432],
        database="postgres",
        server_settings={
            "search_path": "public",
            "statement_timeout": "60000",
        },
        timeout=timeout,
    )

async def main(timeout: int):
    print("Connecting with timeout", timeout)
    start = time.time()
    try:
        await connect(timeout)
    except TimeoutError:
        end = time.time()
        print("Issue reproduced: connection timeout")
        print(f"Elapsed time: {end - start}\n\n")
    except OSError:
        end = time.time()
        print("No issue: attempted to connect to the second host as expected")
        print(f"Elapsed time: {end - start}\n\n")

asyncio.run(main(timeout=5))
asyncio.run(main(timeout=80))

I have a similar test that uses psycopg2 (though it uses SQLAlchemy, but I can rewrite it with pure psycopg2 if needed) that demonstrates how this case is handled there:

from time import time

from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

def create_session(timeout: int):
    engine = create_engine(
        f"postgresql://",  # Only specify which driver we want to use
        connect_args={
            "user": "postgres",
            "dbname": "postgres",
            "host": "10.255.255.1,10.255.255.1",  # We specify multiple hosts here, both unreachable
            "port": "5432,5432",  # We specify multiple ports here: a port for each host above
            "connect_timeout": timeout,
        },
    )

    return sessionmaker(autoflush=False, bind=engine, expire_on_commit=False)()

def main(timeout: int):
    print("Connecting with timeout", timeout)
    start = time()
    try:
        with create_session(2) as session:
            # This will print the Docker container ID, because it's used as the hostname inside Docker network
            print(
                "Database host:",
                session.execute(text("select pg_read_file('/etc/hostname') as hostname;")).one()[0].strip(),
            )
        end = time()
        print("Elapsed time", end - start)
    except Exception as e:
        end = time()
        print("Failed to connect")
        print("Elapsed time", end - start)

main(2)

This case is slightly different, because I have specified both host as unreachable for the same of demonstrating that the overall time it takes to attempt a connect is twice the timeout specified. This means that psycopg2 applies the same timeout when connecting to each host, not an overall timeout.

Whether this is desirable or not is hard to say, I can imagine having an overall connection timeout might be a good thing as well, so maybe we could have two timeouts.

I'm happy to help with the PR to resolve this once it's clear how we want to have this fixed.

MagicStack / asyncpg

Timeout configuration for multi-host config #1155