MagicStack / asyncpg

A fast PostgreSQL Database Client Library for Python/asyncio.
Apache License 2.0

Many concurrent requests block the event loop #1092

Open gongy opened 11 months ago

gongy commented 11 months ago

Reproduction (for simplicity, against a local Docker instance of Postgres).

import asyncpg

import asyncio
import time

async def main():
    # In debug mode, report any callback that keeps the event loop busy for more than 50ms.
    asyncio.get_running_loop().slow_callback_duration = 0.05

    pg = await asyncpg.create_pool(
        user="postgres",
        password="",
        database="defaultdb",
        host="localhost",
        port="5432",
    )

    await pg.execute(
        """
        CREATE TABLE IF NOT EXISTS my_table (
            id VARCHAR(20) PRIMARY KEY,
            value VARCHAR(255)
        );
        INSERT INTO my_table (id, value) VALUES ('id-123', '4')
        ON CONFLICT (id) DO NOTHING;
        """
    )

    query = "UPDATE my_table SET value = '4';"

    for i in range(25000):

        async def go():
            for _ in range(10):
                async with pg.acquire() as conn:
                    async with conn.transaction():
                        await conn.execute(query)

        asyncio.create_task(go())  # schedule without awaiting; a new task every iteration

        t0 = time.time()
        await asyncio.sleep(0.001)
        elapsed_ms = (time.time() - t0) * 1000

        if elapsed_ms > 50:
            print(f">>> {i} took {elapsed_ms}ms")

# import uvloop

# asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

asyncio.run(main(), debug=True)

Output:

Executing <Task pending name='Task-1' coro=<main() running at /home/ubuntu/modal/analytics/asyncpg_blocking.py:42> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f975b2e7d90>()] created at /usr/lib/python3.9/asyncio/base_events.py:429> cb=[_run_until_complete_cb() at /usr/lib/python3.9/asyncio/base_events.py:184] created at /usr/lib/python3.9/asyncio/base_events.py:626> took 0.061 seconds
>>> 3851 took 62.56461143493652ms

Executing <Task pending name='Task-9969' coro=<main.<locals>.go() running at /home/ubuntu/modal/analytics/asyncpg_blocking.py:35> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f975bb1f190>()] created at /usr/lib/python3.9/asyncio/base_events.py:429> created at /usr/lib/python3.9/asyncio/tasks.py:361> took 0.095 seconds
>>> 4921 took 96.0381031036377ms

Executing <Handle _SelectorSocketTransport._read_ready() created at /usr/lib/python3.9/asyncio/selector_events.py:254> took 0.126 seconds
>>> 5990 took 128.0958652496338ms

Executing <Task pending name='Task-1' coro=<main() running at /home/ubuntu/modal/analytics/asyncpg_blocking.py:42> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f9759c54490>()] created at /usr/lib/python3.9/asyncio/base_events.py:429> cb=[_run_until_complete_cb() at /usr/lib/python3.9/asyncio/base_events.py:184] created at /usr/lib/python3.9/asyncio/base_events.py:626> took 0.138 seconds
>>> 7064 took 139.1909122467041ms

Executing <Task pending name='Task-1' coro=<main() running at /home/ubuntu/modal/analytics/asyncpg_blocking.py:42> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f975bfd2580>()] created at /usr/lib/python3.9/asyncio/base_events.py:429> cb=[_run_until_complete_cb() at /usr/lib/python3.9/asyncio/base_events.py:184] created at /usr/lib/python3.9/asyncio/base_events.py:626> took 0.166 seconds
>>> 8125 took 167.76275634765625ms

Executing <Handle _SelectorSocketTransport._read_ready() created at /usr/lib/python3.9/asyncio/selector_events.py:254> took 0.193 seconds
>>> 9282 took 194.34309005737305ms
Executing <Task pending name='Task-21463' coro=<main.<locals>.go() running at /home/ubuntu/modal/analytics/asyncpg_blocking.py:35> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f975902e3a0>()] created at /usr/lib/python3.9/asyncio/base_events.py:429> created at /usr/lib/python3.9/asyncio/tasks.py:361> took 0.232 seconds
>>> 10635 took 232.7420711517334ms

Executing <Handle _SelectorSocketTransport._read_ready() created at /usr/lib/python3.9/asyncio/selector_events.py:254> took 0.256 seconds
>>> 12092 took 257.3966979980469ms

Executing <Task pending name='Task-27750' coro=<PoolConnectionHolder.release() running at /home/ubuntu/modal/asyncpg/asyncpg/pool.py:198> wait_for=<Future pending cb=[Protocol._on_waiter_completed(), <TaskWakeupMethWrapper object at 0x7f975b3fc8b0>()] created at /usr/lib/python3.9/asyncio/base_events.py:429> cb=[shield.<locals>._inner_done_callback() at /usr/lib/python3.9/asyncio/tasks.py:890] created at /usr/lib/python3.9/asyncio/tasks.py:883> took 0.292 seconds
>>> 13746 took 293.03622245788574ms

Executing <Task pending name='Task-31496' coro=<main.<locals>.go() running at /home/ubuntu/modal/analytics/asyncpg_blocking.py:35> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f975bb1fe20>()] created at /usr/lib/python3.9/asyncio/base_events.py:429> created at /usr/lib/python3.9/asyncio/tasks.py:361> took 0.321 seconds
>>> 15671 took 321.78592681884766ms

Executing <Task pending name='Task-1' coro=<main() running at /home/ubuntu/modal/analytics/asyncpg_blocking.py:42> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f975ae52ca0>()] created at /usr/lib/python3.9/asyncio/base_events.py:429> cb=[_run_until_complete_cb() at /usr/lib/python3.9/asyncio/base_events.py:184] created at /usr/lib/python3.9/asyncio/base_events.py:626> took 0.375 seconds
>>> 17901 took 376.0251998901367ms

Executing <Task pending name='Task-21270' coro=<main.<locals>.go() running at /home/ubuntu/modal/analytics/asyncpg_blocking.py:37> wait_for=<Future pending cb=[shield.<locals>._outer_done_callback() at /usr/lib/python3.9/asyncio/tasks.py:907, <TaskWakeupMethWrapper object at 0x7f975787ac40>()] created at /usr/lib/python3.9/asyncio/base_events.py:429> created at /usr/lib/python3.9/asyncio/tasks.py:361> took 0.418 seconds
>>> 20313 took 418.8547134399414ms

Executing <Task pending name='Task-46404' coro=<PoolConnectionHolder.release() running at /home/ubuntu/modal/asyncpg/asyncpg/pool.py:198> wait_for=<Future pending cb=[Protocol._on_waiter_completed(), <TaskWakeupMethWrapper object at 0x7f975504dc70>()] created at /usr/lib/python3.9/asyncio/base_events.py:429> cb=[shield.<locals>._inner_done_callback() at /usr/lib/python3.9/asyncio/tasks.py:890] created at /usr/lib/python3.9/asyncio/tasks.py:883> took 0.486 seconds
>>> 23108 took 487.2567653656006ms

Further investigation

Adding verbose prints to protocol.pyx led me to chase down one particular 80ms+ blocking execution, which ended at waiter.set_result(...) in _on_result__simple_query; that call accounted for the majority of the slow callback (150ms out of 151ms, for example). After this, I wasn't sure how to continue debugging -- open to suggestions or ideas here.
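
(The timing prints were along these lines -- shown here as an illustrative, self-contained Python helper rather than the actual Cython edits; timed_call and the names in the example comment are just stand-ins:)

import time

def timed_call(label, fn, *args, threshold_ms=50):
    # Print a line whenever the wrapped call exceeds the threshold.
    t0 = time.monotonic()
    result = fn(*args)
    elapsed_ms = (time.monotonic() - t0) * 1000
    if elapsed_ms > threshold_ms:
        print(f">>> {label} took {elapsed_ms:.1f}ms")
    return result

# e.g. timed_call("waiter.set_result", waiter.set_result, value)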

Removing debug=True

The issue is still present, albeit less frequent, with debug mode turned off:

>>> 11807 took 86.72428131103516ms
>>> 16438 took 117.77758598327637ms
>>> 21042 took 172.47319221496582ms

Thanks all!

racinette commented 11 months ago

I think it's due to too many connections being requested from the pool at the same time. You should increase your pool size according to your load, so that requests for connections don't queue up.

Execution stalls because every async with pool.acquire() block wants a connection from the pool, and when none are left, it waits until one becomes available.

asyncpg defaults to a maximum of 10 connections in the pool, I think? That's definitely too few for 10000-ish concurrent requests :)
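
Something like this, for example (a sketch reusing the connection parameters from the repro above; the sizes are just illustrative):

import asyncpg

async def make_pool():
    # min_size connections are opened up front; acquire() waits once all
    # max_size connections are checked out.
    return await asyncpg.create_pool(
        user="postgres",
        password="",
        database="defaultdb",
        host="localhost",
        port="5432",
        min_size=10,
        max_size=100,
    )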

gongy commented 11 months ago

The issue is reproducible with min_size=100, max_size=100. Increasing the size of the pool is not a feasible workaround here.

Crucially, awaiting a new connection from the pool should not block the asyncio event loop -- that is, cause asyncio.sleep(0.001) to take more than 100ms. This would affect everything else running on the event loop (e.g. a production server).

elprans commented 11 months ago

Crucially, awaiting a new connection from the pool should not block the asyncio event loop

It doesn't. In your repro case all coroutines are blocked on the acquire() path. Start an independent task that does not depend on a connection to verify.
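
For example, something like the following (a minimal sketch) never touches the pool, so it measures how quickly the loop gets back to a coroutine that only sleeps:

import asyncio
import time

async def heartbeat():
    # Independent of asyncpg: this coroutine only sleeps, so any large gap it
    # reports is time the event loop spent elsewhere before resuming it.
    while True:
        t0 = time.monotonic()
        await asyncio.sleep(0.001)
        lag_ms = (time.monotonic() - t0) * 1000
        if lag_ms > 50:
            print(f"loop lag: {lag_ms:.1f}ms")

# run it alongside the workload: asyncio.create_task(heartbeat())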

ekzhang commented 11 months ago

Hi @elprans -- in the code sample, the task that acquires a connection from the pool is independent of the one calling asyncio.sleep().

The time measurement in the code block is taken between the instants immediately before and after asyncio.sleep(0.001), with no database operations on that task, and it comes out to over 100 ms in our reproduction.

We believe we are seeing asyncpg block the event loop and actually prevent other coroutines from making progress.

ekzhang commented 11 months ago
        t0 = time.time()
        await asyncio.sleep(0.001)
        elapsed_ms = (time.time() - t0) * 1000

        if elapsed_ms > 50:
            print(f">>> {i} took {elapsed_ms}ms")

^ If it helps, this is the code segment that I'm referring to. This prints increasing values up to 480 ms when running our minimal reproduction script, even though we only sleep for 1 ms. If I'm not mistaken, this coroutine is independent of the acquire() path — does that clarify the bug report?

elprans commented 11 months ago

I'm still not sure what you are referring to. If you create thousands of tasks like this, then the event loop might indeed take a long time to get back to you after that sleep(). But asyncpg itself does not "block" the event loop in any way. If it did, then you would get asyncio debug messages about acquire or some other internal coroutine or callback occupying the CPU for a long time, and I don't see that here.

elprans commented 11 months ago

(this is also indirectly confirmed by the fact that disabling debug helps, as the event loop has lower overhead then).

ekzhang commented 11 months ago

Thanks for the comments. We have been seeing issues where asyncpg itself seems to block the event loop even when we aren't backed up on many asyncio tasks at once.

I can see why this minimal reproduction is confusing, though, since it creates so many tasks. I'll try to provide a better example.

you would get asyncio debug messages about acquire or some other internal coroutine or callback occupying the CPU for a long time, and I don't see that here.

We understand this, but is it possible that the stack trace is obscured by the Cython boundary? It seems very unlikely that set_result() on an asyncio object would itself block for 100 milliseconds; a quick search suggested it could be the other end of that handler -- the code waiting on the result -- that blocked for 100 milliseconds.

elprans commented 11 months ago

It seems very unlikely that set_result() on an asyncio object would itself block for 100 milliseconds; a quick search suggested it could be the other end of that handler -- the code waiting on the result -- that blocked for 100 milliseconds.

There are no busy loops anywhere in asyncpg; all waits are awaits, so you are at the mercy of the event loop. We are talking about a single-threaded iterate-over-a-bunch-of-stuff loop. If there is a blocking pathology somewhere in the asyncpg code, PYTHONASYNCIODEBUG=1 should clearly point it out under relatively low load. Your repro does not seem to demonstrate that, as there is a variety of "slow tasks" in your debug output, including _SelectorSocketTransport._read_ready(), which indicates an overloaded event loop and nothing else.
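
To be concrete about set_result(): it does not run the waiting coroutine inline, it only schedules the future's done callbacks, and the loop runs them on a later iteration. A standalone illustration (plain asyncio, not asyncpg code):

import asyncio

async def main():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    fut.add_done_callback(lambda f: print("done callback runs on a later loop iteration"))
    fut.set_result(42)      # returns immediately; the callback is only scheduled
    print("set_result() has already returned")
    await asyncio.sleep(0)  # yield so the loop can run the scheduled callback

asyncio.run(main())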

ekzhang commented 11 months ago

If there is a blocking pathology somewhere in the asyncpg code, PYTHONASYNCIODEBUG=1 should clearly point it out under relatively low load. Your repro does not seem to demonstrate that, as there is a variety of "slow tasks" in your debug output, including _SelectorSocketTransport._read_ready(), which indicates an overloaded event loop and nothing else.

Thanks for the help and your pointers! We tried to reproduce it again and have confirmed that, as you said, if we use a semaphore or otherwise limit concurrency (keeping len(asyncio.all_tasks()) below 500), we can't reproduce the issue in any of our asyncpg tests. The reproduction we sent was flawed, and you're right: its delayed tasks were due to the event loop being overloaded.
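
For reference, the bounded variant of the repro looks roughly like this (a sketch; the cap of 500 is illustrative, and it assumes my_table from the original script already exists):

import asyncio
import asyncpg

CONCURRENCY = 500  # keeps len(asyncio.all_tasks()) bounded

async def main():
    pg = await asyncpg.create_pool(
        user="postgres", password="", database="defaultdb",
        host="localhost", port="5432", min_size=100, max_size=100,
    )
    query = "UPDATE my_table SET value = '4';"
    sem = asyncio.Semaphore(CONCURRENCY)

    async def go():
        try:
            for _ in range(10):
                async with pg.acquire() as conn:
                    async with conn.transaction():
                        await conn.execute(query)
        finally:
            sem.release()

    for _ in range(25000):
        await sem.acquire()  # wait for a slot before creating another task
        asyncio.create_task(go())

asyncio.run(main())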

We can still reproduce an issue we saw when using our ORM on top of asyncpg under low load, but that's not an issue with asyncpg itself; we misattributed it to asyncpg when we filed this issue.

Thanks so much for your help.