dabeaz / curio


Spawning many coroutines efficiently #360

Closed · dg-pb closed 1 year ago

dg-pb commented 1 year ago

Just a quick question/observation

I am benchmarking several concurrency libraries (both stackless and stackful).

My test is very simple: each task performs 100 sleeps of 0.01 s, so its lower bound, i.e. the best possible result, is 1 second.

import os
import datetime as dtt
import curio

async def task(number):
    for i in range(100):
        await curio.sleep(0.01)

async def main(name, prespawn):
    name = name + ('/prespawn' if prespawn else '')
    results = list()
    for N in [10, 100, 1000, 10000]:
        start = dtt.datetime.utcnow()
        if prespawn:
            tasks = [await curio.spawn(task, i) for i in range(N)]
            await curio.TaskGroup(tasks, wait=all).join()
        else:
            async with curio.TaskGroup() as g:
                await g.spawn(task, list(range(N)))
        end = dtt.datetime.utcnow()
        results.append((end - start).total_seconds())

    results = [f'{r:0>5.2f}' for r in results]
    string = f'{name}: {" ".join(results)}'
    print(string)

curio.run(main, 'curio', True)
curio.run(main, 'curio', False)

# curio/prespawn:   01.17 01.19 01.14 62.06
# curio:            01.11 01.11 01.12 01.15

My question is: why is pre-spawning the tasks so much slower than spawning them from within the TaskGroup context?

To give a comparison with other libraries: asyncio

...
            if prespawn:
                tasks = [asyncio.create_task(task(i)) for i in range(N)]
                await asyncio.wait(tasks)
            else:
                await asyncio.gather(*[task(i) for i in range(N)])
...
# asyncio/uvloop:           01.13 01.14 01.16   05.88
# asyncio/uvloop/prespawn:  01.22 01.15 01.17   05.51

Trio

...
            async with trio.open_nursery() as nursery:
                for i in range(N):
                    nursery.start_soon(task)
...
# trio:                     01.15 01.12 03.28   49.50

It seems that curio is doing something extremely well with TaskGroup - the results are phenomenal for 10000 tasks. But something is not working quite right when pre-spawning tasks. The trio library seems to have the same issue as the number of tasks increases; asyncio does not seem to suffer from it.

So my question is: is this expected behaviour? Am I not doing something right? Or is there something to be looked at?

It looks like TaskGroup.spawn directly calls curio.spawn, so I can't see why the results differ so much.

nocturn9x commented 1 year ago

I'm quite confused by the number argument to task. What is it for?

nocturn9x commented 1 year ago

Also, keep in mind that asyncio does a lot less task-management work than a structured concurrency library like trio (or curio when using task groups).

nocturn9x commented 1 year ago

Also, just wanted to chime in with the results from my own experimental library:

import os
import datetime as dtt
import structio

async def task():
    for i in range(100):
        await structio.sleep(0.01)

async def main(name):
    results = []
    for N in [10, 100, 1000, 10000]:
        start = dtt.datetime.utcnow()
        async with structio.create_pool() as p:
            for _ in range(N):
                p.spawn(task)
        end = dtt.datetime.utcnow()
        results.append((end - start).total_seconds())

    results = [f'{r:0>5.2f}' for r in results]
    string = f'{name}: {" ".join(results)}'
    print(string)

structio.run(main, 'structio')
# structio: 01.00 01.01 02.25 64.57

btw, the library is heavily inspired by both trio and curio :) (but still in the early developmental stages!)

Another question I have is whether this is representative of any real-world workload or if it's just a synthetic benchmark (numbers for those usually aren't that useful!)

dabeaz commented 1 year ago

I have no idea, especially given that TaskGroup.spawn() directly calls spawn().

dabeaz commented 1 year ago

The only thing weird I see is something in your test where you're not passing a number to task() but a list of numbers.

await g.spawn(task, list(range(N)))

You sure that the task is actually running correctly and not crashing out early with an exception? (possibly killing the whole task group early).

dg-pb commented 1 year ago

> The only thing weird I see is something in your test where you're not passing a number to task() but a list of numbers.
>
> await g.spawn(task, list(range(N)))
>
> You sure that the task is actually running correctly and not crashing out early with an exception? (possibly killing the whole task group early).

That is by design. In the way I do things it is usually a task_id, and I always have it as the first argument. In the previous benchmarks I used it as for i in range(number) for an exponential effect, but removed that for the clarity of having an expected 1 s. I can assure you it doesn't have any impact.

dg-pb commented 1 year ago

If we are sharing our own experimental libraries... I actually started from yield-based coroutines, following David's fairly old notes. The results are as follows:

def task():
    for i in range(100):
        yield Sleep(0.01)

def main(N, prespawn, task):
    yield Group([task() for _ in range(N)])

for N in [10, 100, 1000, 10000]:
    loop = Loop()
    loop.add(main(N, prespawn, task), tid='main').run()

# 01.10 01.10  01.11   04.71

I will try to find out how on earth curio manages to get 1 s for 10k calls, and maybe also find what is causing that exponential increase in the prespawned case. Interesting stuff.

To be honest, I am still considering sticking with yield; so far I have been able to implement everything I needed, and the overhead seems to be lower than that of async/await.
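
For context, here is a minimal sketch of what such a yield-based loop can look like (Sleep and Loop named as above; Group, error handling and I/O are omitted, so this is an illustration rather than my actual implementation):

import heapq
import time
from collections import deque

class Sleep:
    def __init__(self, seconds):
        self.seconds = seconds

class Loop:
    def __init__(self):
        self._ready = deque()  # generators ready to run
        self._sleeping = []    # heap of (wake_time, seq, generator)
        self._seq = 0          # tie-breaker so the heap never compares generators

    def add(self, gen, tid=None):
        self._ready.append(gen)
        return self

    def run(self):
        while self._ready or self._sleeping:
            if not self._ready:
                # Nothing runnable: wait until the earliest wake-up time.
                wake, _, gen = heapq.heappop(self._sleeping)
                time.sleep(max(0.0, wake - time.monotonic()))
                self._ready.append(gen)
            gen = self._ready.popleft()
            try:
                request = next(gen)  # run the task until its next yield
            except StopIteration:
                continue             # task finished
            if isinstance(request, Sleep):
                self._seq += 1
                heapq.heappush(self._sleeping,
                               (time.monotonic() + request.seconds, self._seq, gen))
            else:
                self._ready.append(gen)  # unknown request: just reschedule

With the task() generator above, Loop().add(task()).run() completes in roughly a second, as expected.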

nocturn9x commented 1 year ago

> If we are sharing our own experimental libraries... I actually started from yield-based coroutines, following David's fairly old notes. The results are as follows:
>
> def task():
>     for i in range(100):
>         yield Sleep(0.01)
>
> def main(N, prespawn, task):
>     yield Group([task() for _ in range(N)])
>
> for N in [10, 100, 1000, 10000]:
>     loop = Loop()
>     loop.add(main(N, prespawn, task), tid='main').run()
>
> # 01.10 01.10  01.11   04.71
>
> I will try to find out how on earth curio manages to get 1 s for 10k calls, and maybe also find what is causing that exponential increase in the prespawned case. Interesting stuff.
>
> To be honest, I am still considering sticking with yield; so far I have been able to implement everything I needed, and the overhead seems to be lower than that of async/await.

Could it be that the extra bookkeeping required by task groups is what's causing the problem?

dabeaz commented 1 year ago

Actually, the task group example as shown is only creating a single task! So, that's probably it ;-).
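
To spell the bug out: g.spawn(task, list(range(N))) passes the whole list as the single number argument of one task, so only one coroutine ever runs, no matter what N is. A sketch of the presumably intended non-prespawn branch spawns one task per id:

async with curio.TaskGroup() as g:
    for i in range(N):
        await g.spawn(task, i)  # N tasks, one id each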

dg-pb commented 1 year ago

Thank you. I logged the task_ids to a file and it was clear. Everything is as expected now.

Also, it seems that most of the time is spent rescheduling tasks. This line in particular: https://github.com/dabeaz/curio/blob/master/curio/kernel.py#L711
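
For reference, a minimal version of that check (my sketch, not necessarily the exact code used): have each task append its id to a log file on entry and count the distinct lines afterwards. With the buggy spawn the file ends up with a single line containing the whole list; with the fixed one it has N lines.

import curio

LOG = 'task_ids.log'  # hypothetical file name

async def task(number):
    # Record that this particular task actually started.
    with open(LOG, 'a') as f:
        f.write(f'{number}\n')
    for i in range(100):
        await curio.sleep(0.01)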

nocturn9x commented 1 year ago

> Thank you. I logged the task_ids to a file and it was clear. Everything is as expected now.
>
> Also, it seems that most of the time is spent rescheduling tasks. This line in particular: https://github.com/dabeaz/curio/blob/master/curio/kernel.py#L711

I'd say that's completely expected, considering those tasks are doing nothing the entire time other than getting rescheduled every 1/100th of a second :)

dg-pb commented 1 year ago

Agreed. But, to me, it is still unclear why it spikes so much: it takes 0.15 s (1.15 - 1) to schedule and poll 100_000 times, and 80 s (81.15 - 1) to schedule and poll 1_000_000 times. I checked the number of times reschedule_task is called - all as expected (roughly N × 100 sleeps plus bookkeeping):

1013 (N = 10)
10103 (N = 100)
101022 (N = 1000)
1010004 (N = 10000)

dg-pb commented 1 year ago

I think I have found it. The problem is very simple:

import time
from collections import deque

for i in [1013, 10103, 101022, 1010004]:
    q = deque()  # also tried with list and dict
    start = time.perf_counter()
    for j in range(i):
        q.append('a' * 5000)
    print(f'{time.perf_counter() - start:0.2f}')

# deque   [0.02, 0.22, 2.46, 45.05]
# list    [0.02, 0.23, 2.48, 39.92]
# dict    [0.02, 0.25, 2.80, 37.03]

So although deque has some overhead compared to list/dict, I think it is the actual storing that causes this. My best guess is that the Task objects of other frameworks are more lightweight, as this largely depends on object size. Especially once the size reaches a certain point, the slowdown becomes non-linear.
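
One quick way to probe the size hypothesis (my own sketch, reusing the counts from above): run the identical append loop with a tiny payload and with the 5000-byte one, and see whether the blow-up tracks the payload rather than the container:

import time
from collections import deque

def bench(payload_size, counts=(1013, 10103, 101022, 1010004)):
    results = []
    for n in counts:
        q = deque()
        start = time.perf_counter()
        for _ in range(n):
            q.append('a' * payload_size)
        results.append(round(time.perf_counter() - start, 2))
    return results

print('1-byte payload:   ', bench(1))     # single characters are cached: container cost only
print('5000-byte payload:', bench(5000))  # allocates ~5 GB at the largest count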

That said, I think this benchmark only covers a very specific corner case.