chriskohlhoff / asio

Asio C++ Library
http://think-async.com/Asio
4.72k stars 1.19k forks

[Discussion] Stackful vs stackless coroutines #1408

Open pfeatherstone opened 5 months ago

pfeatherstone commented 5 months ago

Since there isn't a discussion tab, I'm having to post an issue.

Does anyone have opinions on the use of stackful vs stackless coroutines in asio? What are some of the discussion points?

My guess is that C++20 coroutines are more lightweight. However, they might do more allocations? So in a high-performance server handling millions of HTTP requests per second, C++20 coroutines might not be a good choice? I don't know, I'm asking.

It looks like asio has really good tooling for C++20 coroutines, for example co_awaiting multiple awaitables. That's a nice feature, since those operations can presumably run in parallel. I'm not sure, but I don't think that's so effortlessly done with asio's stackful coroutines.

awaitable is an implicit strand. Does the same hold for basic_yield_context?

Basically, does someone have a good story to tell regarding this?

Also, Boost::Cobalt provides some library coroutines. Has anyone tried using them with Asio? Why should I use them instead of asio's awaitable or experimental::coro coroutines?

justend29 commented 5 months ago

@pfeatherstone I've had great experience with stackless coroutines. I use them both stand-alone and with Beast for HTTP. The usability is much higher than that of stackful ones.

Certain blog posts I've read have indicated marginal benefits of stackless over stackful coroutines in terms of throughput and response times, but these are all use-case-dependent. Benchmark your fast paths with the two. I'd assume memory usage with stackful coroutines is slightly higher, as the heuristics to pre-allocate the "stack" are often a bit eager, but it's possible that reduces the total number of allocations. Additionally, custom allocators can be used in any asynchronous operation; a pool allocator could help for stackless coroutines.

The implicit strand is not a consequence of some internal magic, it's merely a result of continuations. Conceptually, an awaitable's continuation is the code below its co_await call. Any chain of continuations is a "strand" because the subsequent completion handler cannot be started before the preceding operation is done, regardless of its associated executor. That applies to all completion chains.

I don't have experience with cobalt, as it just came out, but it uses a single-threaded io_context as the only executor, which is disadvantageous for my uses.

klemens-morgenstern commented 5 months ago

I am the author of boost.cobalt and boost.experimental.coro, so obviously a fan of C++20 coros.

Overview of utilities

The stackful coroutines are implemented on top of boost.context, which provides the assembly needed for context switching. boost::asio::spawn/yield_context is similar to boost::asio::co_spawn/awaitable, in the sense that it gives you a simple coroutine that can await async operations. The first is based on boost.context, the second on C++20 coros.

experimental.coro is different in that it's essentially its own I/O object. I.e., when you call .async_resume you turn a coroutine resumption into an async operation.

boost.cobalt holds a similar place to boost.fiber in that it can interact with asio (technically cobalt depends on asio, whereas fiber doesn't), but it has its own synchronization mechanisms that are optimized for this usage. The best example is the channel, which both libraries have. boost.fiber is written much more to look & feel like threads, whereas cobalt is meant to look & feel like async/await in Python or JavaScript.

Context switching

Regarding performance: I benchmarked on Linux, and a context switch with boost.context was 2.1x (gcc) or 5.2x (clang) slower than a C++20 coroutine. Plus boost.context is (necessarily) assembly, so it's opaque to the compiler. That is, it can't be optimized out, whereas C++20 coroutines can be and sometimes are, although rarely. The post benchmark might be the best comparison for usage with asio.

Code style

A major difference between the two kinds is how the suspension takes place. C++20 coroutines require the use of the co_await keyword. I quite like this because it tells you something is async, which makes reading the code much easier.

awaitable<void> coro()
{
    foo();
    co_await bar(); // co_await screams that this is async
}

void coro(yield_context yield_)
{
    foo();
    bar(yield_); // idk, is this async?
}

That also means you could make async optional with yield_context if you so desire.


thread_local asio::yield_context * yield_ = nullptr;

std::size_t my_read(socket & s, asio::mutable_buffer buf)
{
   if (yield_)
      return s.async_read_some(buf, *yield_); // suspends the coroutine until the read completes
   else
      return s.read_some(buf);                // plain blocking read
}

Stack depth

Because stackful coroutines are on a stack it means you can nest calls & suspensions without any overhead.

std::size_t do_read(socket & s, asio::mutable_buffer buf, yield_context & ctx) // it's essentially a regular function
{
   return s.async_read_some(buf, ctx);
}

This is relevant for your API design. Note, however, that the coroutine's stack is not very large by default.

Because C++20 coros are stackless, you can nest calls far deeper than with a stackful one.

asio::awaitable<void> recurse_for_no_reason(std::size_t n)
{
   if (n > 0)
      co_await recurse_for_no_reason(n - 1);
}

Since each coroutine frame lives on the heap, this will lead to NO native-stack buildup.

Allocations

The downside of C++20 coros is the need to allocate their frame. The allocation can be customized (boost.cobalt does this) and is cached with asio::awaitable, but it might still be a concern.

asio::experimental::coro<void, std::size_t> do_read(socket & s) 
{
   co_return co_await s.async_read(asio::deferred); 
}

Now every co_await do_read(sock) will add a frame allocation (unless optimized out, which is unlikely at this point).

You can work around this by using yielding coros, so that you have one allocation up front.

asio::experimental::coro<std::size_t, std::size_t> reader(socket & s) 
{
   while (true)
      co_yield co_await s.async_read(asio::deferred); 
   co_return -1;
}

And if you need to update arguments half way through you can use push-args:

asio::experimental::coro<std::size_t(socket &), std::size_t> reader(socket & s)
{
   auto * s_ = &s;
   while (true)
      s_ = &(co_yield co_await s_->async_read(asio::deferred));
   co_return -1;
}

justend29 commented 5 months ago

Wow, that was beautiful