Investigate whether we can reduce amount of copying of parameters in large sender expressions

cplusplus / sender-receiver

Issues list for P2300

Apache License 2.0

14 stars 3 forks source link

When building a sender expression, the evaluation of the expression first builds the leaf senders, then passes these as arguments to higher-level senders, which needs to move-construct them into data-members of itself, which might then be passed to another sender algorithm which then needs to move-construct that whole sub-tree of senders into its data-member, etc.

In the end, a leaf operation (and any state it holds) will be move-constructed O(depth) times when incorporated into a sender-tree of a given depth. As sender-trees typically get wider as they go deeper, the number of move-operations needed to build the final tree can be quadratic as the tree is built.

Further, if the tree is built as a single expression, then all of the temporary intermediate senders still exist until the end of the full-expression and so the amount of stack space needed to create a large expression tree can potentially grow quadratically in the size of the expression tree.

For example: consider a binary tree of when_all() operations where leaves are just(X{}):

// For example, a tree of depth 3
when_all(
  when_all(just(X{}), just(X{})),
  when_all(just(X{}), just(X{})));

Depth	`just` senders in final tree	`when_all` senders in final tree	`X` temporaries/moves
1	1	0	1
2	2	1	4
3	4	3	12
4	8	7	32
5	16	15	80

If this sender-expression was then passed to a co_await or sync_wait expression, then this results in the operation-state objects being constructed, which then generally moves the state from the final sender tree into an operation-state tree.

However, with the introduction of transform_sender()-based customization, this can result in moving the entire tree again (the default_domain::transform_sender() function returns a new prvalue sender rather than just returning the input reference.

Further, if the pipe syntax is used, then the number of moves of each leaf argument in a sender tree generally increases by 1 as you first need to move the object into a temporary adaptor object, before then moving it into the initial leaf sender.

We should see if we can reduce the number of intermediate objects required for building large trees, if possible, by directly constructing senders using aggregate initialization.

As a data-point, in the following snippet in the stdexec implementation, the X move-constructor is called 5 times!

#include <stdexec/execution.hpp>
#include <cstdio>

struct X {
    X() noexcept { std::puts("X::X()"); }
    X(X&&) noexcept { std::puts("X::X(X&&)"); }
    ~X() noexcept { std::puts("X::~X()"); }
};

int main() {
    auto result = stdexec::sync_wait(stdexec::just(X{}));
    std::puts("done");
}

temporary X() is moved into a temporary tuple{values...} inside just()
which is then moved into the just sender as the state
sync_wait
- calls stdexec::apply_sender(default_domain{}, sender, env)
- which calls default_domain::apply_sender(sender, env)
- which calls sync_wait_t::apply_sender(sender, env)
- which calls stdexec::connect() to produce an op-state
- which calls stdexec::transform_sender(default_domain(), sender, env)
- which calls default_domain::transform_sender(sender, env)
- which returns a new just sender move-constructed from the original
then stdexec::connect calls .connect() on this new just sender which moves the value into the operation-state
then sync_wait() calls start() which invokes the sync_wait receiver's set_value which moves the result into an optional object, which is then returned from sync_wait().

We could reduce this to 4 moves by fixing the implementation of transform_sender() to not return copies in the case that it is a no-op (bug filed at https://github.com/NVIDIA/stdexec/issues/1329).

We could reduce this to 2 moves if we can support constructing the just sender by aggregate initialization instead of having to move the value into the tuple and then move the tuple into the just sender.

cplusplus / sender-receiver

Investigate whether we can reduce amount of copying of parameters in large sender expressions #225