cplusplus / sender-receiver

Issues list for P2300
Apache License 2.0
14 stars 3 forks source link

Investigate whether we can reduce amount of copying of parameters in large sender expressions #225

Open lewissbaker opened 1 month ago

lewissbaker commented 1 month ago

When building a sender expression, the evaluation of the expression first builds the leaf senders, then passes these as arguments to higher-level senders, which needs to move-construct them into data-members of itself, which might then be passed to another sender algorithm which then needs to move-construct that whole sub-tree of senders into its data-member, etc.

In the end, a leaf operation (and any state it holds) will be move-constructed O(depth) times when incorporated into a sender-tree of a given depth. As sender-trees typically get wider as they go deeper, the number of move-operations needed to build the final tree can be quadratic as the tree is built.

Further, if the tree is built as a single expression, then all of the temporary intermediate senders still exist until the end of the full-expression and so the amount of stack space needed to create a large expression tree can potentially grow quadratically in the size of the expression tree.

For example: consider a binary tree of when_all() operations where leaves are just(X{}):

// For example, a tree of depth 3
when_all(
  when_all(just(X{}), just(X{})),
  when_all(just(X{}), just(X{})));
Depthjust senders
in final tree
when_all senders
in final tree
X temporaries/moves
1101
2214
34312
48732
5161580

If this sender-expression was then passed to a co_await or sync_wait expression, then this results in the operation-state objects being constructed, which then generally moves the state from the final sender tree into an operation-state tree.

However, with the introduction of transform_sender()-based customization, this can result in moving the entire tree again (the default_domain::transform_sender() function returns a new prvalue sender rather than just returning the input reference.

Further, if the pipe syntax is used, then the number of moves of each leaf argument in a sender tree generally increases by 1 as you first need to move the object into a temporary adaptor object, before then moving it into the initial leaf sender.

We should see if we can reduce the number of intermediate objects required for building large trees, if possible, by directly constructing senders using aggregate initialization.

lewissbaker commented 1 month ago

As a data-point, in the following snippet in the stdexec implementation, the X move-constructor is called 5 times!

#include <stdexec/execution.hpp>
#include <cstdio>

struct X {
    X() noexcept { std::puts("X::X()"); }
    X(X&&) noexcept { std::puts("X::X(X&&)"); }
    ~X() noexcept { std::puts("X::~X()"); }
};

int main() {
    auto result = stdexec::sync_wait(stdexec::just(X{}));
    std::puts("done");
}

We could reduce this to 4 moves by fixing the implementation of transform_sender() to not return copies in the case that it is a no-op (bug filed at https://github.com/NVIDIA/stdexec/issues/1329).

We could reduce this to 2 moves if we can support constructing the just sender by aggregate initialization instead of having to move the value into the tuple and then move the tuple into the just sender.