executors / executors

A proposal for an executor programming model for ISO C++

Execution interfaces should be variadic #9

Closed: hkaiser closed this issue 4 years ago

hkaiser commented 8 years ago

All execution interfaces should be variadic in order to:

This is also consistent with the experiences we have gotten from our use of HPX where we implemented a variadic executor interface to good effect.

K-ballo commented 8 years ago

+1 The interface should just do the right thing, packing the arguments to form a nullary callable if needed.

jaredhoberock commented 8 years ago

Thanks for the suggestion Hartmut. I agree that the user shouldn't have to perform tasks like parameter forwarding that can be applied mechanically and in a general purpose fashion. That's why this role is performed by higher-level control structures like std::async & std::invoke.

I think this is a reasonable separation because as currently specified, neither an executor client nor an executor author has to perform these gymnastics. If executor operations were required to be variadic, instead of simply receiving nullary functions, parameter forwarding would be an obstacle for executor implementors.

hkaiser commented 8 years ago

Forcing the executor interfaces to be nullary would impose unnecessary implementation complexity on things like the parallel algorithms. Why sprinkle parameter packing/unpacking over all the places where executors are used?

Also, users might want to use executors directly in their code.

Making the executor interfaces nullary requires using a lambda, which is error prone (arguments must be forwarded into the capture, must be moved/forwarded out of the capture when calling the function, and the lambda has to be mutable):

template <typename F, typename ... Args>
future<result_of_t<F&&(Args&&...)>> foo(F && f, Args && ... args)
{
    return execution::async_execute(my_exec,
        [f = forward<F>(f), ...args = forward<Args>(args)]() mutable  // pack init-capture: C++20
        {
             return invoke(forward<F>(f), forward<Args>(args)...);
        });
}

Alternatively we would have to introduce a lightweight bind which moves its arguments, e.g. something like:

template <typename F, typename ... Args>
future<result_of_t<F&&(Args&&...)>> foo(F && f, Args && ... args)
{
    return execution::async_execute(my_exec,
        deferred_call(forward<F>(f), forward<Args>(args)...)
    );
}
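
A minimal sketch of such a deferred_call (the name is just the placeholder used above, not an existing facility) could store decayed copies of the callable and its arguments in a tuple and apply them on invocation:

template <typename F, typename ... Args>
auto deferred_call(F && f, Args && ... args)
{
    // store decayed copies; make_tuple decays its arguments
    return [f = forward<F>(f), args = make_tuple(forward<Args>(args)...)]() mutable
    {
        // move the stored state out when the call is finally made
        return apply(move(f), move(args));
    };
}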

All of this increases implementation complexity.

At the same time, with a variadic interface this would just look like:

template <typename F, typename ... Args>
future<result_of_t<F&&(Args&&...)>> foo(F && f, Args && ... args)
{
    return execution::async_execute(my_exec, forward<F>(f), forward<Args>(args)...);
}

From all of this I think that making the interfaces variadic adds a small amount of specification complexity, but massively reduces implementation complexity of all code which uses executors. The executors themselves would not necessarily become more complex (sync_execute in the simplest case will call invoke, and async_execute would in the simplest case use thread - both of which are already variadic).
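
For illustration only, a minimal variadic executor along these lines could be sketched as follows (simple_executor is a made-up name; async is used as a stand-in for a thread-based implementation):

struct simple_executor
{
    // sync_execute: just invoke the callable with the forwarded arguments
    template <typename F, typename ... Args>
    auto sync_execute(F && f, Args && ... args)
    {
        return invoke(forward<F>(f), forward<Args>(args)...);
    }

    // async_execute: launch asynchronously and return a future
    template <typename F, typename ... Args>
    auto async_execute(F && f, Args && ... args)
    {
        return async(launch::async, forward<F>(f), forward<Args>(args)...);
    }
};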

hcedwar commented 8 years ago

Agree and strongly support Hartmut's position on a variadic execute interface.
If my pull request were honored, our collaboration would have a version-controlled discussion point outside of an issue thread.

mjgarland commented 8 years ago

I'm inclined to agree that the customization points would benefit from being variadic, while we could leave the actual methods on the executors as is. This would centralize the concerns of parameter forwarding in the customization points. Moreover, it would provide an incentive to use the customization points rather than the methods directly, since the customization points would offer the convenience of a variadic interface.
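
A sketch of that arrangement (illustrative names only): the customization point is variadic and does the packing once, while the executor's method keeps accepting a nullary callable:

namespace execution
{
    template <typename Executor, typename F, typename ... Args>
    auto async_execute(Executor& exec, F && f, Args && ... args)
    {
        // pack the arguments into a nullary callable here, in one place,
        // instead of at every call site
        return exec.async_execute(
            [f = forward<F>(f), args = make_tuple(forward<Args>(args)...)]() mutable
            {
                return apply(move(f), move(args));
            });
    }
}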

Hartmut, is this what you had in mind or are you also suggesting that the methods themselves be required to have a variadic interface?

hkaiser commented 8 years ago

While I was thinking of all execution functionality, having at least the customization points being variadic would be a step in the right direction.

@mjgarland What rationale do you have for wanting the executor interfaces to expect nullary functions? I have described the advantages of making those variadic above, but I still fail to see what we could possibly gain from constraining ourselves to nullary functions. In my opinion, we at least don't lose anything by making them variadic, and most probably we'd gain in simplicity.

jaredhoberock commented 8 years ago

Thanks for the discussion. Here's some rationale behind the scope of the customization point design.

I envision the functions in the execution namespace as a customization interface rather than a convenience interface. I agree that a separate convenience/usability interface would also be valuable, for the reasons you've mentioned in this thread, and other reasons as well. This is why this proposal includes both std::async and execution::async_execute. If execution::async_execute were made variadic, the two functions would become redundant. The same is true of std::invoke and execution::execute. Separate, variadic bulk analogues of async & invoke make sense to me, but I think they are out of scope for this minimal proposal.

Moreover, I think it's important to reserve additional parameter slots in customization points for parameters which carry semantics that executors will actually need to customize. It's also not clear that there is any reason that an executor author would want to customize how variadic parameters get forwarded. Does anyone know of one?

Finally, I'm concerned that variadic parameters would preclude an important piece of functionality that bulk executors will want to provide. We envision that a future proposal will introduce the ability to create hierarchical executors to support things like GPUs, as well as executors composed of other, more primitive executors. Each level of such a hierarchy will require a separate shared parameter factory. In such a world, the function signature I envision would look like this:

template<class Executor, class Function1, class Function2, class... Functions>
result_of_t<Function1(executor_shape_t<Executor>)>
bulk_execute(Executor& exec, Function1 f, Function2 result_factory, Functions... shared_parameter_factories);

If bulk_execute were made variadic in order to accommodate variadic parameters passed to f, it would preclude having multiple shared parameters. Since it's possible to close over variadic, agent-local parameters with a lambda, but not possible to do so for shared parameters, it seems to me that we should dedicate the extra parameter slots to the latter.
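
For example, agent-local arguments can be captured at the call site without widening the signature above (a sketch; f, x, y, exec and the factories are assumed to exist elsewhere):

// close over the agent-local inputs x and y with a lambda ...
auto g = [f, x, y](auto idx, auto& result, auto& shared)
{
    result[idx] = f(idx, x, y, shared);
};

// ... while the remaining parameter slots stay reserved for the
// result factory and the shared parameter factories
bulk_execute(exec, g, result_factory, shared_parameter_factory);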

An interface like this isn't necessarily easy to use directly (what if I don't want to return a result? what if I don't have any shared parameters?), but it is easy for executor authors to customize because they only have to worry about essentials.

Is this line of reasoning convincing? What do people think of the idea of a separate convenience interface that would support variadic parameters as well as other features that aren't conveniently possible through the more primitive, customization interface? I think this gives us the best of both worlds -- the customization interface would be optimized for ease of executor authorship, while the convenience interface would be optimized for ease of executor use.

fraggamuffin commented 8 years ago

I understand the need for variadic arguments, as it matches what we have now: std::async and the std::thread constructor both support it. The customization points can work with it, and we don't have lifetime issues because SYCL captures everything by value in the lambda and encloses it in an accessor object. But we might also work with nullary objects, if you define a nullary object as a functor with by-value captures whose function call operator takes an index type.

hkaiser commented 8 years ago

@jaredhoberock Thanks for your explanation.

Separate, variadic bulk analogues of async & invoke make sense to me, but I think they are out of scope for this minimal proposal.

I'd tend to agree if the baseline interfaces exposed by executors were variadic to begin with.

Moreover, I think it's important to reserve additional parameter slots in customization points for parameters which carry semantics that executors will actually need to customize

It probably would be impossible (or at least very difficult) to pass arbitrary additional parameters to bulk-execute in a generic way. Every executor type may require passing a different set of parameters. An alternative would be to bundle any parameter for an executor with the executor itself. Needless to say, this would also allow for having a variadic interface.

Each level of such a hierarchy will require a separate shared parameter factory...

Could you please give an example for this? I'm not sure I fully understand what you mean by 'parameter factories' here.

jaredhoberock commented 8 years ago

The current wording in the proposal is cryptic about what's going on with the additional parameters to the bulk execute functions. Here's the basic idea.

Assume we want to return results from multi-agent executor operations, just like we can from single-agent executor operations. Because there are multiple execution agents involved in the bulk task, we need to treat returns a bit differently.

Suppose we want to create an executor abstracting this sort of for loop:

std::vector<int> result(n);
int shared_by_all_iterations = 0;
for(int i = 0; i < n; ++i)
{
  f(i, result, shared_by_all_iterations);
}

return result;

This for loop allows f to perform side effects on two variables: result & shared_by_all_iterations. The result variable is returned by this function, and the shared_by_all_iterations variable is discarded.

A bulk executor could wrap up this for loop in the bulk_execute() customization point:

template<class Function, class ResultFactory, class SharedFactory>
result_of_t<ResultFactory()>
bulk_execute(Function f, size_t n,
             ResultFactory result_factory,
             SharedFactory shared_factory)
{
  auto result = result_factory();
  auto shared = shared_factory();

  for(size_t i = 0; i < n; ++i)
  {
    f(i, result, shared);
  }

  return result;
}

It turns out that "factories" (i.e., just functions that produce some value that will be used as a parameter to some other function) seem to be the most general way to initialize these special parameters. Moreover, each individual executor can customize the physical location of these variables to ensure they may be accessed by execution agents efficiently.
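
For example, a hypothetical call to the bulk_execute() above could look like this; the result factory creates the container that is returned, and the shared factory creates the value visible to all agents:

auto squares = bulk_execute(
    // f: each agent writes its own element and reads the shared offset
    [](size_t i, vector<int>& result, int& offset)
    {
        result[i] = static_cast<int>(i * i) + offset;
    },
    100,                                  // n: number of agents
    [] { return vector<int>(100); },      // result_factory
    [] { return 42; });                   // shared_factory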

Now, assume that in a future proposal we want to introduce executors with hierarchical scopes, for example because we want to exploit the physical hierarchies present in parallel architectures, or because we want to collect multiple primitive executors together and distribute work among them.

A future proposal could introduce hierarchically-scoped executors by generalizing the form of bulk_execute().

For example, this is what it would look like when implemented with doubly-nested for loops:

template<class Function, class ResultFactory, class OuterFactory, class InnerFactory>
result_of_t<ResultFactory()>
bulk_execute(Function f, tuple<size_t,size_t> shape,
             ResultFactory result_factory,
             OuterFactory outer_factory, InnerFactory inner_factory)
{
  auto result = result_factory();
  auto outer_shared = outer_factory();

  for(size_t i = 0; i < get<0>(shape); ++i)
  {
    auto inner_shared = inner_factory();

    for(size_t j = 0; j < get<1>(shape); ++j)
    {
      f(make_tuple(i,j), result, outer_shared, inner_shared);
    }
  }

  return result;
}

Each additional nested scope requires a new factory parameter to initialize that scope's shared parameter. This implies a general form of bulk_execute() would look like this:

template<class Function, class ResultFactory, class... SharedFactories>
result_of_t<ResultFactory()>
bulk_execute(Function f, shape_type shape,
             ResultFactory result_factory,
             SharedFactories... shared_factories);

So, I am worried about introducing a variadic parameter pack into the bulk executor customization points, because I'm concerned it would preclude this use case of hierarchical executors.

Does that make sense?

jaredhoberock commented 8 years ago

After a phone call with Hartmut, we agreed that there should definitely be a "sugar" interface for creating work on executors, and this interface would incorporate variadic parameters and other amenities. Such a usability interface can be made separate from executor customization points (cf. std::async()/async_execute() & std::invoke()/sync_execute()), and is something which we should pursue in a follow up to this minimal paper.

fraggamuffin commented 8 years ago

So this is syntactic sugar to enable a variadic interface? Can we see an example?

jaredhoberock commented 8 years ago

So this is syntactic sugar to enable a variadic interface? Can we see an example?

What I have in mind for a "sugar" interface would be bulk-style analogues to functions like std::invoke and std::async. These functions aren't customization points. Rather, they are sugared, variadic interfaces which lower onto executor customization points such as execution::sync_execute and execution::async_execute.

We could introduce similar bulk-style analogues to std::invoke and std::async which would accept variadic parameters. bulk_invoke and bulk_async in the Agency library are examples of such interfaces. These are the kinds of features we can explore in a follow-up to the initial, minimal Issaquah paper.
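
As a rough sketch (not the actual Agency interfaces), such a sugared bulk_invoke could pack the extra arguments once and lower onto a factory-based bulk customization point, supplying trivial factories when the caller doesn't need results or shared state:

template <typename Executor, typename F, typename Shape, typename ... Args>
void bulk_invoke(Executor& exec, F && f, Shape shape, Args && ... args)
{
    // assumed customization point signature:
    // bulk_execute(exec, f, shape, result_factory, shared_factory)
    execution::bulk_execute(exec,
        [f = forward<F>(f), args = make_tuple(forward<Args>(args)...)]
        (auto idx, auto&, auto&) mutable
        {
            apply([&](auto& ... as) { f(idx, as...); }, args);
        },
        shape,
        [] { return 0; },    // result_factory: no result wanted, dummy value
        [] { return 0; });   // shared_factory: no shared state wanted
}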

ccmysen commented 8 years ago

So we spoke last week about dropping the variadic interfaces from the core; I don't see that state change in the current version of the repository (the advantage is that it makes it easier to add in things like allocators). Are we able to close this based on that?

jaredhoberock commented 8 years ago

I think the remaining variadic interfaces in the paper are in the executor categories derived from OneWayExecutor. My understanding was that @chriskohlhoff was OK with eliminating them.

hkaiser commented 7 years ago

I would like to get back to the discussion related to the customization points for executors being variadic.

The main argument brought forward against variadic interfaces (see above) was that this would preclude us from having fully flexible bulk interfaces, as those would require optional arguments for various kinds of factory functions. Moreover, @jaredhoberock states above that he anticipates even more special arguments for certain (hierarchical) executor types.

However, I think this makes the specification for the bulk functions (and the prescribed ways to use them) overly restrictive. The requirement for the bulk customization points to (perhaps optionally) accept various factory functions forces them into exactly one style of usage.

What is the rationale of specifying the bulk functions to be used as (for instance):

 bulk_execute(exec, f, shape, my_shared_factory);

over a more general:

auto shared = my_shared_factory();
bulk_execute(exec, f, shape, shared);

or possibly:

my_executor_with_shared_data my_exec(exec, my_shared_factory);
bulk_execute(my_exec, f, shape);

etc.?

Over-specifying the meaning of certain arguments passed to the customization points prevents them from expressing general use cases, even ones we don't anticipate today.

Also, I believe the first (and current) way of specifying the bulk operations unnecessarily leaks executor implementation details through a supposedly very generic API.

Executors are meant to be the lowest level task scheduling interface (i.e. every related facility in the library and in user code should use them for spawning tasks). That means the executor APIs should be as generic as possible, without enforcing a certain style of coding.

jaredhoberock commented 7 years ago

What is the rationale of specifying the bulk functions to be used as (for instance):

bulk_execute(exec, f, shape, my_shared_factory);

over a more general:

auto shared = my_shared_factory();
bulk_execute(exec, f, shape, shared);

The short answer is that passing a factory function rather than a parameter itself is the most general purpose and efficient way to pass shared parameters to bulk operations.

There are a few concrete issues with the suggested second form which factories avoid.

  1. Some important types are not movable, especially concurrency primitives such as std::atomic and std::barrier. These will be common types of shared parameters. These parameters cannot be moved into bulk operations like bulk_execute(), but they can be returned from factory functions in C++17 (see the sketch after this list).

  2. Some important types are not efficient to copy, especially containers used as scratchpads. These will also be common types of shared parameters. Consider a matrix parameter. If passed as a parameter, the executor must first make a global temporary copy of the matrix. Then, a copy of this temporary must be made for each group of agents created by a bulk operation like bulk_execute(). On the other hand, when represented by a factory function, the result of the factory directly initializes a local object.

  3. Controlling the placement of shared parameters will be critical for the performance for many use cases. It's not clear how to do that if shared parameters are passed directly rather than through a factory. On the other hand, in a future proposal, factories can be extended to support an associated allocator through an additional interface incorporated into the factory type requirements.
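
To make point 1 above concrete (a sketch, using the bulk_execute(exec, f, shape, ...) form discussed in this thread):

// passing the atomic itself would require copying or moving it into the
// operation's shared state, which std::atomic does not support:
//   bulk_execute(exec, f, shape, atomic<int>{0});

// a factory avoids this: its result initializes the shared parameter
// in place thanks to guaranteed copy elision (C++17)
bulk_execute(exec, f, shape,
    [] { return atomic<int>{0}; });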

hkaiser commented 7 years ago

I think we can implement everything as desired by exposing a variadic bulk execution interface and a couple of additional wrapping executors exposing the functionality you're looking for.

Let's assume we have a two-way executor (for simplicity I'll show only the synchronous bulk execute implementation). I also assume that it exposes functionality which is callable through a variadic customization point sync_bulk_execute:

struct two_way_executor
{
    template <typename F, typename ...Ts>
    void sync_bulk_execute(F && f, size_t n, Ts &&... ts)
    {
        for (size_t i = 0; i != n; ++i)
        {
            f(i, ts...);
        }
    }
};

namespace execution
{
    // define customization point
    template <typename Executor, typename F, typename Shape, typename ...Ts>
    auto sync_bulk_execute(Executor && exec, F && f, Shape const& shape, Ts &&... ts)
    {
        return exec.sync_bulk_execute(forward<F>(f), shape, forward<Ts>(ts)...);
    }
}

Now it is trivial to achieve the functionality relying on factory functions for shared data by defining another executor which exposes functionality similar to bulk_sync_execute as specified by P0443:

template <typename Executor, typename RF, typename SF>
struct shared_data_executor
{
    template <typename F, typename ...Ts>
    auto sync_bulk_execute(F && f, size_t n, Ts &&... ts)
    {
        // implement as needed using the factory functions, while relying on
        // wrapped executor to do the work
        auto shared_results = rf(n);
        auto shared_data = sf();

        // from the standpoint of 'f' there is no discernible difference
        // to a direct implementation as mandated by P0443
        return execution::sync_bulk_execute(exec,  forward<F>(f), n,
            std::ref(shared_results), std::ref(shared_data), 
            forward<Ts>(ts)...);
    }

    Executor exec;
    RF rf;
    SF sf;
};

template <typename Executor, typename RF, typename SF>
shared_data_executor<decay_t<Executor>, decay_t<RF>, decay_t<SF> >
make_shared_data_executor(Executor && exec, RF && rf, SF && sf)
{
    return shared_data_executor<decay_t<Executor>, decay_t<RF>, decay_t<SF> >{
        forward<Executor>(exec), forward<RF>(rf), forward<SF>(sf) };
}

Now invoke almost as before:

size_t shape = 100;
auto f = [](size_t, auto& shared_results, auto& shared_data) { ... };
auto result_factory = [](size_t n) {...};
auto shared_factory = []() {...};

two_way_executor base_exec;
execution::sync_bulk_execute(
    make_shared_data_executor(base_exec, result_factory, shared_factory),
    f, shape);

Am I missing something?

jaredhoberock commented 7 years ago

As a follow-up to our phone call, I said I would provide some examples of how our executors use these factories.

Here is an implementation of an executor which creates a group of concurrent threads on every call to bulk_then_execute(). As an optimization, it places the shared state on the first thread's stack.

Here is a CUDA implementation of our GPU executor. The implementation is fairly complex, so I'm not sure how illustrative this code is. I will try to update this issue with a simpler GPU executor which still demonstrates essentially the same point, which is that we want to be able to use a CUDA __shared__ variable to store the shared state.

jaredhoberock commented 7 years ago

Here's an update with an example of how a very simple GPU executor uses a CUDA __shared__ variable as storage for a shared parameter. The example doesn't actually do anything very interesting, but it should illustrate how the special semantics of the factories are used to communicate where the corresponding parameter may be located.

jaredhoberock commented 6 years ago

@hkaiser In Albuquerque, you mentioned that you had a use case whereby executors customize their behavior based on the values received as variadic execution function parameters. Could you share some details in this thread?

biddisco commented 6 years ago

@jaredhoberock, @hkaiser I have a couple of use cases I'm working on. The simplest is something like an executor that accepts a priority

hpx::async(priority_exec, 8.75, callable, param1, param2, ...);

This executor happens to use a scheduler on a thread pool that supports a priority queue for tasks; it forwards the priority to the scheduler and the rest of the params onwards to the callable as usual. You could argue that this isn't really a need for variadic arguments, since this is a special overload of async itself.
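
A sketch of how such an executor could peel the priority off the front of the pack (the member signature and the scheduler are placeholders, not actual HPX API):

struct priority_executor
{
    template <typename Priority, typename F, typename ... Ts>
    auto async_execute(Priority priority, F && f, Ts && ... ts)
    {
        // hand the priority to the underlying scheduler, forward the rest
        // of the pack to the callable as usual
        return scheduler_.async(priority, forward<F>(f), forward<Ts>(ts)...);
    }

    priority_scheduler& scheduler_;   // hypothetical priority-queue scheduler
};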

A more sophisticated use is for scheduling based on the numa domain that a memory block is assigned to. Consider the function

void compute_dgemm(future<matrix_tile> && t1, future<matrix_tile> &&t2) {
  expensive_calculation(t1.get(), t2.get());
}

and it turns out that NUMA effects come into play if this task is executed on NUMA node 0 when the tiles being multiplied were first-touched by NUMA node 1. To solve this problem, I have a guided_executor that is templated over a numa_hint_function, which in this minimal example would look like

template <typename MatrixType>
struct HPX_EXPORT pool_numa_hint<MatrixType, cholesky_tag> {
  int operator()(const MatrixType& matrix1, const MatrixType& matrix2) const {
    // the memory being written to will come from matrix2
    const typename MatrixType::ElementType* address2 = matrix2.ptr();
    return threads::get_topology().get_numa_domain(address2);
  }
};

The guided executor accepts the task to be run, but delays scheduling it until the futures representing the two matrix tiles are ready. At that point it can extract the values from futures t1 and t2 and call the hint function with the two actual tiles of data (not the futures), and the system can tell us which NUMA domain currently owns the memory. We can then pass the numa hint on to the scheduler, so that it can place the task on a core assigned to that memory controller. The futures are then passed into the real task as usual, so the async()/.then() interface is unchanged. (Note that the future unwrapping relies on some HPX tricks to extract the value from the future for the hint function without invalidating it, so we can reuse it; this is tangential to the main discussion.)

The implementation of this NUMA guided executor relies on the fact that the HPX executor interface is variadic, so the pack of arguments can be used to generate the correct calls to the hint function from the unwrapped parameters. It may be possible to do this another way, but this worked for me and is proving extremely useful.
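
A stripped-down sketch of the idea (the future-unwrapping machinery, the peek_value helper and the scheduling call are placeholders; the real implementation also delays scheduling until the futures are ready):

template <typename NumaHint>
struct guided_executor
{
    // the same variadic pack feeds both the hint function (with the
    // unwrapped values) and, unchanged, the real task
    template <typename F, typename ... Ts>
    auto async_execute(F && f, Ts && ... ts)
    {
        int numa_domain = hint_(peek_value(ts)...);    // placeholder helper
        return pool_.async_on_domain(numa_domain,      // placeholder scheduling call
            forward<F>(f), forward<Ts>(ts)...);
    }

    NumaHint hint_;
    numa_aware_pool& pool_;   // hypothetical NUMA-aware thread pool
};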

I'm not sure if this brief example is enough to fully appreciate the use-case - in which case I'm happy to elaborate.

vinniefalco commented 5 years ago

How will the polymorphic executor wrapper forward a parameter pack into the type-erased function object?

jaredhoberock commented 5 years ago

The execution functions are not variadic, so the polymorphic executor doesn’t have to attempt that.