P1684: mdarray iteration over elements needs a parallel execution policy

@crtrott Moved from https://github.com/kokkos/mdarray/issues/13.

Problem summary

Users want mdarray to iterate over its elements (to initialize, copy, or move them) using a specific execution policy. They may even want to use a specific "execution policy instance" (e.g., a CUDA stream). With the current design, users can achieve this for all but two mdarray operations by supplying a custom container, because those operations defer to the container for initialization, copying, and moving. The two operations that do not defer to the container are the mdarray constructors that need to iterate over the elements of an mdspan:

  template<class OtherElementType, class OtherExtents,
           class OtherLayoutPolicy, class Accessor>
    explicit(see below)
    constexpr mdarray(mdspan<OtherElementType, OtherExtents,
                             OtherLayoutPolicy, Accessor> other);
  template<class OtherElementType, class OtherExtents,
           class OtherLayoutPolicy, class Accessor,
           class Alloc>
    explicit(see below)
    constexpr mdarray(mdspan<OtherElementType, OtherExtents,
                             OtherLayoutPolicy, Accessor> other,
                      const Alloc& a);
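
To make the problem concrete, here is a minimal sketch of the sequential element-by-element copy that such a constructor has to perform. It deliberately avoids `<mdspan>`: `strided_view_2d` and `copy_to_row_major` are hypothetical stand-ins invented for this illustration, not part of P1684 or any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a rank-2 mdspan with arbitrary strides.
// It is NOT std::mdspan; it only models "pointer + extents + strides".
template <class T>
struct strided_view_2d {
  const T* data;
  std::size_t extent0, extent1;  // number of rows, number of columns
  std::size_t stride0, stride1;  // distance between consecutive i / j

  const T& operator()(std::size_t i, std::size_t j) const {
    assert(i < extent0 && j < extent1);
    return data[i * stride0 + j * stride1];
  }
};

// Sequential copy into contiguous row-major storage. There is no place
// here to pass an execution policy or a stream, which is the gap this
// issue describes.
template <class T>
std::vector<T> copy_to_row_major(const strided_view_2d<T>& src) {
  std::vector<T> dst;
  dst.reserve(src.extent0 * src.extent1);
  for (std::size_t i = 0; i < src.extent0; ++i) {
    for (std::size_t j = 0; j < src.extent1; ++j) {
      dst.push_back(src(i, j));
    }
  }
  return dst;
}
```

The loop nest is fixed and serial; a user-provided container never gets a chance to influence how or where it runs.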

Desired features

Users would like two things, in decreasing order of priority.

1. They would like to give these constructors a possibly nondefault CUDA stream ("execution policy instance") for copying the elements.
2. They would like multidimensional copying not to be slow, in the case that the input mdspan is not contiguous.

Proposed solution

Summary:

- Customization happens in `mdarray(mdspan, const Alloc&)` constructor
- Specify parallel multidimensional array copy
- Specify a way to get an optional execution policy instance from `Allocator`
- Specify a way to get an mdspan accessor from container

Customization happens in `mdarray(mdspan, const Alloc&)` constructor

There are only two reasonable places to store and access a nondefault CUDA stream: the input mdspan's Accessor, and the second constructor's allocator (Alloc) parameter. Of these two things, only allocators have anything to do with execution. Accessor just describes how to get a reference from a pointer and an offset, while C++ allocators are also tied to construction and destruction of objects in allocated memory. Furthermore, a constructor that takes an allocator instance strongly suggests possibly nondefault allocation behavior. Thus, it's reasonable to limit custom execution policy behavior to the constructor that takes an allocator instance.

Note that Alloc need not be a C++ allocator. We could instead specify customization points for getting an allocator and an execution policy from Alloc.

Specify parallel multidimensional array copy

The only standard algorithm for copying elements in parallel in C++20 (and likely C++23) is the parallel overload of std::copy. Using it to copy a multidimensional mdspan into a (1-D) container would require an iterator range over the mdspan's elements as input. Such a range can be built even for a noncontiguous mdspan by composing cartesian_product, iota, and transform (to map from a multidimensional index range to a range over elements). However, doing so would flatten the multidimensional index range. This has two performance issues.

- Optimal iteration order depends on the layout of both the mdspan and the mdarray.
- Flattening loses potential vendor optimizations for multidimensional array copy (e.g., cudaMemcpy3D).
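
The first issue can be illustrated with a small sketch. It assumes a column-major source and a row-major destination, and uses plain index arithmetic instead of ranges; the function names are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (i, j) in a column-major (layout_left-like) array
// with num_rows rows.
inline std::size_t column_major_offset(std::size_t i, std::size_t j,
                                       std::size_t num_rows) {
  assert(i < num_rows);
  return i + j * num_rows;
}

// Source offsets touched when a flat index k = 0, 1, ... over the
// row-major destination is mapped back to (i, j) and then into a
// column-major source. This mimics what a flattened 1-D copy does.
inline std::vector<std::size_t> source_offsets_for_flat_copy(
    std::size_t num_rows, std::size_t num_cols) {
  std::vector<std::size_t> offsets;
  offsets.reserve(num_rows * num_cols);
  for (std::size_t k = 0; k < num_rows * num_cols; ++k) {
    const std::size_t i = k / num_cols;  // row of destination element k
    const std::size_t j = k % num_cols;  // column of destination element k
    offsets.push_back(column_major_offset(i, j, num_rows));
  }
  return offsets;
}
```

For a 2 x 3 array, the destination is written at offsets 0, 1, 2, 3, 4, 5 while the source is read at offsets 0, 2, 4, 1, 3, 5. A flat iteration order that is contiguous for one side is strided for the other, and a 1-D algorithm has no information with which to pick a better traversal (such as tiling).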

P1673 proposes an mdspan copy algorithm with a parallel ExecutionPolicy&& overload. This would let vendors solve both performance issues. It may be reasonable to split copy from the rest of P1673.

Other ways of copying the elements, such as using the container's (from_range_t, R&&, const Allocator&) constructor, would have the same flattening issue.

Specify a way to get an optional execution policy instance from `Allocator`

There is currently no generic way to get a "preferred execution policy instance" from an allocator. Without changing this, vendors could not use an existing mdarray implementation to get the desired features. They would need to subclass or wrap mdarray.
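
One possible shape for such a customization point is sketched below, under stated assumptions: the member function name `preferred_execution_policy`, the tag types, and `stream_alloc_sketch` are all hypothetical, and plain structs stand in for real execution policies so that the example does not depend on `<execution>` or CUDA.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical stand-ins; neither type is part of any proposal.
struct default_policy_sketch {};                 // "no preference"
struct stream_policy_sketch { int stream_id; };  // models a CUDA stream

// Detects a (hypothetical) member function preferred_execution_policy().
template <class Alloc, class = void>
struct has_preferred_policy : std::false_type {};
template <class Alloc>
struct has_preferred_policy<
    Alloc, std::void_t<decltype(std::declval<const Alloc&>()
                                    .preferred_execution_policy())>>
    : std::true_type {};

// Sketch of the customization point: ask the Alloc argument for its
// policy if it offers one, otherwise fall back to a default.
template <class Alloc>
auto execution_policy(const Alloc& a) {
  if constexpr (has_preferred_policy<Alloc>::value) {
    return a.preferred_execution_policy();
  } else {
    return default_policy_sketch{};
  }
}

// Hypothetical Alloc argument that carries a stream. It need not be a
// full C++ allocator for the purposes of this sketch.
struct stream_alloc_sketch {
  int stream_id = 0;
  stream_policy_sketch preferred_execution_policy() const {
    return stream_policy_sketch{stream_id};
  }
};

// Small self-check exercising the customized path.
inline void execution_policy_example() {
  stream_alloc_sketch a{7};
  assert(execution_policy(a).stream_id == 7);
}
```

With something like this in place, an unmodified mdarray implementation could call `execution_policy(a)` for any Alloc, and `std::allocator` would simply take the default path.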

Specify a way to get an mdspan accessor from container

P1673's copy copies from one mdspan to another. The output mdspan needs an accessor. This means that mdarray needs some way to get the preferred accessor from a given container.

The existing mdarray::operator mdspan starts with default_accessor<ElementType>, and assumes that this can be assigned to the resulting mdspan type. This doesn't solve the problem of needing a custom container's preferred accessor.
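
A sketch of what a `container_accessor` customization point might look like follows. It is only an illustration: `default_accessor_sketch` mimics part of the interface of `default_accessor` so that the example compiles without `<mdspan>`, and the member function name `preferred_accessor`, `device_accessor_sketch`, and `device_buffer_sketch` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Stand-in that mimics part of default_accessor's interface.
template <class ElementType>
struct default_accessor_sketch {
  using element_type = ElementType;
  using reference = ElementType&;
  using data_handle_type = ElementType*;
  reference access(data_handle_type p, std::size_t i) const noexcept {
    return p[i];
  }
};

// Hypothetical accessor that remembers which device owns the memory.
template <class ElementType>
struct device_accessor_sketch {
  int device_id;
  ElementType& access(ElementType* p, std::size_t i) const noexcept {
    return p[i];
  }
};

// Detects a (hypothetical) member function preferred_accessor().
template <class Container, class = void>
struct has_preferred_accessor : std::false_type {};
template <class Container>
struct has_preferred_accessor<
    Container, std::void_t<decltype(std::declval<const Container&>()
                                        .preferred_accessor())>>
    : std::true_type {};

// Sketch of the customization point; defaults to the plain accessor.
template <class Container>
auto container_accessor(const Container& c) {
  if constexpr (has_preferred_accessor<Container>::value) {
    return c.preferred_accessor();
  } else {
    return default_accessor_sketch<typename Container::value_type>{};
  }
}

// Hypothetical custom container that advertises its preferred accessor.
struct device_buffer_sketch {
  using value_type = double;
  int device_id = 0;
  device_accessor_sketch<double> preferred_accessor() const {
    return device_accessor_sketch<double>{device_id};
  }
};

// Small self-check exercising the default path.
inline void container_accessor_example() {
  std::vector<int> v{10, 20, 30};
  assert(container_accessor(v).access(v.data(), 2) == 30);
}
```

Standard containers such as `std::vector` take the default path, while a vendor container can return an accessor that carries whatever state its memory space requires.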

Solution sketch

template<class OtherElementType, class OtherExtents,
         class OtherLayoutPolicy, class Accessor,
         class Alloc>
  explicit(see below)
    constexpr mdarray(mdspan<OtherElementType, OtherExtents,
      OtherLayoutPolicy, Accessor> other, const Alloc& a)
  // New container constructor takes a without_initializing_t tag.
  // If this constructor doesn't exist,
  // use container_(map_.required_span_size(), a) instead
  // and take the performance hit of re-initializing.
  // Custom container could extract preferred execution policy
  // (e.g., CUDA stream) from Alloc, e.g., for cudaMallocAsync.
  // This sketch assumes that map_ is declared (and thus initialized)
  // before container_, so that the container can be sized from map_.
  : map_(other.extents()),
    container_(without_initializing, map_.required_span_size(), a)
{
  // container_accessor customization point gets the container's preferred accessor.
  // It defaults to default_accessor<ElementType>.
  auto output_accessor = container_accessor(container_);
  mdspan<ElementType, Extents, LayoutPolicy, decltype(output_accessor)>
    output{container_.data(), map_, output_accessor};

  // execution_policy customization point gets the Alloc input's
  // preferred execution policy instance.
  auto exec_policy = execution_policy(a);

  // P1673 puts copy in the std::linalg namespace.
  // Splitting copy into a separate proposal would likely change this.
  linalg::copy(exec_policy, other, output);
}
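
The sketch above relies on a container constructor that takes a `without_initializing_t` tag. A minimal, hedged illustration of such a container is given below; `uninit_buffer_sketch` is a hypothetical name, the constructor signature (tag, size, allocator) is an assumption, and the sketch is restricted to trivial element types to keep lifetime handling simple.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>

// Tag type in the style of std::in_place_t. The name comes from the
// solution sketch; its exact definition is assumed here.
struct without_initializing_t { explicit without_initializing_t() = default; };
inline constexpr without_initializing_t without_initializing{};

// Hypothetical container that allocates storage without value-initializing it.
template <class T, class Alloc = std::allocator<T>>
class uninit_buffer_sketch {
  static_assert(std::is_trivially_default_constructible<T>::value &&
                    std::is_trivially_destructible<T>::value,
                "this sketch only handles trivial element types");

 public:
  using value_type = T;

  uninit_buffer_sketch(without_initializing_t, std::size_t n,
                       const Alloc& a = Alloc())
      : alloc_(a),
        size_(n),
        data_(n == 0 ? nullptr
                     : std::allocator_traits<Alloc>::allocate(alloc_, n)) {
    // Default-initialize (not value-initialize) each element. For trivial
    // T this writes nothing, so no time is spent filling the buffer.
    for (std::size_t i = 0; i < size_; ++i) {
      ::new (static_cast<void*>(data_ + i)) T;
    }
  }

  uninit_buffer_sketch(const uninit_buffer_sketch&) = delete;
  uninit_buffer_sketch& operator=(const uninit_buffer_sketch&) = delete;

  ~uninit_buffer_sketch() {
    if (data_ != nullptr) {
      std::allocator_traits<Alloc>::deallocate(alloc_, data_, size_);
    }
  }

  T* data() noexcept { return data_; }
  std::size_t size() const noexcept { return size_; }
  T& operator[](std::size_t i) {
    assert(i < size_);
    return data_[i];
  }

 private:
  Alloc alloc_;
  std::size_t size_;
  T* data_;
};
```

Elements must be written before they are read, which is exactly what the copy in the constructor body does. A device container could follow the same pattern and use the Alloc argument to pick the stream for an asynchronous allocation.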
