Open mhoemmen opened 2 years ago
See also https://github.com/kokkos/mdarray/issues/20 . The above solution implies either being able to allocate the container without initializing, or something more complicated like ranges::to
construction with a new class of multidimensional iterators.
@crtrott Moved from https://github.com/kokkos/mdarray/issues/13.
Problem summary
Users want an `mdarray` that iterates over elements (to initialize, copy, or move them) using a specific execution policy. They may even want to use a specific "execution policy instance" (e.g., CUDA stream). Users can achieve this by using a custom container with the current design for all but two `mdarray` operations, since all other `mdarray` operations defer to the container for initialization, copying, and moving. The two `mdarray` operations that do not defer to the container are `mdarray` constructors that need to iterate over the elements of an `mdspan`.
Desired features
Users would like two things, in decreasing order of priority.
`mdspan` is not contiguous.
Proposed solution
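Summary:
1. Customization happens in `mdarray(mdspan, const Alloc&)` constructor
2. Specify parallel multidimensional array copy
3. Specify a way to get an optional execution policy instance from `Allocator`
4. Specify a way to get an mdspan accessor from container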
Customization happens in `mdarray(mdspan, const Alloc&)` constructor
There are only two reasonable places to store and access a nondefault CUDA stream: the input `mdspan`'s `Accessor`, and the second constructor's allocator (`Alloc`) parameter. Of these two things, only allocators have anything to do with execution. `Accessor` just describes how to get a reference from a pointer and an offset, while C++ allocators are also tied to construction and destruction of objects in allocated memory. Furthermore, a constructor that takes an allocator instance strongly suggests possibly nondefault allocation behavior. Thus, it's reasonable to limit custom execution policy behavior to the constructor that takes an allocator instance.
Note that `Alloc` need not necessarily be a C++ allocator. We could instead specify customization points for getting an allocator and execution policy from `Alloc`.
Specify parallel multidimensional array copy
The only way to copy elements in parallel in C++20 (and likely C++23) is the parallel overload of `std::copy`. Using this to copy a multidimensional `mdspan` into a (1-D) container would require an input iterator range for the `mdspan`. This is possible even for noncontiguous `mdspan` by using `cartesian_product`, `iota`, and `transform` (to map from a multidimensional index range to a range over elements). However, doing so would flatten the multidimensional index range. This has two performance issues.
`mdspan` and the `mdarray`.
`cudaMemcpy3D`).
P1673 proposes an `mdspan` `copy` algorithm with a parallel `ExecutionPolicy&&` overload. This would let vendors solve both performance issues. It may be reasonable to split `copy` from the rest of P1673.
Other ways of copying the elements, such as using the container's `(from_range_t, R&&, const Allocator&)` constructor, would have the same flattening issue.
Specify a way to get an optional execution policy instance from `Allocator`
There is currently no generic way to get a "preferred execution policy instance" from an allocator. Without changing this, vendors could not use an existing `mdarray` implementation to get the desired features. They would need to subclass or wrap `mdarray`.
Specify a way to get an mdspan accessor from container
P1673's `copy` copies from one `mdspan` to another. The output `mdspan` needs an accessor. This means that `mdarray` needs some way to get the preferred accessor from a given container.
The existing `mdarray::operator mdspan` starts with `default_accessor<ElementType>`, and assumes that this can be assigned to the resulting `mdspan` type. This doesn't solve the problem of needing a custom container's preferred accessor.
Solution sketch
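To make the two missing customization points concrete, here is a rough C++17 illustration. It is not proposed wording and does not use any real `mdarray` or `mdspan` implementation. Every name in it is an assumption made up for the example: `get_execution_policy`, `preferred_accessor_t`, the `execution_policy()` member, the `accessor_type` nested type, `sequential_policy`, `stream_policy`, `basic_accessor` (a stand-in shaped like `default_accessor`), `stream_allocator`, and `device_buffer`. The idea is simply "ask the allocator or container if it has a preference, otherwise fall back to a default."

```cpp
#include <cstddef>
#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical policy types; stand-ins, NOT the std::execution policies.
struct sequential_policy {};
struct stream_policy { int stream_id; };  // would wrap e.g. a CUDA stream

// Stand-in with the same shape as default_accessor (assumption).
template <class T>
struct basic_accessor {
  using element_type = T;
  using reference = T&;
  using data_handle_type = T*;
  constexpr reference access(data_handle_type p, std::size_t i) const {
    return p[i];
  }
};

// Detects a hypothetical member function: alloc.execution_policy().
template <class Alloc, class = void>
struct has_execution_policy : std::false_type {};
template <class Alloc>
struct has_execution_policy<
    Alloc,
    std::void_t<decltype(std::declval<const Alloc&>().execution_policy())>>
    : std::true_type {};

// Hypothetical customization point 1: the allocator's preferred execution
// policy instance if it exposes one, else a default sequential policy.
template <class Alloc>
constexpr auto get_execution_policy(const Alloc& alloc) {
  if constexpr (has_execution_policy<Alloc>::value) {
    return alloc.execution_policy();
  } else {
    (void)alloc;  // plain allocators carry no execution information
    return sequential_policy{};
  }
}

// Hypothetical customization point 2: the container's preferred accessor
// (nested accessor_type) if present, else the default-shaped accessor.
template <class Container, class = void>
struct preferred_accessor {
  using type = basic_accessor<typename Container::value_type>;
};
template <class Container>
struct preferred_accessor<Container,
                          std::void_t<typename Container::accessor_type>> {
  using type = typename Container::accessor_type;
};
template <class Container>
using preferred_accessor_t = typename preferred_accessor<Container>::type;

// Example allocator that carries a stream id (hypothetical).
template <class T>
struct stream_allocator {
  using value_type = T;
  int stream_id = 0;

  constexpr stream_allocator() = default;
  constexpr explicit stream_allocator(int id) : stream_id(id) {}
  template <class U>
  constexpr stream_allocator(const stream_allocator<U>& other)
      : stream_id(other.stream_id) {}

  T* allocate(std::size_t n) { return std::allocator<T>{}.allocate(n); }
  void deallocate(T* p, std::size_t n) { std::allocator<T>{}.deallocate(p, n); }

  constexpr stream_policy execution_policy() const {
    return stream_policy{stream_id};
  }
};
template <class T, class U>
constexpr bool operator==(const stream_allocator<T>& a,
                          const stream_allocator<U>& b) {
  return a.stream_id == b.stream_id;
}
template <class T, class U>
constexpr bool operator!=(const stream_allocator<T>& a,
                          const stream_allocator<U>& b) {
  return !(a == b);
}

// Example container type that advertises its own accessor (hypothetical).
struct tagged_accessor : basic_accessor<float> {};
struct device_buffer {
  using value_type = float;
  using accessor_type = tagged_accessor;
};
```

With something along these lines, the `mdarray(mdspan, const Alloc&)` constructor could call `get_execution_policy(alloc)` and pass the result to a parallel `copy`, and could build the output `mdspan` with `preferred_accessor_t<Container>`, all without subclassing or wrapping `mdarray`. Standard allocators and containers would keep working unchanged through the fallbacks.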