using vector_t = mdspan<double, dextents<int, 1>>;
using matrix_t = mdspan<double, dextents<int, 2>>;
// ... allocate and fill ...
vector_t x = /* ... */;
vector_t y = /* ... */;
matrix_t A = /* ... */;
// compute y = Ax
matrix_vector_product(y, A, x);
// compute norm2 of y
double val = vector_norm2(y);
// mixed precision, different layout
mdspan<float, dextents<int, 2>, layout_left> A_f =
/* allocate and fill */;
matrix_vector_product(y, A_f, x);
double val2 = vector_norm2(y);
D) Categories of things in P1673
Algorithms taking mdspan
In-place transformations
Tags
Layouts (layout_blas_packed)
E) Algorithms taking mdspan
Many algorithms, but wording for each is short, so much so that the header synopsis almost doubles the wording length. Would LWG consider letting us drop the synopsis?
E.1) Two categories of algorithms
Not-reduce-like: Take zero or more input mdspan(s) + (output or in/out) mdspan; return void
Example: matrix_vector_product
reduce-like: Take one or more input mdspan(s) + initial scalar value; return scalar result
Example: vector_norm2
E.2) [linalg.reqs.val] "Linear algebra value types"
The most interesting part of wording is not the algorithms themselves, but "Linear algebra value types" [linalg.reqs.val]: requirements on the mdspans' value types that interact in algorithms. One can think of this as a generalization of _GENERALIZED_SUM_. SG6 and LEWG especially discussed this in detail. Davis Herring and Zach Laine, among others, generously contributed wording review time.
F) In-place transformations
These are functions "viewing" an mdspan as another mdspan, with a possibly different layout and/or accessor. (We don't actually use the word "view.")
F.1) Main interface: conjugated, scaled, transposed
These are the "viewing" functions mentioned above. They take an mdspan (and possibly a scalar), and return another mdspan.
Function name          Effect                              What it changes
transposed             Matrix transpose (i,j) -> (j,i)     Layout
scaled                 Multiply all elements by a scalar   Accessor
conjugated             Complex conjugate of all elements   Accessor
conjugate_transposed   transposed and conjugated           Both
Key feature: Result is an mdspan. P1673's algorithms represent arrays as mdspan.
P1673 defines no extension point mechanism to let algorithms take more general things.
Design principle: P1673 is "the BLAS in C++," not a general algebra system. For example:
mdspan's accessor has no knowledge of the multidimensional index pack given to the layout mapping; it doesn't know "where it is in the matrix."
Thus, mdspan cannot represent a diagonally scaled matrix (where each column or row is scaled by a different scaling factor). This has implications for the batched generalization of P1673, P2901R0.
Thus, mdspan alone cannot represent an unpacked Hermitian or triangular matrix (where the interpretation of an element's value depends on its index (i,j)), though algorithms can interpret a matrix that way. This is why P1673 has different algorithms for symmetric_*, hermitian_*, and triangular_* cases.
F.2) Functions that change the accessor
Functions that change the mdspan's accessor have more complicated wording than functions that change the layout. The mechanism is like that of atomic_accessor* (P2689R2):
Define a custom accessor
Custom accessor uses a proxy reference to transform value
Proxy reference has overloaded arithmetic operators (like atomic_ref*)
We do this twice:
Function     Accessor             Proxy reference
scaled       accessor_scaled      scaled_scalar
conjugated   accessor_conjugate   conjugated_scalar
Accessors are not exposition-only
The accessors that use these proxy reference types are not exposition-only. This lets users write custom algorithms that can optimize by knowing that an mdspan comes from conjugated and scaled results, just as P1673's implementations can optimize for this case.
Proxy reference design goals
Work with users' custom value types (or proxy reference types, or expression templates), without those types needing overloaded arithmetic operators for P1673's proxy reference types
e.g., add(scaled(alpha, x), y, z) where y's reference is atomic_ref<float>
Preserve arithmetic order for nested scaled expressions
scaled(alpha, scaled(beta, x)) should not be transformed into scaled(alpha * beta, x), EVEN THOUGH this may let implementations use an optimized BLAS
e.g., if x is a rank-1 mdspan whose value_type is double, and if sizeof(int) is 4 and double is IEEE 754 binary64, then scaled(1 << 20, scaled(1 << 20, x)) does not overflow int, but scaled((1 << 20) * (1 << 20), x) would overflow int
e.g., alpha = uint64_t(INT_MAX/2), beta = INT_MAX/2 + 1, and x[0] = 2
Algorithms' arithmetic expressions must work for arbitrary combinations and nestings of proxy reference types, as long as the underlying arithmetic makes sense
Nesting of scaled and/or conjugated expressions is not a typical use case (P1673 is not a computer algebra system), but should be correct.
Avoid duplicated wording, especially for overloaded arithmetic operators.
P1673R9 solved these problems effectively by specifying an implementation (proxy-reference). We're not attached to this wording approach.
Key features of proxy-reference
Three overloads per binary arithmetic operator
Three overloads per binary arithmetic operator, each with different constraints, ensure that expressions mixing the same or different proxy reference types are unambiguous.
(a) constexpr friend auto operator+(derived_type lhs, Rhs rhs), with Rhs one of our proxy references.
It converts "the other proxy reference" to its value_type immediately, so users' value_types, proxy references, or expression templates don't need to overload arithmetic operators for P1673's proxy reference types.
Using its value_type, rather than mine, preserves precision in mixed-precision expressions. For example, if Rhs::value_type is double and my value_type is float, converting to my value_type (float) would lose precision.
(b) constexpr friend auto operator+(derived_type lhs, Rhs rhs), with Rhs NOT one of our proxy references, uses rhs directly. (This requires overloaded arithmetic operators for (value_type(lhs), rhs). See below for other options.)
(c) constexpr friend auto operator+(Lhs lhs, derived_type rhs), with Lhs NOT one of our proxy references, is the order-reversed version of (b).
abs, real, imag, and conj overloads
The abs, real, imag, and conj overloads are part of a system in P1673 for working around issues with these existing functions in the Standard, while respecting the convention that existing generic linear algebra libraries use and overload these functions.
proxy-reference wording errata
Replace enable_if_t with constraints (we've tested that this builds with GCC 13.1.0)
proxy-reference wording questions and our answers
Here, we preempt some possible questions about our wording choices.
"Deducing this"? No need
Only one "base class customization point," operator value_type()
Using deducing this would require making it a template
Other ways to specify arithmetic operators?
Base class proxy-reference lets us avoid ambiguity and code duplication.
We could repeat hidden friend arithmetic operators in each proxy reference class, but they would still need to be constrained to avoid ambiguity.
Base class CRTP design may have compile-time cost, but is it actually observable?
is_base_of_v<@_proxy-reference-base_@, Rhs> is just a way to say "this is a known proxy reference type with a public value_type alias."
P1673 does NOT attempt to define generic requirements for proxy references.
It only knows about scaled_scalar and conjugated_scalar.
This has functionality consequences. For example, proxy-reference's arithmetic operators do NOT convert atomic_ref<T> to T (value_type conversion) on input to their arithmetic expressions. (Only scaled_scalar and conjugated_scalar get that treatment.) Therefore, atomic_ref<complex<T>> CANNOT be used in P1673's algorithms, as atomic_ref<complex<T>> lacks overloaded arithmetic operators (unlike atomic_ref<Real>). P1673 accepts this, based on the principle that "P1673 is not a computer algebra system" and that it should defer to the behavior of its input types.
It's tempting to change is_base_of_v<@_proxy-reference-base_@, Rhs> constraint to requires { typename Rhs::value_type; }, but this won't work in general.
While atomic_ref and generalizations in P2689 have value_type alias, these were designed for use with mdspan, where value_type has a particular expected meaning.
Users' types may define value_type differently: e.g., a hypothetical custom bigfloat may store its bits in uint128_t, have a value_type = uint128_t alias, and have a (perhaps ill-advised) conversion operator to value_type.
vector<bool>::reference has no value_type alias (perhaps it should? but it's existing precedent and may have inspired other designs)
G) Tags
upper_triangle_t
lower_triangle_t
implicit_unit_diagonal_t
explicit_diagonal_t
These tags don't refer to layouts.
Instead, they tell algorithms how to interpret data in place.
Idiomatic BLAS and LAPACK usage reinterprets a matrix's "properties" in place,
as a result of algorithms that operate in place on the matrix.
Examples:
LU factorization replaces the contents of a (possibly nonsymmetric) square matrix A with its L and U factors. triangular_matrix_vector_solve with the same mdspan reads either L or U, depending on the tag.
Cholesky with lower_triangle would operate on a symmetric matrix in place, and produce a triangular factor. It reads and writes only the lower triangle (relying on A[i, j] == A[j, i] for the rest), and the interpretation of the array becomes "lower triangular matrix."
H) Layouts: layout_blas_packed
Packed triangular, symmetric, or Hermitian layouts.
P1673R12 LWG presentation
A) Previous WG21 talks and proposals
B) Brief summary
Algorithms operating on mdspan "viewing" users' data
mdspan are to P1673's algorithms as iterators are to existing Standard algorithms
Extends the BLAS (Basic Linear Algebra Subprograms)
Optional ExecutionPolicy&& overloads
Extensible