flame / blis

BLAS-like Library Instantiation Software Framework

cspan: Suggesting an ergonomic C99 API with multidimensional matrices for BLIS #772

Closed: tylov closed this 9 months ago

tylov commented 9 months ago

This is more a question of whether there is interest in a small, fast, and generic C99 implementation of a NumPy-like multidimensional array-view API (currently a single header file, ~200 LOC)? I am not currently a BLIS user, but I gathered it may be of interest to C users of BLIS. It does not currently compile as C++ (that can be done), but C is the primary target. It supports

Note that C++23 std::mdspan will not support the last two bullets at all (maybe in C++26). The API could possibly be adapted at some level as a "contribution API" for making it more convenient to use BLIS directly from C, by extending it to wrap BLIS function calls, as shown in the example below. I am not aware of any similar C library with this combination of features, ergonomics, and small footprint. I made a recent Reddit post here. The example below is a rewrite of the example in the C++ std::mdspan proposal.

// C99:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>
#include "cspan.h"

using_cspan3(Mat, float); // shorthand for defining Mat, Mat2, and Mat3

typedef Mat2 OutMat;
typedef struct { Mat2 m00, m01, m10, m11; } Partition;

// Split A into four quadrant views (a 2x2 block partition); no data is copied.
Partition partition(Mat2 A)
{
  int32_t M = A.shape[0];
  int32_t N = A.shape[1];
  return (Partition){
    .m00 = cspan_slice(Mat2, &A, {0, M/2}, {0, N/2}),
    .m01 = cspan_slice(Mat2, &A, {0, M/2}, {N/2, N}),
    .m10 = cspan_slice(Mat2, &A, {M/2, M}, {0, N/2}),
    .m11 = cspan_slice(Mat2, &A, {M/2, M}, {N/2, N}),
  };
}

#if 0
void base_case_matrix_product(Mat2 A, Mat2 B, OutMat C)
{
  cblas_sgemm(
    CblasColMajor, CblasNoTrans, CblasNoTrans,
    C.shape[0], C.shape[1], A.shape[1], 1.0f,
    A.data, A.stride.d[1], B.data, B.stride.d[1], 1.0f,
    C.data, C.stride.d[1]);
}
#else
// Slow generic implementation
void base_case_matrix_product(Mat2 A, Mat2 B, OutMat C)
{
  for (int j = 0; j < C.shape[1]; ++j) {
    for (int i = 0; i < C.shape[0]; ++i) {
      Mat2_value C_ij = 0;
      for (int k = 0; k < A.shape[1]; ++k) {
        C_ij += *cspan_at(&A, i,k) * *cspan_at(&B, k,j);
      }
      *cspan_at(&C, i,j) += C_ij;
    }
  }
}
#endif

void recursive_matrix_product(Mat2 A, Mat2 B, OutMat C)
{
  // Some hardware-dependent constant
  enum {recursion_threshold = 32};
  if (C.shape[0] <= recursion_threshold || C.shape[1] <= recursion_threshold) {
    base_case_matrix_product(A, B, C);
  } else {
    Partition c = partition(C),
              a = partition(A),
              b = partition(B);
    recursive_matrix_product(a.m00, b.m00, c.m00);
    recursive_matrix_product(a.m01, b.m10, c.m00);
    recursive_matrix_product(a.m10, b.m00, c.m10);
    recursive_matrix_product(a.m11, b.m10, c.m10);
    recursive_matrix_product(a.m00, b.m01, c.m01);
    recursive_matrix_product(a.m01, b.m11, c.m01);
    recursive_matrix_product(a.m10, b.m01, c.m11);
    recursive_matrix_product(a.m11, b.m11, c.m11);
  }
}

int main(void)
{
  enum {N = 10, D = 256};
  float* values = malloc(N * D * D * sizeof values[0]);
  float out[D * D];

  Mat3 data = cspan_md(values, N, D, D); // default row-major
  c_foreach (i, Mat3, data)
    *i.ref = ((double)rand()/RAND_MAX - 0.5)*4.0;

  OutMat c = cspan_md_layout(c_COLMAJOR, out, D, D); // column-major output view
  Mat2 a = cspan_submd3(&data, 0);  // 2-D view of 3-D slice 0 (no copy)
  cspan_transpose(&a);              // transpose the view in place

  clock_t t = clock();
  for (int i=1; i<N; ++i) {
    Mat2 b = cspan_submd3(&data, i);
    cspan_transpose(&b);
    memset(out, 0, sizeof out);
    //base_case_matrix_product(a, b, c);
    recursive_matrix_product(a, b, c);  // 2x faster -O3
  }
  t = clock() - t;

  double sum = 0.0;
  c_foreach (i, Mat2, c) sum += *i.ref;  // linear iterate md span.
  printf("sum=%.16g, %f ms\n", sum, (double)t*1000.0/CLOCKS_PER_SEC);
  free(values);
}
jeffhammond commented 9 months ago

This is neat, but does it need to be part of BLIS or can it live as an independent project that uses BLIS and potentially CBLAS?

mhoemmen commented 9 months ago

Note that C++23 std::mdspan will not support the two last bullets at all (maybe in C++26).

FYI, submdspan ("slicing / subview capabilities") was voted into the C++26 draft ( see status tracker: https://github.com/cplusplus/papers/issues/1293 ).

Both the reference implementation of mdspan ( https://github.com/kokkos/mdspan ), and NVIDIA's implementation of mdspan in the HPC SDK, have always provided the version of submdspan that was in P0009 (that works for layout_left, layout_right, and layout_stride). The version of submdspan that was voted into the C++26 draft lets users make slicing work for their custom layouts.

mhoemmen commented 9 months ago

I share Jeff's sentiment that this is neat, btw :-).

tylov commented 9 months ago

Thanks to both of you! Yes, cspan obviously has fewer features than the proposed mdspan slicing. After all, they spent nearly a decade on it. But I think cspan will still cover a lot of use cases, and it is even less verbose and simpler to use than std::mdspan in some cases.

This is neat, but does it need to be part of BLIS or can it live as an independent project that uses BLIS and potentially CBLAS?

Even though cspan is a self-contained, generic library, my assumption was that it is most useful in the context of BLIS/CBLAS, although dynamic multidimensional arrays have other useful applications.

It surely doesn't need to be part of BLIS, but many hesitate to use libraries found in the wild, particularly ones that have not first been approved/recommended by the community or in the documentation (it is also just a single small file currently). My thought was that it could be a foundation for an easier (and maybe less bug-prone?) way to call BLIS/CBLAS functionality, provided that someone added wrapper functions that take cspan matrices as arguments.
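To make that idea concrete, a wrapper of this kind might look like the sketch below. It is only an illustration: mat_sgemm is a hypothetical name, and it assumes column-major Mat2/OutMat views with the same A.stride.d[1] leading-dimension convention used in the CBLAS branch of the example above.

#include <assert.h>
#include <cblas.h>

// Hypothetical wrapper (not part of BLIS or cspan): C = alpha*A*B + beta*C
// for column-major float views, with the shape checks a raw CBLAS call omits.
static void mat_sgemm(float alpha, Mat2 A, Mat2 B, float beta, OutMat C)
{
  assert(A.shape[1] == B.shape[0]);  // inner dimensions must agree
  assert(C.shape[0] == A.shape[0]);
  assert(C.shape[1] == B.shape[1]);
  cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
              C.shape[0], C.shape[1], A.shape[1],
              alpha, A.data, A.stride.d[1],
                     B.data, B.stride.d[1],
              beta,  C.data, C.stride.d[1]);
}

The shape assertions are exactly the kind of bookkeeping a cspan-aware wrapper could take off the caller's hands compared with calling cblas_sgemm directly.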

mhoemmen commented 9 months ago

... and it is even less verbose and simpler to use than std::mdspan in some cases.

Unfortunately, WG21 did not bless our multiple efforts to reduce mdspan's verbosity. For example, we suggested a small change to the language's definition of incomplete type in order for us to spell extents more concisely, but we didn't have enough experience at the time to show that it wouldn't break something. We also suggested more concise spellings of dynamic_extent and full_extent in other meetings, to no avail.

That being said, mdspan is very much a C++ library, with C++ ideas about customization. C should do things the C way; Fortran should do things the Fortran way. Nevertheless, it should be straightforward for this library to interact with mdspan. If you store the pointer first, the extents second, and the strides last, it should even be bitwise copyable into an mdspan with the matching layout and run-time extents (with the caveat that C++ doesn't yet have P2642, which would provide for BLAS-like or overaligned layouts).
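As a purely illustrative reading of that layout suggestion, a C view struct ordered that way could look like the sketch below; view2d_f is a hypothetical name, and whether cspan's Mat2 actually matches an mdspan byte for byte depends on both implementations.

#include <stddef.h>

// Hypothetical 2-D view laid out in the order described above:
// data pointer first, run-time extents second, strides last.
typedef struct {
  float  *data;       // pointer to the first element
  size_t  extent[2];  // run-time extents (rows, cols)
  size_t  stride[2];  // element strides per dimension
} view2d_f;

// The C++ counterpart would be something like
//   std::mdspan<float, std::dextents<size_t, 2>, std::layout_stride>
// and the "bitwise copyable" idea above assumes both sides agree on member
// order, integer types, and padding.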

tylov commented 9 months ago

I have wondered if some of the syntax could have been a little less verbose. Speaking as a (former) C++ developer of two decades, I find mdspan an impressive software engineering achievement, in particular getting it through the needle's eye of the committee.

While we are talking about std::mdspan, I believe Bryce Lelbach mentioned in one of his talks that joined/flattened iteration was left out because of lacking performance, or compilers' inability to optimize it. I think this should be revisited for C++26, because with cspan, joined iteration is only 50% slower than raw nested loops with clang 16.0.6 on my hardware. In this case, I iterate over two sliced 3D spans and add the elements of one to the other. With gcc 13.2 it is less than 2x slower. These loops are already incredibly fast, so 2x is still very fast and useful. The benchmark is here.

clang 16.0.6 (win11)
forloop : 49.0 ms, 6816804761602.707031
joined  : 74.0 ms, 6816804761602.707031
nested  : 49.0 ms, 6816804761602.707031

gcc 13.2 (win11)
forloop : 54.0 ms, 6816804761602.707031
joined  : 93.0 ms, 6816804761602.707031
nested  : 49.0 ms, 6816804761602.707031
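For readers unfamiliar with cspan, the sketch below shows roughly the two styles being compared, written as a simple reduction over a single span so that it uses only constructs from the example above; the actual benchmark adds one sliced 3D span into another and also measures a raw pointer loop ("forloop").

// "Joined" (flattened) iteration: one linear pass over the whole 3-D view.
double sum_joined(Mat3 s)
{
  double sum = 0.0;
  c_foreach (it, Mat3, s)
    sum += *it.ref;
  return sum;
}

// Nested index loops over the same view, via cspan_at.
double sum_nested(Mat3 s)
{
  double sum = 0.0;
  for (int i = 0; i < s.shape[0]; ++i)
    for (int j = 0; j < s.shape[1]; ++j)
      for (int k = 0; k < s.shape[2]; ++k)
        sum += *cspan_at(&s, i, j, k);
  return sum;
}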
mhoemmen commented 9 months ago

@tylov Thanks for your kind words about mdspan! It was a long process.

Usually when C++ developers think of "iteration," they think of iterators (or C++20 ranges, which are iterator-like). Those are things that look like pointers, so that *p++ accesses the current element and goes on to the next one. An mdspan iterator would act like a "pointer to the current element," so that for (p = x.begin(); p != x.end(); ++p) { f(*p); } calls f on each element of the multidimensional array.

Iterators iterate over the range of the mdspan. The domain of the mdspan is its extents, the object that expresses the array's bounds. C++26 already has multidimensional iteration over extents via ranges (views::cartesian_product and views::iota).

It's rare that I see code that iterates over the elements of a single mdspan. Usually, code either accesses multiple mdspan with the same domain, or is a "stencil loop" (that reads elements other than the current element, and modifies the current element). In both those cases, iterating over the extents is more helpful than iterating over the elements.
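In cspan terms (with a hypothetical helper name), "iterating over the extents" of two views that share a domain would look roughly like this sketch, reusing only Mat2 and cspan_at from the example above:

// Sketch: loop over the shared index domain (the extents), not over the
// elements of either view; both A and B are addressed by the same (i, j).
// Computes B += alpha * A, assuming equal shapes.
void axpy_2d(float alpha, Mat2 A, Mat2 B)
{
  for (int i = 0; i < A.shape[0]; ++i)
    for (int j = 0; j < A.shape[1]; ++j)
      *cspan_at(&B, i, j) += alpha * *cspan_at(&A, i, j);
}

A stencil loop has the same structure: the body reads neighboring indices such as (i-1, j) or (i, j+1) while writing only (i, j).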

What's currently missing in C++ is parallel ranges. The C++17 parallel algorithms (e.g., for_each, reduce, and transform overloads whose first parameter is an ExecutionPolicy&&) don't currently work with the views::cartesian_product + views::iota range that expresses iteration over extents.

Bryce actually wrote a paper on multidimensional iteration a few years ago. He found that flattening via iterators didn't work so well, because multidimensional iterators are stateful. Coroutines actually worked better.

I've written a bit on implementing generic mdspan iteration in this reference mdspan issue, and written a code example in this pull request.

devinamatthews commented 9 months ago

This is already pretty far off-topic, but here is a quick plug for a library I wrote called MArray. It is very much like mdspan but supports owning and non-owning (view) containers, sub-slicing, iteration, both fixed- and variable-dimension arrays, expression templates, etc.

tylov commented 9 months ago

Thanks guys, very informative. I'll take a look at these. And yes, we are off-topic, so you may close the issue/discussion.

rvdg commented 9 months ago

May I suggest this discussion be continued on the BLIS discord server?