Request: design description

grothesque commented 2 months ago

Thanks for sharing this very interesting library! I've been trying to understand its inner workings, but this is difficult since some central aspect seem both very clever (in a positive sense), and are not documented at all. I think that adding a "DESIGN.md" file (even a small one) may greatly increase the usefulness of this project to other people (be it possible contributors or members of the wider numeric Rust community).

For example, would you mind to describe how Span works? I note that it is a zero-sized struct, and only &Span are passed around. But what do these actually point to? And what is the role of RawSpan?

Or let's take Expr / Expression. I first thought that it's a view with a somewhat badly chosen name, but it seems to be much more than that. Are types that implement the trait Expression "expression templates" in C++ speak?

What is the relation of Expr and Span and why are both needed? You write that spans are like Rust slices, but slices are fat pointers, while spans seem to be just pointers that point to... something (that is owned by whom?).

It would be also interesting to have a brief discussion of the design aspects of C++'s mdspan/mdarray that you took over, and those that you didn't/couldn't and the reasons for that.

fre-hu commented 2 months ago

Thanks for the suggestion and comments! I have added design notes in the latest commit as a start. The layout types are omitted for now since I'm looking into if they can be improved.

grothesque commented 2 months ago

Very useful, many thanks.

Here's one thing that is still unclear to me: Would it be possible to modify the example program such that a or b could be created using the array! macro, i.e. with static size? Or in other words, is there some way to cast a static stack-allocated array into a DSpan?

If this is not possible, how would one write a (generic?) function that accepts input created with either expr! or array!?

grothesque commented 2 months ago

I think I found a solution myself. The following generic function accepts references to arrays as well:

pub fn matmul<SA, SB, SC>(a: &Span<f64, SA>, b: &Span<f64, SB>, c: &mut Span<f64, SC>)
where
    SA: Shape,
    SB: Shape,
    SC: Shape,
{
    for (mut cj, bj) in c.cols_mut().zip(b.cols()) {
        for (ak, bkj) in a.cols().zip(bj) {
            for (cij, aik) in cj.expr_mut().zip(ak) {
                *cij = aik.mul_add(*bkj, *cij);
            }
        }
    }
}

Here is my understanding why the original non-generic matmul function did not accept references to stack-allocated arrays: these arrays have constant sizes, so that there is nowhere an appropriate RawSpan that could be pointed to. Is this correct?

Still, is there a way to make an array!-allocated array work with the original matmul? That would probably require creating a view of it somehow (that would contain a RawSpan instance).

fre-hu commented 2 months ago

Yes, right. One way is to do a.reshape([2, 2]) or so which will change the shape type to have dynamic dimensions.

grothesque commented 2 months ago

One piece of information that I feel should be added somewhere is while there are both layouts and mappings. The way I understand it is that while types that implement the Layout trait (e.g. Dense) do not store any data, types that implement the Mapping trait (e.g. DenseMapping) do store data, i.e. the shape and the necessary strides.

So, if my understanding is correct, an Array (stack-allocated, static-sized array) instance does not keep a Mapping struct, while a Grid or Expr does.

grothesque commented 2 months ago

Yes, right. One way is to do a.reshape([2, 2]) or so which will change the shape type to have dynamic dimensions.

Thanks, this solution was not obvious to me. But it requires redundantly restating the shape, and panics if it doesn't match. The following seems to be more robust: a.reshape(a.shape().dims()), however it doesn't work with reshape_mut due to the borrow checker complaining.

Perhaps this should be a method, how about to_expr?

Or perhaps there should be even an equivalent impl of Borrow or AsRef? (This might be a bad idea, I don't have much experience with this.) In this way, the call to matmul would look the same no matter whether the arguments are borrowed as DSpan or as Span.

fre-hu commented 2 months ago

I added more conversions for Array to be consistent with normal arrays. It should work now with &Expr::from(&a). Let's see if it is sufficient.

grothesque commented 2 months ago

Here is an improved generic version of matmul that uses the type system to ensure that the static sizes (if used) do match:

pub fn matmul_generic<D0: Dim, D1: Dim, D2: Dim>(
    a: &Span<f64, (D0, D1)>,
    b: &Span<f64, (D1, D2)>,
    c: &mut Span<f64, (D0, D2)>,
) {
    for (mut cj, bj) in c.cols_mut().zip(b.cols()) {
        for (ak, bkj) in a.cols().zip(bj) {
            for (cij, aik) in cj.expr_mut().zip(ak) {
                *cij = aik.mul_add(*bkj, *cij);
            }
        }
    }
}

Is this use of mdarray "idiomatic"?

The above function has also the advantage of allowing the compiler to optimize the inner loop for a fixed number of iterations. For example, the following non-generic function

pub fn matmul_opt(a: &DSpan<f64, 2>, b: &DSpan<f64, 2>, c: &mut DSpan<f64, 2>) {
    let a = a.reshape((Const::<4>, Dyn(2)));
    let mut c = c.reshape_mut((Const::<4>, Dyn(2)));
    matmul_generic(&a, &b, &mut c);
}

is compiled to assembly that uses the fact that the inner loop has exactly four iterations. (I verified looking at the assembly generated by cargo rustc --release -- -Ctarget-feature=+fma --emit asm.) However, it only uses the FMA instruction vfmadd213sd and not its packed variant vfmadd213pd - I guess that's because the compiler cannot be sure about sufficient alignment. (I saw that there are ways to allocate Grid arrays with a specific alignment, so there seem to be ways.)

fre-hu commented 2 months ago

Yes having Dim parameters per dimension is the way to do it.

I did some testing, and it seems that the missed optimization is due to limitations in alias analysis. I got it to generate packed FMAs by breaking out the inner loop in a separate function. Somehow the Span references must be visible in function arguments for it to optimize.

If I increase the column length to 64 or make it dynamic, the compiler generates different code paths and checks for aliasing at runtime with the original function.

pub fn mul_col<D: Dim>(cj: &mut Span<f64, D>, ak: &Span<f64, D>, bkj: &f64)
{
    for (cij, aik) in cj.expr_mut().zip(ak) {
        *cij = aik.mul_add(*bkj, *cij);
    }
}

pub fn matmul_generic<D0: Dim, D1: Dim, D2: Dim>(
    a: &Span<f64, (D0, D1)>,
    b: &Span<f64, (D1, D2)>,
    c: &mut Span<f64, (D0, D2)>,
) {
    for (mut cj, bj) in c.cols_mut().zip(b.cols()) {
        for (ak, bkj) in a.cols().zip(bj) {
            mul_col(&mut cj, &ak, bkj);
        }
    }
}

pub fn matmul_opt(a: &DSpan<f64, 2>, b: &DSpan<f64, 2>, c: &mut DSpan<f64, 2>) {
    let a = a.reshape((Const::<4>, Dyn(2)));
    let mut c = c.reshape_mut((Const::<4>, Dyn(2)));
    matmul_generic(&a, &b, &mut c);
}

fre-hu / mdarray

Request: design description #1