dimforge / nalgebra

Linear algebra library for Rust.
https://nalgebra.org

SIMD Integration #502

Open sebcrozet opened 5 years ago

sebcrozet commented 5 years ago

This is a revival of #27 and #217, and a follow-up to that comment. We need to figure out what we can do to use explicit SIMD in order to improve the overall performance of nalgebra.

Once rusty-machine is integrated into nalgebra (#498), we could use it for benchmarking to see whether our optimizations have any impact on some real-world machine-learning operations.

We should also keep in mind the ffsvm crate as an example of an application requiring peak performance for some linalg operations, and the simd_aligned crate as an example of an interesting design for a SIMD-friendly data structure backing the storage of a matrix or vector.

Here are some tasks we should start with to get some measurements that will serve as references for our optimizations. This will be useful to guide us through our work:

ralfbiedert commented 5 years ago

For ffsvm (and SVMs in general, I would argue) most of the CPU time is spent inside one loop in the kernel. I don't recall the exact numbers, but for our actual production model I think it was in the ballpark of 80% for this code:

// Accumulate the squared distance between each support vector and the
// feature vector in SIMD lanes, then reduce and apply the kernel's exp().
for (i, sv) in vectors.row_iter().enumerate() {
    let mut sum = f32s::splat(0.0);
    let feature: &[f32s] = &feature;

    for (a, b) in sv.iter().zip(feature) {
        sum += (*a - *b) * (*a - *b);
    }

    // Horizontal sum across the SIMD lanes, then exp(-gamma * distance²).
    output[i] = f64::from((-self.gamma * sum.sum()).exp());
}

Here vectors is a "list of vectors", and each vector (sv and feature) is a [f32x8] (or similar). Each f32x8 is a std::simd / packed_simd packed float type that is guaranteed to be properly aligned for SIMD operations, and it has overloaded operators (+, *, .sum()) that map to native SIMD intrinsics where available.
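For reference, a standalone version of that hot loop might look like the sketch below; the `f32s` alias and the packed_simd dependency are assumptions on my part, not ffsvm's actual types.

```rust
// A minimal, self-contained sketch of the hot loop above, assuming the
// packed_simd crate; `f32s` here is just an alias for its f32x8 type.
use packed_simd::f32x8;

type f32s = f32x8;

/// Squared Euclidean distance between two equally long, SIMD-chunked slices.
fn squared_distance(sv: &[f32s], feature: &[f32s]) -> f32 {
    let mut sum = f32s::splat(0.0);
    for (a, b) in sv.iter().zip(feature) {
        let d = *a - *b;
        sum += d * d; // 8 lanes accumulated in parallel
    }
    sum.sum() // horizontal reduction of the 8 lanes
}
```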

As a developer, what I would need for ffsvm is roughly:

I think my benchmark baseline would be the equivalent of

    for (a, b) in sv.iter().zip(feature) {
        sum += (*a - *b) * (*a - *b);
    }

implemented in nalgebra, with the same performance as when it is implemented manually via std::simd.

Edit: I just tried to implement a naive benchmark and realized it's not that simple, since I'd have to deal with DVectors, which seem to be heap-allocated. That means all temporaries would also be heap-allocated. While it's fine to allocate ahead of time, there probably can't be any allocation in the hot loop.
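For what it's worth, the accumulation itself can be written over preallocated DVectors without creating temporaries; a sketch (plain iterators, relying on auto-vectorization rather than explicit SIMD):

```rust
use nalgebra::DVector;

/// Squared Euclidean distance computed lazily: no `(a - b)` vector is
/// materialized, so nothing is heap-allocated inside the loop.
fn squared_distance(a: &DVector<f32>, b: &DVector<f32>) -> f32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x - y) * (x - y))
        .sum()
}
```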

ralfbiedert commented 5 years ago

Decide what crate we want to use for SIMD manipulation: faster, SIMDeez, std::simd, packed_simd?

Some thoughts:

If std::simd is still a thing, I would probably investigate that one first, as it might come with free cross-platform compatibility.
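Whichever crate wins, one way to keep the choice reversible while evaluating them would be to hide the backend behind a single alias; a purely illustrative sketch (the feature names are made up):

```rust
// Hypothetical: pick the SIMD backend in one place so the rest of the
// code only ever refers to `f32s`.
#[cfg(feature = "backend_packed_simd")]
pub type f32s = packed_simd::f32x8;

// std::simd is nightly-only and needs #![feature(portable_simd)].
#[cfg(all(feature = "backend_std_simd", not(feature = "backend_packed_simd")))]
pub type f32s = std::simd::f32x8;
```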

jswrenn commented 5 years ago

A thought: nalgebra's matrix types aren't always backed by fully contiguous storage. There aren't any gaps between elements if the Matrix is backed by a VecStorage or ArrayStorage, but there can be gaps if it's backed by a SliceStorage. I suspect this will be a barrier to a straightforward drop-in of SIMD to accelerate matrix operations. We could get around this by lowering the actual implementations of the operations we want to accelerate to the Storage level; that way we could provide specialized implementations for each of the three storage types.
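To illustrate what lowering to the Storage level could look like: contiguous storage hands a kernel one flat slice, while a strided view has to loop column by column. The trait and type names below are hypothetical, not existing nalgebra API.

```rust
// Hypothetical storage-level dispatch; all names here are made up.
trait SimdSum {
    fn simd_sum(&self) -> f32;
}

// Contiguous storage (VecStorage, ArrayStorage): one flat slice,
// exactly what a SIMD kernel wants to consume in wide chunks.
struct Contiguous<'a>(&'a [f32]);

impl SimdSum for Contiguous<'_> {
    fn simd_sum(&self) -> f32 {
        // A real implementation would process this slice in f32x8 chunks.
        self.0.iter().sum()
    }
}

// A strided view (SliceStorage): gaps between columns force a
// column-by-column fallback (or per-column SIMD at best).
struct Strided<'a> {
    data: &'a [f32],
    rows: usize,
    cols: usize,
    col_stride: usize,
}

impl SimdSum for Strided<'_> {
    fn simd_sum(&self) -> f32 {
        (0..self.cols)
            .flat_map(|c| {
                let start = c * self.col_stride;
                self.data[start..start + self.rows].iter()
            })
            .sum()
    }
}
```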

HadrienG2 commented 5 years ago

@jswrenn Note that non-contiguous storage can also be purposely used to improve SIMD performance on small matrices. For example, if you pad matrix rows to a multiple of the SIMD vector length, and are careful not to introduce NaNs and other IEEE 754 silliness in the padding, you can implement a very fast small-matrix multiplication.
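As a concrete illustration of that padding idea (a sketch assuming packed_simd's f32x4; this is not how nalgebra stores matrices): each row of a 3x3 matrix is padded to one f32x4 with a zero in the last lane, and a matrix product then stays entirely in SIMD registers.

```rust
use packed_simd::f32x4;

/// 3x3 matrix stored row-major, each row padded to a full f32x4.
/// The padding lane holds 0.0, so it never injects NaN/Inf into results.
#[derive(Clone, Copy)]
struct PaddedMat3 {
    rows: [f32x4; 3],
}

impl PaddedMat3 {
    fn mul(self, rhs: PaddedMat3) -> PaddedMat3 {
        let mut rows = [f32x4::splat(0.0); 3];
        for i in 0..3 {
            // C[i] = A[i][0]*B[0] + A[i][1]*B[1] + A[i][2]*B[2],
            // computed on whole rows at once.
            rows[i] = f32x4::splat(self.rows[i].extract(0)) * rhs.rows[0]
                + f32x4::splat(self.rows[i].extract(1)) * rhs.rows[1]
                + f32x4::splat(self.rows[i].extract(2)) * rhs.rows[2];
        }
        PaddedMat3 { rows }
    }
}
```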

In general, SIMD is very sensitive to data layout, which makes abstraction design harder.

geoeo commented 5 years ago

Having "basic SIMD math" support for vector operations (+, -, *, .sum() ...)

I just want to put my two cents in here. One bottleneck in my application seems to be matrix multiplication and, to some extent, addition. SIMD-accelerated matrix multiplication would on its own be amazing.

These features alone would make nalgebra a lot more attractive for scientific computing.

Jerboas86 commented 4 years ago

Decide what crate we want to use for SIMD manipulation: faster, SIMDeez, std::simd, packed_simd?

SIMD intrinsics for the wasm32 target are accessible via core::arch::wasm32, and I don't think the crates above support this specific target. The wasm SIMD spec is in phase 3, and Rust support can be followed in #74372. I'd love to see SIMD support for wasm implemented in nalgebra.
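For reference, the squared-distance accumulation written directly against core::arch::wasm32 could look roughly like this sketch (built with the simd128 target feature; slice lengths assumed equal and a multiple of 4):

```rust
#[cfg(target_arch = "wasm32")]
#[target_feature(enable = "simd128")]
unsafe fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    use core::arch::wasm32::*;

    let mut sum = f32x4_splat(0.0);
    for (ca, cb) in a.chunks_exact(4).zip(b.chunks_exact(4)) {
        // wasm v128 loads tolerate unaligned pointers.
        let va = v128_load(ca.as_ptr() as *const v128);
        let vb = v128_load(cb.as_ptr() as *const v128);
        let d = f32x4_sub(va, vb);
        sum = f32x4_add(sum, f32x4_mul(d, d));
    }
    // Horizontal reduction of the four lanes.
    f32x4_extract_lane::<0>(sum)
        + f32x4_extract_lane::<1>(sum)
        + f32x4_extract_lane::<2>(sum)
        + f32x4_extract_lane::<3>(sum)
}
```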

entropylost commented 3 years ago

I would really like to have a SIMD Vector3 without having to refactor all the code to use an AoSoA (array-of-structs-of-arrays) layout.