SparrowLii / vectorization

auto vectorization for rust

Project status #2

Closed cuihantao closed 1 year ago

cuihantao commented 1 year ago

Hello!

I'm new to Rust and found this repository from here: https://github.com/rust-ndarray/ndarray/issues/46. Are the changes in this repository going to make it into the compiler? That would save a significant amount of effort in vectorization.

Thanks

SparrowLii commented 1 year ago

Thanks for your attention! But I don't think ndarray and the compiler have much to do with each other here. Vectorization in the compiler is oriented towards more general numerical computation and does not need scenario-specific functionality such as ndarray.

SparrowLii commented 1 year ago

About the project status, there was some previous discussion on the Rust internals forum: https://internals.rust-lang.org/t/mir-optimization-pass-that-implements-auto-vectorization/16360

In general, the community thinks that automatic vectorization should be the job of LLVM and not of rustc.

cuihantao commented 1 year ago

Thank you for letting me know!

cuihantao commented 1 year ago

I later used Vec from the stdlib instead of ndarray for element-wise multiplication. To my surprise, the compiler vectorizes the code very well. Code as simple as the following just works with SIMD.


    pub fn g_update(&mut self) -> &Self{
        for (dest, p1, p2, p3, p4, p5) in izip!(
            &mut self.dest,
            &self.p1,
            &self.p2,
            &self.p3,
            &self.p4,
            &self.p5
        ) {
            *dest = p1 * p2 * p3 * p4 * p5;
        }
        self
    }

The generated asm is below:

.LBB20_9:  // major loop for packs of 4
 movupd  xmm0, xmmword ptr [r8 + 8*rbx]
 movupd  xmm1, xmmword ptr [r8 + 8*rbx + 16]
 movupd  xmm2, xmmword ptr [r9 + 8*rbx]
 mulpd   xmm2, xmm0
 movupd  xmm0, xmmword ptr [r9 + 8*rbx + 16]
 mulpd   xmm0, xmm1
 movupd  xmm1, xmmword ptr [r10 + 8*rbx]
 mulpd   xmm1, xmm2
 movupd  xmm2, xmmword ptr [r10 + 8*rbx + 16]
 mulpd   xmm2, xmm0
 movupd  xmm0, xmmword ptr [rdi + 8*rbx]
 mulpd   xmm0, xmm1
 movupd  xmm1, xmmword ptr [rdi + 8*rbx + 16]
 mulpd   xmm1, xmm2
 movupd  xmm2, xmmword ptr [rsi + 8*rbx]
 mulpd   xmm2, xmm0
 movupd  xmm0, xmmword ptr [rsi + 8*rbx + 16]
 mulpd   xmm0, xmm1
 movupd  xmmword ptr [r14 + 8*rbx], xmm2
 movupd  xmmword ptr [r14 + 8*rbx + 16], xmm0
 add     rbx, 4
 cmp     r11, rbx
 jne     .LBB20_9
 cmp     r15, r11
 je      .LBB20_16
.LBB20_11:  // remaining entries
 mov     rcx, r11
 or      rcx, 1
 test    r15b, 1
 je      .LBB20_13
 movsd   xmm0, qword ptr [r8 + 8*r11]
 mulsd   xmm0, qword ptr [r9 + 8*r11]
 mulsd   xmm0, qword ptr [r10 + 8*r11]
 mulsd   xmm0, qword ptr [rdi + 8*r11]
 mulsd   xmm0, qword ptr [rsi + 8*r11]
 movsd   qword ptr [r14 + 8*r11], xmm0
 mov     r11, rcx

I haven't gotten ndarray to vectorize other than by using Zip or azip!, which is limited to six operands. The standard Vec works fine for my purpose. I'm posting this for reference, though more than likely you are already aware of it.
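
For anyone who wants to reproduce this without the itertools dependency, here is a minimal self-contained sketch of the same pattern using only std iterators. The struct name `Model` and the `f64` element type are my assumptions (the `mulpd`/`mulsd` instructions above suggest doubles); only the field names and `g_update` come from the snippet. Whether LLVM actually vectorizes the loop depends on the optimization level and target, so treat it as something to check in the generated asm of a release build rather than a guarantee.

```rust
/// Hypothetical container; only the field names come from the snippet above.
pub struct Model {
    pub dest: Vec<f64>,
    pub p1: Vec<f64>,
    pub p2: Vec<f64>,
    pub p3: Vec<f64>,
    pub p4: Vec<f64>,
    pub p5: Vec<f64>,
}

impl Model {
    /// Element-wise product of the five inputs, written into `dest`.
    /// Nested `zip` calls stand in for `izip!`; iteration stops at the
    /// shortest of the six vectors.
    pub fn g_update(&mut self) -> &Self {
        let inputs = self
            .p1
            .iter()
            .zip(&self.p2)
            .zip(&self.p3)
            .zip(&self.p4)
            .zip(&self.p5);
        for (dest, ((((p1, p2), p3), p4), p5)) in self.dest.iter_mut().zip(inputs) {
            *dest = p1 * p2 * p3 * p4 * p5;
        }
        self
    }
}
```

Because the iterators are zipped, there is no explicit indexing inside the loop body, which is probably part of why the optimizer handles it well.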