Missed optimization opportunity in Simd::new

farnoy commented 4 years ago

I'm trying to initialize a Simd<f32x4> value from memory, but the safe way of doing it does a function call.

On simba 0.2.0, using the following rust code:

#[inline(never)]
pub fn exact_chonker_1(xs: &[f32]) -> simd::f32x4 {
    let mut acc = simd::f32x4::splat(1.0);
    let mut chunk_iter = xs.chunks_exact(4);

    for chunk in &mut chunk_iter {
        acc *= simd::f32x4::new(chunk[0], chunk[1], chunk[2], chunk[3]);
    }

    let mut r = simd::f32x4::splat(9999.0);
    let mut mask = simd::m32x4::splat(false);

    for (ix, item) in chunk_iter.remainder().iter().enumerate() {
        r.replace(ix, *item);
        mask.replace(ix, true);
    }

    acc * r.select(mask, simd::f32x4::splat(0.0))
}

#[inline(never)]
pub fn exact_chonker(xs: &[f32]) -> simd::f32x4 {
    let mut acc = simd::f32x4::splat(1.0);
    let mut chunk_iter = xs.chunks_exact(4);

    for chunk in &mut chunk_iter {
        acc *= simd::f32x4::from(*unsafe { std::mem::transmute::<*const f32, &[f32; 4]>(chunk.as_ptr()) });
    }

    let mut r = simd::f32x4::splat(9999.0);
    let mut mask = simd::m32x4::splat(false);

    for (ix, item) in chunk_iter.remainder().iter().enumerate() {
        r.replace(ix, *item);
        mask.replace(ix, true);
    }

    acc * r.select(mask, simd::f32x4::splat(0.0))
}

Analyzing the assembly, I'm seeing the initialization for f32x4::new() work like this:

 mov     r14, qword, ptr, [rip, +, _ZN5simba4simd16packed_simd_impl61Simd$LT$packed_simd..Simd$LT$$u5b$f32$u3b$$u20$4$u5d$$GT$$GT$3new17h20fdaa67639b0938E@GOTPCREL]
 xor     ebp, ebp
 lea     r12, [rsp, +, 64]
.LBB10_4:
 vmovaps xmmword, ptr, [rsp, +, 48], xmm3
 vmovd   xmm0, dword, ptr, [r13, +, 4*rbp]
 vmovd   xmm1, dword, ptr, [r13, +, 4*rbp, +, 4]
 vmovss  xmm2, dword, ptr, [r13, +, 4*rbp, +, 8]
 vmovss  xmm3, dword, ptr, [r13, +, 4*rbp, +, 12]
 mov     rdi, r12
 call    r14
 vmovaps xmm3, xmmword, ptr, [rsp, +, 48]

So the compiler seems to really want to call this Simd::new method, despite not using the results (xmm0-2 are not used as source operands later).

With the 2nd variant of the code, as far as I can tell the first loop just translates to:

.LBB11_4:
 vmulps  xmm0, xmm0, xmmword, ptr, [rsi, +, 4*rcx]
 add     rcx, 4
 cmp     rdx, rcx
 jne     .LBB11_4

Which is so much cleaner.

Is there anything you could think of that is inhibiting this for ::new()? Also, is there a better way to initialize from packed memory with safe rust?

sebcrozet commented 4 years ago

Hi!

The fact the compilers really wants to call ::new() may be caused by the fact that ::new is not marked as #[inline] there. Adding the issing #[inline]may fix this.

Also, is there a better way to initialize from packed memory with safe rust?

We could add a .from_slice_unaligned() to use instead of ::new. And perhaps it would make sense to add some kind of .from_iterator(iter, default) which fills the lanes of the f32x4 with the iterator content, or default when the iterator does not yield enough values to fill all the lanes.

farnoy commented 4 years ago

Thanks for the quick response!

I'll try the #[inline] idea soon. I'm new to these optimizations, do they only make sense for libraries where you want inlining across crates? Does the annotation also affect same-crate but different module function calls?

What would be the contract for from_slice_unaligned()? That sounds like reading from &[u8] where I'm interested in &[Self::Element], which would have to be aligned?

sebcrozet commented 4 years ago

I'll try the #[inline] idea soon. I'm new to these optimizations, do they only make sense for libraries where you want inlining across crates? Does the annotation also affect same-crate but different module function calls?

The #[inline] is mostly for inline across crates, yes. I think it can also affect inlining within the same crate when the compiler uses multiple codegen units.

What would be the contract for from_slice_unaligned()? That sounds like reading from &[u8] where I'm interested in &[Self::Element], which would have to be aligned?

That would be reading from &[Self::Element]. It would basically just call the corresponding from_slice_unaligned function from packed_simd. It is unaligned because we have no guarantee that &slice[0] is aligned to, e.g., a 16-byte boundary (i.e. align_of::<f32x4>()).

farnoy commented 4 years ago

Inlining helps, nice catch!

That would be reading from &[Self::Element]. It would basically just call the corresponding from_slice_unaligned function from packed_simd. It is unaligned because we have no guarantee that &slice[0] is aligned to, e.g., a 16-byte boundary (i.e. align_of::()).

Got it, I forgot that SIMD loads have stricter alignment requirements. In my experiments, rustc usually outputted the aligned versions of instructions, but I guess it's seen how the allocation is done to begin with.

All of this sounds good to me. I don't love the fact that these functions can panic. Will send a PR shortly

dimforge / simba

Missed optimization opportunity in Simd::new #11