LaihoE / SIMD-itertools

Faster implementations of standard library operations like find, filter, position etc.

Significantly improve performance by using `chunks_exact(SIMD_LEN)` and `remainder()` #10

Open · smu160 opened this issue 3 months ago

smu160 commented 3 months ago

The as_simd call is not guaranteed to give us the ideal prefix/middle/suffix split (where we know the middle contains everything we need). You'd probably have to guarantee some sort of alignment to make that work well. So we always have to check the prefix/suffix, and there's no guarantee that the middle will even have any elements. We can get around this by going back to the slice API and just using chunks_exact(SIMD_LEN), followed by checking chunks_exact(SIMD_LEN).remainder().

For example:

    fn all_equal_simd(&self) -> bool {
        let arr = self.as_slice();
        if arr.is_empty() {
            return true;
        }
        let first = arr[0];
        let simd_needle = Simd::splat(first);
        // Scalar tail: whatever doesn't fit into a full SIMD_LEN chunk.
        let remainder = arr.chunks_exact(SIMD_LEN).remainder();

        // Compare each full chunk against the splatted needle.
        for chunk in arr.chunks_exact(SIMD_LEN) {
            let simd_chunk = Simd::<T, SIMD_LEN>::from_slice(chunk);
            let mask = simd_chunk.simd_ne(simd_needle).to_bitmask();
            if mask != 0 {
                return false;
            }
        }

        // The few leftover elements are checked with scalar code.
        remainder.iter().all(|x| *x == first)
    }

You can see the generated asm for yourself on godbolt. You'll have to comment out my version and uncomment yours to see the difference in cycle count. Of course, the proof is in the pudding: using the benchmarks in PR #8, I got great preliminary results!

[Screenshot: preliminary benchmark results (2024-07-18)]

Let me know your thoughts. Thank you!!

LaihoE commented 3 months ago

The reason as_simd() gives the odd splits is to guarantee that the middle elements are aligned to SIMD_LEN, allowing us to use aligned SIMD loads. This should be a little faster than unaligned loads, especially with bigger N. On small N it is very possible that chunks_exact is faster.
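
For reference, here's a minimal sketch of what the as_simd route looks like, assuming nightly `portable_simd`; the function name, the `LANES` constant, and the `u8` element type are illustrative rather than the crate's actual API:

    // crate root: #![feature(portable_simd)]
    use std::simd::prelude::*;

    const LANES: usize = 16; // illustrative lane count

    fn all_equal_via_as_simd(arr: &[u8]) -> bool {
        let Some(&first) = arr.first() else { return true };
        let needle = Simd::<u8, LANES>::splat(first);

        // `middle` is guaranteed to be aligned for whole-vector loads, but the
        // unaligned `prefix` and `suffix` still need scalar checks, and `middle`
        // may be empty for short or unluckily aligned slices.
        let (prefix, middle, suffix) = arr.as_simd::<LANES>();

        prefix.iter().all(|x| *x == first)
            && middle.iter().all(|&v| v.simd_eq(needle).all())
            && suffix.iter().all(|x| *x == first)
    }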

I was also looking at the performance yesterday (it was much slower than memchr etc.) and noticed that the if statements complicate unrolling, so we need to do a bit of loop unrolling ourselves:

https://godbolt.org/z/88nM38hhn

It should give something like this:

    .LBB0_11:
            add     rdi, 4
            je      .LBB0_12
            movdqa  xmm1, xmmword ptr [r10 + 16]
            pcmpeqb xmm1, xmm0
            movdqa  xmm2, xmmword ptr [r10]
            pcmpeqb xmm2, xmm0
            movdqa  xmm3, xmmword ptr [r10 + 48]
            pcmpeqb xmm3, xmm0
            por     xmm3, xmm1
            movdqa  xmm1, xmmword ptr [r10 + 32]
            pcmpeqb xmm1, xmm0
            por     xmm1, xmm2
            movdqa  xmm2, xmmword ptr [r10 + 64]
            pcmpeqb xmm2, xmm0
            movdqa  xmm4, xmmword ptr [r10 + 80]
            pcmpeqb xmm4, xmm0
            movdqa  xmm5, xmmword ptr [r10 + 112]
            pcmpeqb xmm5, xmm0
            por     xmm5, xmm4
            por     xmm5, xmm3
            movdqa  xmm3, xmmword ptr [r10 + 96]
            pcmpeqb xmm3, xmm0
            por     xmm3, xmm2
            por     xmm3, xmm1
            por     xmm3, xmm5
            pmovmskb        r11d, xmm3
            sub     r10, -128
            test    r11d, r11d
            je      .LBB0_11

Using this trick I was able to get close to memchr() performance on position_simd() with u8's.
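
A rough `std::simd` sketch of that manual unrolling (assuming nightly `portable_simd`; the function name, `SIMD_LEN`, and `UNROLL` are illustrative, and whether the compiler actually emits the fully unrolled loop above depends on target and flags):

    // crate root: #![feature(portable_simd)]
    use std::simd::prelude::*;

    const SIMD_LEN: usize = 16; // illustrative lane count
    const UNROLL: usize = 4;    // SIMD chunks compared per loop iteration

    fn all_equal_unrolled(arr: &[u8]) -> bool {
        let Some(&first) = arr.first() else { return true };
        let needle = Simd::<u8, SIMD_LEN>::splat(first);

        let groups = arr.chunks_exact(SIMD_LEN * UNROLL);
        let tail = groups.remainder();

        for group in groups {
            // OR the "not equal" masks of every chunk in the group and branch
            // once per group, mirroring the por/pmovmskb pattern in the asm.
            let mut any_ne = Mask::<i8, SIMD_LEN>::splat(false);
            for chunk in group.chunks_exact(SIMD_LEN) {
                let v = Simd::<u8, SIMD_LEN>::from_slice(chunk);
                any_ne = any_ne | v.simd_ne(needle);
            }
            if any_ne.any() {
                return false;
            }
        }

        // Fewer than SIMD_LEN * UNROLL elements left; scalar is fine here.
        tail.iter().all(|x| *x == first)
    }

Folding the masks with OR means the early-exit branch only runs once per group of UNROLL chunks, which is what lets the loads and compares stay in flight.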

smu160 commented 3 months ago

Hi @LaihoE,

Thanks for getting back to me! I took the liberty of modifying your version just a bit. You can check it on godbolt.

The performance seems way too good to be true. Would you be able to run it on your end?

LaihoE commented 3 months ago

@smu160 I've also had some sus results with all_equal(); the other functions seem to have more reasonable results.

smu160 commented 3 months ago

Hi @LaihoE,

I realized I had a bug in the implementation I shared via godbolt. For some reason I used eq instead of ne when checking the chunks, which is probably the cause of the issue. Now I wish I could reproduce the examples that were triggering it so I can add a regression test.
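
In case it helps, here's a sketch of the kind of regression test that would catch an eq/ne mix-up, written against a hypothetical free-function variant `all_equal_simd(&[u8]) -> bool` of the snippet earlier in this thread (the name and `SIMD_LEN` are assumptions, not the crate's actual test setup):

    const SIMD_LEN: usize = 16; // must match the implementation's lane count

    #[test]
    fn all_equal_handles_both_directions_across_full_chunks() {
        // A fully equal slice spanning several full SIMD chunks must be true...
        let mut v = vec![7u8; 4 * SIMD_LEN];
        assert!(all_equal_simd(&v));

        // ...and a single mismatch buried inside a full chunk must flip it to false.
        v[2 * SIMD_LEN + 3] = 8;
        assert!(!all_equal_simd(&v));
    }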

Anyhow, I'll finish the benchmarks to see how it does compared to the other versions we mentioned here.