Open smu160 opened 3 months ago
The reason `as_simd()` gives the odd splits is to guarantee that the middle elements are aligned to `simd_len`, allowing us to do `load_aligned` with SIMD. This should be a little faster than unaligned loads, especially with bigger N. On small N it is very possible that `chunks_exact` is faster.
I was also looking at the performance yesterday (it was much slower than `memchr` etc.) and noticed that the `if` statements complicate unrolling, so we need to do a bit of loop unrolling ourselves:
https://godbolt.org/z/88nM38hhn
should give something like this:
```asm
.LBB0_11:
        add     rdi, 4
        je      .LBB0_12
        movdqa  xmm1, xmmword ptr [r10 + 16]
        pcmpeqb xmm1, xmm0
        movdqa  xmm2, xmmword ptr [r10]
        pcmpeqb xmm2, xmm0
        movdqa  xmm3, xmmword ptr [r10 + 48]
        pcmpeqb xmm3, xmm0
        por     xmm3, xmm1
        movdqa  xmm1, xmmword ptr [r10 + 32]
        pcmpeqb xmm1, xmm0
        por     xmm1, xmm2
        movdqa  xmm2, xmmword ptr [r10 + 64]
        pcmpeqb xmm2, xmm0
        movdqa  xmm4, xmmword ptr [r10 + 80]
        pcmpeqb xmm4, xmm0
        movdqa  xmm5, xmmword ptr [r10 + 112]
        pcmpeqb xmm5, xmm0
        por     xmm5, xmm4
        por     xmm5, xmm3
        movdqa  xmm3, xmmword ptr [r10 + 96]
        pcmpeqb xmm3, xmm0
        por     xmm3, xmm2
        por     xmm3, xmm1
        por     xmm3, xmm5
        pmovmskb r11d, xmm3
        sub     r10, -128
        test    r11d, r11d
        je      .LBB0_11
```
Using this trick I was able to get close to `memchr()` performance in `position_simd()` with `u8`s.
Hi @LaihoE,
Thanks for getting back to me! I took the liberty of modifying your version just a bit. You can check it on godbolt.
The performance seems way too good to be true. Would you be able to run it on your end?
@smu160 I've also had some suspicious results with `all_equal()`; the other functions seem to have more reasonable results.
Hi @LaihoE,
I realized I had a bug in the implementation I shared via godbolt. For some reason I put `eq` instead of `ne` when checking the chunks. That's probably the cause of the issue. Now I wish I could reproduce the examples that were causing it so I can add a regression test.
Anyhow, I'll finish the benchmarks to see how it does compared to the other versions we mentioned here.
The `as_simd` call is not guaranteed to give us the perfect output of `prefix`, `middle`, `suffix` (where we know `middle` will have everything we need). You'd probably have to guarantee some sort of alignment in order to make that work well. So we'll always have to check the prefix/suffix, and there's no guarantee that the middle will even have any elements. We can get around this by going back to the slice API and just using `chunks_exact(SIMD_LEN)` followed by checking `chunks_exact(SIMD_LEN).remainder()`. For example:
You can see the generated asm for yourself on godbolt. You'll have to comment out my version and uncomment yours to see the difference in cycle count. Of course, the proof is in the pudding. Using the benchmarks in PR #8, I got great preliminary results!
Let me know your thoughts. Thank you!!