BurntSushi / memchr

Optimized string search routines for Rust.
The Unlicense

Add initial aarch64 neon support #114

Closed: redzic closed this pull request 10 months ago

redzic commented 2 years ago

This PR adds a NEON implementation of all mem{r}chr functions. It does not add NEON support for memmem, as deeper changes to the Vector trait are needed for it to work efficiently with NEON; memmem support could be addressed in a future PR.

Since AArch64 was introduced as part of ARMv8, and since ARMv8 requires NEON support, no runtime detection mechanism is needed for AArch64.

redzic commented 2 years ago

@BurntSushi do you know how to run the entire benchmark suite for the functions in this crate only (i.e., not libc or anything else) so that they are formatted like they are in this comment: https://github.com/BurntSushi/memchr/pull/103#issuecomment-1000980153?

Oops, in the comment it says what command was used, sorry for the ping!

redzic commented 2 years ago

Here are the initial benchmark numbers (tested on an 8-core M1 Pro):

(screenshot of benchmark results, 2022-07-08)

So it seems like we only lose for small/tiny haystacks with more common occurrences. I think reducing unrolling would probably help in that case (since there is a lot of overhead in searching much more than we need to when the needle is likely to be found within only the first 16 bytes, especially for memchr2/3), but it seems difficult to analyze when to reduce unrolling based on the haystack size and frequency of needle occurrences in the haystack.

redzic commented 2 years ago

I think this is what is remaining for this PR:

  1. Ensure this works on big-endian (since there are apparently some big endian ARM processors). I'll try to see if I can run the test suite on QEMU or something like that to make sure it works.
  2. Try aligning memory accesses. I only saw about a 2-3% improvement (if that) on the memchr1/huge/never benchmark, but maybe it's still worth trying again and testing the other benchmarks as well.
  3. Probably should add some comments explaining how certain parts of the code work, since there are notable differences from the sse2/avx2 implementation (mostly that the x86 code uses movemask, while the ARM code uses masking and pairwise addition to ultimately figure out the matching index, and pairwise max to figure out whether the equality vector contains all zeros; see the sketch after this list).
  4. Add some kind of check in build.rs to disable the NEON code if the rustc version is too low. I'll have to do some more research to figure out how to do this.
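
For anyone unfamiliar with the trick mentioned in item 3, here is a rough sketch of the general masking + pairwise-addition approach (illustrative only, not the exact code in this PR): give each lane within an 8-byte half its own bit, AND that with the 0x00/0xFF comparison result, and collapse it with three rounds of pairwise addition.

#[cfg(target_arch = "aarch64")]
unsafe fn movemask_via_pairwise_add(eq: core::arch::aarch64::uint8x16_t) -> u16 {
    use core::arch::aarch64::*;
    // One distinct bit per lane within each 8-byte half.
    const BITS: [u8; 16] = [1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128];
    let masked = vandq_u8(eq, vld1q_u8(BITS.as_ptr()));
    // Because the bits are disjoint, pairwise addition acts like an OR; three
    // rounds reduce the 16 bytes to two: byte 0 holds the low-half mask and
    // byte 1 the high-half mask (lane/byte order assumes little-endian).
    let sum = vpaddq_u8(masked, masked);
    let sum = vpaddq_u8(sum, sum);
    let sum = vpaddq_u8(sum, sum);
    vgetq_lane_u16::<0>(vreinterpretq_u16_u8(sum))
}

From there, trailing_zeros on the resulting mask gives the matching index, just like with SSE2's movemask.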

Hoping to get any sort of initial feedback though to see if the PR is going in the right direction so far.

CeleritasCelery commented 1 year ago

@redzic What do you most need help with to complete this PR? Sounds like 2 and 4 from your list above are the most obvious candidates.

redzic commented 1 year ago

@CeleritasCelery Any help would be good, but I guess the "bureaucratic" stuff like the build.rs version check is what's preventing this from being merged right now. Or maybe @BurntSushi has just been busy and hasn't had time to review the code and make sure everything is correct and up to the same quality as the rest of the codebase (and, if I remember correctly, he didn't have an ARM machine to test on, which makes things more difficult).

CeleritasCelery commented 1 year ago

I don't believe we need a build.rs version check. If the rustc version is too low, cfg(target_feature = "neon") will evaluate to false and the NEON code won't be compiled. I tested this with 1.58.0 (the version before the NEON intrinsics were stabilized) and it worked as expected (the NEON code was conditionally removed).
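
As a minimal sketch of the kind of cfg gating I mean (module names are just illustrative, not the crate's actual layout):

#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
mod neon {
    // NEON implementations would live here; on compilers or targets where the
    // cfg above is false, this whole module is compiled out.
}

#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
mod fallback {
    // Portable fallback used everywhere else.
}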

BurntSushi commented 1 year ago

I plan to bump the MSRV to Rust 1.60 soon, so we definitely shouldn't need to worry about version sniffing here.

Also, I have an M2 mac mini in the mail so that I can actually start playing around and testing this stuff too. (And get it working for memmem and also Teddy in aho-corasick.)

CeleritasCelery commented 1 year ago

I took some time over the weekend to look into this implementation. I thought the use of pairwise addition as a replacement for movemask was pretty clever. Using highly generic code also saved about 200 lines overall. I do have a concern (perhaps unfounded) that it will be harder to reliably optimize code that is so generic, particularly when comparing [T; N] to just having multiple variables, but I don't have much to base that on and would need to test it out.

I did a couple of refactorings, like replacing manual loops with map.

before:

let mut nv: [uint8x16_t; N] = [vdupq_n_u8(0); N];
for i in 0..N {
    nv[i] = vdupq_n_u8(n[i]);
}

after:

let nv = n.map(|x| vdupq_n_u8(x));

I found an article from Arm about how to emulate SSE movemask in NEON. It even specifically mentions memchr. Their recommendation was to use shrn (shift right and narrow). I implemented that and saw a notable improvement in the mem{r}chr{2,3} benchmarks. However, the mem{r}chr1 benchmarks were more mixed: I saw improvements in every benchmark but the huge/common ones, which saw a significant 20% slowdown. I tracked this down to whether the "movemask" is hoisted out of the if-chain or not. If I precompute the masks, then huge/common is about as fast as the pairwise-add version (the current PR), but all the other benchmarks are 5-20% worse compared to computing the masks inside the if-chains. Not sure why that would be the case. My best guess is that huge/common tends to have the needle at the end of a 64-byte chunk, so computing upfront gives better ILP in that case.
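
For reference, here is a rough sketch of the shrn-based "movemask" emulation the article describes (illustrative, not the exact code in my branch):

#[cfg(target_arch = "aarch64")]
unsafe fn movemask_via_shrn(eq: core::arch::aarch64::uint8x16_t) -> u64 {
    use core::arch::aarch64::*;
    // Shift each 16-bit lane right by 4 and narrow to 8 bits: every input byte
    // contributes one nibble, so a 0x00/0xFF comparison vector becomes a u64
    // with 4 bits set per matching byte.
    let packed = vshrn_n_u16::<4>(vreinterpretq_u16_u8(eq));
    vget_lane_u64::<0>(vreinterpret_u64_u8(packed))
}

A nonzero result means there was a match, and mask.trailing_zeros() / 4 recovers the lane index, analogous to trailing_zeros on an SSE2 movemask.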

repo

my repo is located at https://github.com/CeleritasCelery/memchr/tree/aarch64-neon

@redzic let me know if you would like me to merge any of these changes into your PR.

benchmarks

current PR baseline compared to shrn version

(screenshot of benchmark comparison, 2023-07-10)

shrn compared with shrn and pre-computed masks

(screenshot of benchmark comparison, 2023-07-10)

BurntSushi commented 1 year ago

@CeleritasCelery Thank you! Especially for that link. I just got my M2 mac mini set up yesterday and I'm now looking into this. I think I grok all the critical parts of memchr on aarch64 except for one: the use of umaxp. The article says this:

We would like to note that the main loop in memchr still looks for the maximum value out of four 32-bit integers through umaxp instruction: when comparing, it checks that the max byte is not zero. If it is not, it uses shrn to get the mask. Experiments showed this is faster for strings (>128 characters) as on cores like Neoverse N1, it uses 2 pipelines V0/V1 whereas shrn uses only one pipeline V1, but both have the same latency. Such an approach showed better results overall and is suitable for more workloads. So, if you are checking for existence in a loop, consider using umaxp instruction followed by shrn: it might have a threshold where it is faster than only using shrn.

With the relevant snippet of code being:

L(loop32):
        ldr     qdata, [src, 16]!
        cmeq    vhas_chr.16b, vdata.16b, vrepchr.16b
        umaxp   vend.16b, vhas_chr.16b, vhas_chr.16b            /* 128->64 */
        fmov    synd, dend
        cbnz    synd, L(end)

And the specific thing they're talking about in the article, umaxp:

umaxp   vend.16b, vhas_chr.16b, vhas_chr.16b

But I don't understand the utility of this function. It looks like vend.16b is just going to be equivalent to vhas_chr.16b, since it's just taking the pairwise max of each element, but is using the same vector for both source registers.

The code then seems to go on to only check the lower 64 bits of vend.16b. Which... can't be right. So I'm missing something about how umaxp is being used. I'm pretty sure I'm just misreading it. Can anyone shine a light here?

I'll note that there is a horizontal max vector op, and that seems like what you'd want here. But that's umaxv, not umaxp. So why not use umaxv here?

BurntSushi commented 1 year ago

cc @danlark1 in case you were inclined to share your wisdom here. It looks like you are the author of https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon :-)

CeleritasCelery commented 1 year ago

But I don't understand the utility of this function. It looks like vend.16b is just going to be equivalent to vhas_chr.16b, since it's just taking the pairwise max of each element, but is using the same vector for both source registers. The code then seems to go on to only check the lower 64 bits of vend.16b. Which... can't be right. So I'm missing something about how umaxp is being used. I'm pretty sure I'm just misreading it. Can anyone shine a light here?

This is describing how we check if the vector is zero.

unsafe fn eq0(x: uint8x16_t) -> bool {
    let low = vreinterpretq_u64_u8(vpmaxq_u8(x, x));
    vgetq_lane_u64(low, 0) == 0
}

vpmaxq_u8 is the vector version of umaxp. It takes two 16-byte vectors and pairwise-reduces them down to a single 16-byte vector. Calling it on the same vector leads to the result of the pairwise max being duplicated in the upper 8 bytes and the lower 8 bytes. Then we can extract the lower 8 bytes as a u64 and compare them to zero. Or, if you apply a mask first, you could use a leading/trailing bit count to get the offset, just like with movemask.

Here is a diagram to (hopefully) make it clearer

a = [1, 2, 3, 4, ..., 15, 16; 16]
b = [1, 2, 3, 4, ..., 15, 16; 16]
# after calling vpmaxq on (a, b), each lane contains the max of a pair of adjacent lanes
vpmaxq = [1&2, 3&4, 5&6, ..., 15&16, 1&2, 3&4, ..., 15&16; 16]

danlark1 commented 1 year ago

Hi, thanks for including me

umaxp (unsigned maximum pairwise) is an instruction which takes 2 consecutive bytes of the concatenation of the second and third operands and produces the biggest byte of every pair. That means that if the second and third operands are the same, the first 16 bytes are reduced to 8 bytes where every result_byte[i] = max(source_byte[2 * i], source_byte[2 * i + 1]);

Then

fmov    synd, dend
cbnz    synd, L(end)

takes the 64-bit integer and compares it with zero, which is correct for checking whether there is any match inside the vector. As said, umaxp is better when the range is big enough, as it utilizes 2 pipelines whereas shrn uses just 1 on modern server hardware.

You are also correct about umaxv; it's an across-vector (horizontal) maximum. However, on server platforms like Neoverse N2 it has latency 2 or 4 and uses only the V1 pipeline, whereas umaxp has a latency of only 2 and can use both the V0/V1 pipelines.

Say, if you test it on Graviton 3, umaxp will be better.


For Apple I think it should be the same

SHRN is good for emulating movemask and thus pretty good for small/mid ranges for mem* functions. umaxp is good for long ranges and increasing the throughput. I believe it's consistent with the benchmarks.

Then there comes the question of what you are optimizing for. Most memchr calls are for strings up to 64-128 bytes, where shrn dominates. In the aarch64 optimized-routines we decided to use shrn for strings <= 16 bytes and umaxp for > 16 bytes, because we haven't seen much regression for small/mid ranges. I wanted to go with shrn only, since memchr on big strings is rare in production, but I did not push hard. Either version will be a great default in my opinion.
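
To make the tradeoff concrete, here is a rough sketch of the pattern in Rust intrinsics (illustrative only, not code from this repo or from optimized-routines): use the pairwise max as a cheap existence check for a 16-byte block, and only compute the shrn-style mask to locate the exact byte once a match is known to exist.

#[cfg(target_arch = "aarch64")]
unsafe fn find_in_block(block: core::arch::aarch64::uint8x16_t, needle: u8) -> Option<usize> {
    use core::arch::aarch64::*;
    let eq = vceqq_u8(block, vdupq_n_u8(needle));
    // umaxp-style existence check: the reduced low 64 bits are zero iff no lane matched.
    if vgetq_lane_u64::<0>(vreinterpretq_u64_u8(vpmaxq_u8(eq, eq))) == 0 {
        return None;
    }
    // shrn-style mask: 4 bits per matching lane, so trailing zeros / 4 is the index.
    let packed = vshrn_n_u16::<4>(vreinterpretq_u16_u8(eq));
    let mask = vget_lane_u64::<0>(vreinterpret_u64_u8(packed));
    Some(mask.trailing_zeros() as usize / 4)
}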

BurntSushi commented 1 year ago

Nice nice. So it turns out I was looking at vmaxq_u8 instead of vpmaxq_u8. Derp. Thank you all for the explanations, they were very helpful. Also thank you @danlark1 for that extra info about umaxv. I suspected as much that it would be sub-optimal in some respect.

BurntSushi commented 1 year ago

OK, so now that I've sat down and familiarized myself with NEON enough that I could write a memchr routine (and the memmem routines too), I've finally looked at this PR with a closer eye. I think that while this looks great, it is generally not my style and not something I'm keen on maintaining. The code is way too hard for me to follow; I generally prefer the duplicative but simpler code found in the x86-64 implementations. I'm going to experiment with writing memchr routines for aarch64 that more closely mimic the x86-64 routines, and potentially even make those generic over a lower-level Vector trait like I did for memmem. I'm not sure I'll wind up there, though, because of the movemask shenanigans. If I don't, my answer will be to just duplicate code.
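
For context, the kind of lower-level generic Vector trait I mean looks roughly like this (names and methods are illustrative, not the crate's actual API):

trait Vector: Copy {
    const SIZE: usize;
    // Broadcast one byte to every lane.
    unsafe fn splat(byte: u8) -> Self;
    // Load SIZE bytes from `data` with no alignment requirement.
    unsafe fn load_unaligned(data: *const u8) -> Self;
    // Lane-wise equality, producing 0xFF/0x00 per lane.
    unsafe fn cmpeq(self, other: Self) -> Self;
    // Pack the comparison result into a scalar bitmask; this is the part that
    // needs the movemask shenanigans on NEON.
    unsafe fn movemask(self) -> u32;
}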

jacobbramley commented 12 months ago

Hi! We (Arm) are happy to look at this kind of thing. I'm not sure if rustbot works here, but if it does, then you can ping us that way. Alternatively, we lurk on Zulip in t-compiler/arm (amongst other streams). If there's anywhere else we should watch for requests like this, please let us know!

BurntSushi commented 12 months ago

@jacobbramley Oh nice! There is no rustbot here, but once I have something for aarch64 I'll ping y'all for review. Thank you for chiming in.

llogiq commented 10 months ago

I think I have an alternative to the movemask dance: if we can afford a pre-loaded POS_MASK = uint8x16_t(0, 1, 2, 3, 4, ..., 15), we can use vminvq_u8(vorrq_u8(vmvnq_u8(eq), POS_MASK)) to get either the first matching position (in 0..16) or 255 if no match was found. Not sure if that's actually faster, but it seems to use fewer steps. I don't have an ARM machine to test with, though.
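
A sketch of what that would look like (untested, as noted; the helper name is made up):

#[cfg(target_arch = "aarch64")]
unsafe fn first_match_via_vminv(eq: core::arch::aarch64::uint8x16_t) -> u8 {
    use core::arch::aarch64::*;
    // Lane i holds its own index i.
    const POS_MASK: [u8; 16] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
    // NOT turns matching 0xFF lanes into 0x00; OR with POS_MASK then leaves
    // matching lanes holding their index and non-matching lanes holding 0xFF,
    // so the horizontal minimum is the first matching position, or 255 if none.
    vminvq_u8(vorrq_u8(vmvnq_u8(eq), vld1q_u8(POS_MASK.as_ptr())))
}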

BurntSushi commented 10 months ago

@llogiq I almost have this done. I'm basically in the final polishing/testing/measuring phases. It should be very easy to experiment with alternative approaches at that point. :-)

BurntSushi commented 10 months ago

Closed by #129 (see PR comment for benchmark results).

Thanks everyone for pushing this forward. I don't think I would have ended up with something so good without it. :)