DaGenix / rust-crypto

A (mostly) pure-Rust implementation of various cryptographic algorithms.
Apache License 2.0

AES-CBC: llvm vectorizer optimizes xor-loop for 64 bytes of input #420

Open oberien opened 7 years ago

oberien commented 7 years ago

On my Skylake processor, if I compile rust-crypto with cargo rustc --release -- -C target-cpu=native, CbcEncryptorProcessor::process_block is unrolled and optimized with ymm registers for 64 bytes of input. If the input is smaller than that (which it is when using 128-bit AES), it falls back to an assembler loop that reads, xors, and writes byte by byte. Profiling with perf showed that this byte-wise loop accounts for over 25% of the overhead.

In particular, I'm talking about the following code:

        for ((&x, &y), o) in input.iter().zip(out_hist.iter()).zip(self.temp.iter_mut()) {
            *o = x ^ y;
        }

It produces roughly this assembler:

       cmp    r11,0x40 ; if smaller than 64 jump to byte-wise loop
       mov    edx,0x0
     ↓ jb     loop

       ; loop optimized for 64 bytes stands here

loop:  movzx  ebx,BYTE PTR [rax+rdx*1]
       xor    bl,BYTE PTR [r15+rdx*1]
       mov    BYTE PTR [rsi+rdx*1],bl
       inc    rdx
       cmp    rdx,r11
     ↑ jb     loop
end:   ...

As a test, I replaced the source code with the following (horrific) code:

        for ((a, b), c) in input.chunks(16).zip(out_hist.chunks(16)).zip(self.temp.chunks_mut(16)) {
            if a.len() == 16 && b.len() == 16 && c.len() == 16 {
                let x: &[u64; 2] = unsafe { ::std::mem::transmute(a.as_ptr()) };
                let y: &[u64; 2] = unsafe { ::std::mem::transmute(b.as_ptr()) };
                let o: &mut [u64; 2] = unsafe { ::std::mem::transmute(c.as_mut_ptr()) };
                o[0] = x[0] ^ y[0];
                o[1] = x[1] ^ y[1];
            } else {
                for ((&x, &y), o) in a.iter().zip(b.iter()).zip(c.iter_mut()) {
                    *o = x ^ y;
                }
            }
        }

This produced optimized assembler for 16 bytes, needing only two loads, two xors, and two writes. With that change, perf showed an overhead of only 14% for the encrypt function, which increased the actual throughput by about 10%.
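
For reference, on newer Rust the same two-u64-lane XOR can probably be expressed without transmute by going through u64::from_ne_bytes / to_ne_bytes. The sketch below is only an illustration of that idea, not a proposed patch; the helper name xor_cbc_block and its signature are made up here:

    fn xor_cbc_block(input: &[u8], out_hist: &[u8], out: &mut [u8]) {
        use std::convert::TryInto;
        if input.len() == 16 && out_hist.len() == 16 && out.len() == 16 {
            // Fast path: XOR the 16-byte block as two u64 lanes so LLVM can
            // emit two loads, two xors and two stores instead of a byte loop.
            for i in 0..2 {
                let x = u64::from_ne_bytes(input[i * 8..i * 8 + 8].try_into().unwrap());
                let y = u64::from_ne_bytes(out_hist[i * 8..i * 8 + 8].try_into().unwrap());
                out[i * 8..i * 8 + 8].copy_from_slice(&(x ^ y).to_ne_bytes());
            }
        } else {
            // Fallback for inputs that are not exactly one 16-byte block.
            for ((&x, &y), o) in input.iter().zip(out_hist.iter()).zip(out.iter_mut()) {
                *o = x ^ y;
            }
        }
    }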