On my Skylake processor, if I compile rust-crypto with cargo rustc --release -- -C target-cpu=native, then CbcEncryptorProcessor::process_block gets unrolled and optimized with ymm registers for 64 bytes of input. If the input is smaller than that (which it is when using 128-bit AES), it falls back to an assembler loop that reads, XORs and writes byte by byte. Testing with perf showed that over 25% of the overhead is caused by this.
In particular, I'm talking about the following code:
for ((&x, &y), o) in input.iter().zip(out_hist.iter()).zip(self.temp.iter_mut()) {
    *o = x ^ y;
}
It produces roughly this assembler:
cmp r11,0x40 ; if smaller than 64 jump to byte-wise loop
mov edx,0x0
↓ jb loop
; loop optimized for 64 bytes goes here
loop: movzx ebx,BYTE PTR [rax+rdx*1]
xor bl,BYTE PTR [r15+rdx*1]
mov BYTE PTR [rsi+rdx*1],bl
inc rdx
cmp rdx,r11
↑ jb loop
end: ...
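For anyone who wants to look at this codegen outside the crate, here is a minimal standalone sketch (the function name is mine, not from rust-crypto) that exhibits the same pattern; the generated assembly can be dumped with e.g. cargo rustc --release -- -C target-cpu=native --emit asm:

// Hypothetical standalone repro, not rust-crypto code: the same zip-and-XOR
// pattern over slices whose lengths the compiler cannot see at compile time.
pub fn xor_bytes(input: &[u8], out_hist: &[u8], temp: &mut [u8]) {
    for ((&x, &y), o) in input.iter().zip(out_hist.iter()).zip(temp.iter_mut()) {
        *o = x ^ y;
    }
}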
As a test, I replaced the source code with the following (horrific) code:
for ((a, b), c) in input.chunks(16).zip(out_hist.chunks(16)).zip(self.temp.chunks_mut(16)) {
    if a.len() == 16 && b.len() == 16 && c.len() == 16 {
        // Reinterpret each full 16-byte chunk as two u64s (assumes suitable alignment).
        let x: &[u64; 2] = unsafe { ::std::mem::transmute(a.as_ptr()) };
        let y: &[u64; 2] = unsafe { ::std::mem::transmute(b.as_ptr()) };
        let o: &mut [u64; 2] = unsafe { ::std::mem::transmute(c.as_mut_ptr()) };
        o[0] = x[0] ^ y[0];
        o[1] = x[1] ^ y[1];
    } else {
        // Tail shorter than 16 bytes: keep the original byte-wise loop.
        for ((&x, &y), o) in a.iter().zip(b.iter()).zip(c.iter_mut()) {
            *o = x ^ y;
        }
    }
}
This produced optimized assembler for 16 bytes, needing only two loads, XORs and stores per chunk. As a result, perf showed only 14% overhead for the encrypt function, which increased the actual throughput by about 10%.
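A transmute-free variant of the same idea is also worth mentioning. This is only a sketch of mine (assuming a compiler where u64::from_ne_bytes and slice-to-array TryInto are available), not what I put into the crate: it converts each full 16-byte chunk to two u64 values, XORs them, and writes the result back with to_ne_bytes, which sidesteps the alignment and aliasing questions of the pointer transmute while still giving the optimizer fixed-size loads and stores.

use std::convert::TryInto;

// Sketch of a transmute-free alternative (hypothetical helper, not rust-crypto
// code): XOR two byte slices into a third, 16 bytes at a time, falling back to
// a byte-wise loop for any tail shorter than 16 bytes.
fn xor_chunks(input: &[u8], out_hist: &[u8], temp: &mut [u8]) {
    for ((a, b), c) in input.chunks(16).zip(out_hist.chunks(16)).zip(temp.chunks_mut(16)) {
        if a.len() == 16 && b.len() == 16 && c.len() == 16 {
            for i in 0..2 {
                let lo = i * 8;
                let hi = lo + 8;
                // Unaligned-safe 8-byte loads, one u64 XOR, one 8-byte store.
                let x = u64::from_ne_bytes(a[lo..hi].try_into().unwrap());
                let y = u64::from_ne_bytes(b[lo..hi].try_into().unwrap());
                c[lo..hi].copy_from_slice(&(x ^ y).to_ne_bytes());
            }
        } else {
            for ((&x, &y), o) in a.iter().zip(b.iter()).zip(c.iter_mut()) {
                *o = x ^ y;
            }
        }
    }
}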