aes-soft: Improve Bitslicing Performance

RustCrypto / block-ciphers

Collection of block cipher algorithms written in pure Rust

671 stars 131 forks source link

aes-soft: Improve Bitslicing Performance #65

Closed zer0x64 closed 4 years ago

zer0x64 commented 4 years ago

NCCGroup recently did a formal audit of RustCrypto's AES/GCM and ChaCha20Poly1305 for MobileCoin and published the report here: https://research.nccgroup.com/2020/02/26/public-report-rustcrypto-aes-gcm-and-chacha20poly1305-implementation-review/

One of the issue was regarding the bitslicing implementation of the crate aes-soft which could be made more efficient. This is regarding performance, it has no effect on the security of the implementation.

tarcieri commented 4 years ago

Something interesting to consider here, particularly for embedded (Cortex-M) and other 32-bit targets is providing an AES implementation based on fixslicing. The original paper on the technique is here (applied to the GIFT block cipher):

https://eprint.iacr.org/2020/412

The same researchers are working to apply the same techniques to AES and have a paper under submission with C and assembly code for AES-128 which I have early access to, and also with the goal of implementing AES-256 using the technique as well.

I can link it here once final publication is complete, or if you're interested in getting early access, I can put you in touch with the authors.

newpavlov commented 4 years ago

Interesting! But I would like to finish #135 and work on some other issues first, so I think I will wait the final version. Well, unless the authors are interested in including Rust code as well. :)

BTW for those who will read this issue, but not the PR. The audit states:

Interestingly, theaes-softimplementation works over 32-bit values (u32) but defines a custom 128-bit type. <..> It is unclear why it is done that way. If the actual registers are 32-bit values, there should be no additional benefit toincreasing parallel evaluation beyond what can fit in a register: an XOR of two 32-bit registers is one operation, but theXOR of two 128-bit custom types will involve four individual 32-bit XOR operations

Doing so does improve performance, since compiler will store u32x4 in xmm registers and will use SIMD instructions on them (SSE2 is enabled by default for x86-64 targets).

tarcieri commented 4 years ago

Yeah, sounds good, I think #135 will be very helpful.

I think the main interesting thing we can get out of the fixslice implementation is the Thumb assembly implementation. They report new speed records on these platforms (<100cpb, I'll leave the final number to be published in the paper).

tarcieri commented 4 years ago

We've done a lot of work in this recently, most notably #174, #176, #177, and #180, but also #171 and #175.

AES encryption now uses "fixslicing" (#174), the fastest available approach, and hopefully makes aes-soft state-of-the-art there. #180 expands fixslicing with a 64-bit equivalent.

Decryption speed is still not great, but it's not needed for CTR mode ciphers (which are the main ones we care about). It's potentially possible to use fixslicing for decryption as well.

If someone cares greatly about immediately improving decryption speed, we can reopen this issue.