Improving cipher parallelism

tarcieri commented 3 years ago

Taking a step back from https://github.com/RustCrypto/traits/pull/354, I thought it'd be good to look how and where ILP and SIMD parallelism is currently used across the project as a whole, and how that could be improved.

The only place we presently have any sort of parallelism abstraction at the trait-level is BlockCipher::ParBlocks. Otherwise various crates leverage e.g. SIMD internally. Regarding BlockCipher::ParBlocks specifically, the only crate that leverages it is the aes crate.

The following crates have SIMD backends:

Ciphers

aes
chacha20

UHFs/"MACs"

polyval
poly1305

AEADs

In AEADs, we'd like to glue the above crates together in fairly fixed combinations in order to leverage ILP, passing SIMD buffers from ciphers to UHFs for authentication:

aes-gcm/aes-gcm-siv: aes + ghash/polyval
chacha20poly1305: chacha20 + poly1305

(also aes-siv and pmac, but this is less of a priority)

In either of these cases there's a single specific buffer type I think it'd be nice for both the cipher implementation and UHF to support in common:

aes-gcm/aes-gcm-siv: "i128x8" i.e. [__m128i; 8] on x86/x86_64
chacha20poly1305: "i256x4" i.e. [__m256i; 4] on x86/x86_64

Concrete proposal

My suggestion is to get rid of BlockCipher::ParBlocks and replace it with more general SIMD types and traits designed to work with them, namely:

Add a new utils crate e.g. simd-buffers which provides "i128x8" and "i256x4" SIMD buffer types which are backed by __m128i/__m256i on x86/x86_64 and otherwise provide a portable implementation. These types don't need to implement any sort of arithmetic, just provide wrappers for passing data between SIMD implementations.
Add traits to cipher and universal-hash which operate on SIMD buffers.
Use SIMD buffers types in the implementations of aes-gcm, aes-gcm-siv, and chacha20poly1305

`cipher` API suggestion

I'd suggest adding traits to cipher which use the SIMD buffer types which are useful for both block ciphers and stream ciphers.

I also think it might make sense to use a generic parameter rather than an associated type to permit support for multiple buffer types (e.g. on newer CPUs, "i128x4" might be a better option for AES, but we can support both):

/// Note that for practical purposes, we only need to support block cipher encryption,
/// but there could also be a `BlockDecryptPar` for completeness/consistency.
pub trait BlockEncryptPar<B: SimdBuffer> {
    fn encrypt_par(&self, buffer: &mut B);
}

pub trait StreamCipherPar<B: SimdBuffer> {
    fn try_apply_keystream_par(&mut self, buffer: &mut B) -> Result<(), LoopError>;
}

`universal-hash` API suggestion

pub trait UniversalHashPar<B: SimdBuffer> {
    fn update_par(&mut self, blocks: &B);
}

SIMD `ctr` support

Trying to move the end-user facing aes-ctr types into aes has created a very annoying circular dependency between the block-ciphers and stream-ciphers repo. Furthermore, ctr is quite a bit more general now than what the CTR types in the aes crate provide, and also aes doesn't actually provide the CTR "flavors" (Ctr32BE/Ctr32Le) needed by aes-gcm and aes-gcm-siv.

But really, it seems like the main benefit of the implementation in the aes crate is being able to use _mm_xor_si128 to XOR a "i128x8" type.

If we had BlockEncryptPar and StreamCipherPar traits, the ctr crate could glue the two together, accepting a SIMD buffer as input, computing the next buffer of keystream output, and XORing the latter into the former. This would allow ctr to be generally SIMD optimized, and also mean we only have one ctr implementation to worry about instead of a separate one in the AES crate.

newpavlov commented 3 years ago

I would love to remove ParBlocks, but right now I don't see how your proposal will help us to improve our API. If anything, I think it needlessly will expose implementation detail in the form of SIMD types, without solving the core issue: with runtime detection we can have varying ParBlocks depending on target capabilities (e.g. SSE vs AVX). It was the main stumbling block for my attempt to introduce "core stream API".

I think a better direction will be to design an API which will not expose the number of parallel blocks at type level altogether. Though it probably will make some things more difficult or slightly less efficient, e.g. stacking MAC + cipher algorithms in one-pass fashion.

Note that ctr keeps only one processed block in its state. I think other crates also should not depend on ParBlocks in their states. It may reduce performance a bit if message is processed in very small chunks, but it will improve state size.

created a very annoying circular dependency between the block-ciphers and stream-ciphers repo.

Currently we have to depend on ctr to reduce code duplication around block-buffer. I think it should be fixed by introducing "core stream API" and adding block-buffer wrapper to cipher. This way aes no longer will depend on ctr, thus removing the circular dependency.

aes doesn't actually provide the CTR "flavors" (Ctr32BE/Ctr32Le) needed by aes-gcm and aes-gcm-siv.

It's a temporary state since the flavors were introduced relatively recently.

But really, it seems like the main benefit of the implementation in the aes crate is being able to use _mm_xor_si128 to XOR a "i128x8" type.

No, the main benefit is that it makes it much easier for compiler to keep data in XMM registers without spilling them to stack. At this level it can create a noticeable performance regression.

tarcieri commented 3 years ago

I think it needlessly will expose implementation detail in the form of SIMD types

Without SIMD types that can be used between crates, we wind up round tripping data to byte arrays in the form of a bunch of intermediate _mm_loadu_si128/_mm_storeu_si128 calls.

I think the best way to avoid that is have native SIMD types that allow e.g. a cipher crate to hand off a SIMD buffer directly to a universal-hash crate.

with runtime detection we can have varying ParBlocks depending on target capabilities (e.g. SSE vs AVX).

By using a trait with a generic parameter, a cipher or UHF type can implement several SIMD buffer types, with the specific one selected at runtime.

If anything, ParBlocks is worse in this regard, since in its current form as an associated type, there can only be one ParBlocks for a given type.

it much easier for compiler to keep data in XMM registers without spilling them to stack.

I think the SIMD buffer types can provide that as well, particularly if we conditionally provide unsafe functions for operating on them with #[target_feature(enable = "...")] annotations for use in SIMD contexts, with a higher-level crate doing runtime detection of the desired SIMD feature combinations.

newpavlov commented 3 years ago

Without SIMD types that can be used between crates, we wind up round tripping data to byte arrays in the form of a bunch of intermediate _mm_loadu_si128/_mm_storeu_si128 calls.

The compiler is allowed to remove subsequent load/stores. Though it's indeed will not be easy to design API in such way which will allow compiler to reliably do it.

ParBlocks is worse in this regard, since in its current form as an associated type, there can only be one ParBlocks for a given type.

Can't the same approach be applied to ParBlocks? It also can be a generic parameter instead of associated type.

The problem is that with runtime detection we have two runtime switches in each crate, so compiler can not keep data in registers, since it sees branches on different memory locations, so it can not merge them.

I don't think we can have a reliable solution without an ability to compile dependency crates several times with different feature flags. With such feature we would've been able to push runtime detection as high as possible without sacrificing runtime switch capabilities at lower levels. But unfortunately there is not even pre-RFC for such feature and I think the community and the lang team do not currently have enough interest in improving situation in this area.

tarcieri commented 3 years ago

Can't the same approach be applied tol ParBlocks? It also can be a generic parameter instead of associated type.

Yes, but only if there were a separate trait for parallel operations. It wouldn't make sense to e.g. convert the current BlockCipher::ParBlocks associated type into a generic parameter of BlockCipher.

But then the question remains: what makes for a better SIMD buffer, a GenericArray<u8, _>, or dedicated types which can wrap SIMD registers?

I don't think using a GenericArray for this makes sense for a couple reasons:

As noted above, the cardinality of useful parallel buffer types is presently at 2
In general we have SIMD composability problems because all of the current SIMD code is written at a fairly low level, and I think this is due to a lack of shared SIMD types where we can centralize abstractions

tarcieri commented 3 years ago

I opened https://github.com/RustCrypto/utils/pull/221 which contains a WIP prototype of a simd-buffers crate.

newpavlov commented 2 years ago

I think we can close this issue with cipher v0.4 being released. It uses a different approach than outlined here, but it works good enough.

tarcieri commented 1 year ago

I'm reopening this as we continue to get complaints about performance, and attempting to implement one-pass operation for either AES-GCM or ChaCha20Poly1305 does not yield the expected speedups:

RustCrypto / traits