tarcieri opened 3 years ago
I would love to remove `ParBlocks`, but right now I don't see how your proposal will help us to improve our API. If anything, I think it would needlessly expose implementation details in the form of SIMD types without solving the core issue: with runtime detection we can have varying `ParBlocks` depending on target capabilities (e.g. SSE vs AVX). That was the main stumbling block for my attempt to introduce the "core stream API".
I think a better direction would be to design an API which does not expose the number of parallel blocks at the type level at all, though that will probably make some things more difficult or slightly less efficient, e.g. stacking MAC + cipher algorithms in a one-pass fashion.
Note that `ctr` keeps only one processed block in its state. I think other crates also should not depend on `ParBlocks` in their state. It may reduce performance a bit if a message is processed in very small chunks, but it keeps state sizes small.
> created a very annoying circular dependency between the block-ciphers and stream-ciphers repos.
Currently we have to depend on `ctr` to reduce code duplication around `block-buffer`. I think this should be fixed by introducing the "core stream API" and adding a `block-buffer` wrapper to `cipher`. This way `aes` will no longer depend on `ctr`, thus removing the circular dependency.
> aes doesn't actually provide the CTR "flavors" (Ctr32BE/Ctr32LE) needed by aes-gcm and aes-gcm-siv.
This is a temporary state, since the flavors were introduced relatively recently.
> But really, it seems like the main benefit of the implementation in the aes crate is being able to use _mm_xor_si128 to XOR an "i128x8" type.
No, the main benefit is that it makes it much easier for the compiler to keep data in XMM registers without spilling it to the stack. At this level such spills can cause a noticeable performance regression.
> I think it would needlessly expose implementation details in the form of SIMD types
Without SIMD types that can be used between crates, we wind up round-tripping data through byte arrays in the form of a bunch of intermediate `_mm_loadu_si128`/`_mm_storeu_si128` calls.
I think the best way to avoid that is to have native SIMD types that allow e.g. a `cipher` crate to hand off a SIMD buffer directly to a `universal-hash` crate.
> with runtime detection we can have varying `ParBlocks` depending on target capabilities (e.g. SSE vs AVX).
By using a trait with a generic parameter, a cipher or UHF type can implement support for several SIMD buffer types, with the specific one selected at runtime.
If anything, `ParBlocks` is worse in this regard, since in its current form as an associated type, there can only be one `ParBlocks` for a given type.
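To illustrate the difference, here is a minimal sketch (hypothetical trait and type names, not the current `cipher` API): an associated type pins a cipher to exactly one parallel-block width, while a generic parameter lets the same cipher type support several, with the caller (or runtime detection) picking one.

```rust
// Associated-type form: only one ParBlocks width per implementing type.
trait BlockCipherAssoc {
    type ParBlocks;
    fn encrypt_par(&self, blocks: &mut Self::ParBlocks);
}

// Generic-parameter form: the same cipher type can be implemented for
// several buffer types (here plain arrays stand in for SIMD buffers).
trait BlockCipherPar<Buffer> {
    fn encrypt_par(&self, blocks: &mut Buffer);
}

struct Aes128Enc; // round keys elided

impl BlockCipherPar<[u128; 4]> for Aes128Enc {
    fn encrypt_par(&self, _blocks: &mut [u128; 4]) { /* 4-block path */ }
}

impl BlockCipherPar<[u128; 8]> for Aes128Enc {
    fn encrypt_par(&self, _blocks: &mut [u128; 8]) { /* 8-block path */ }
}
```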
> it makes it much easier for the compiler to keep data in XMM registers without spilling it to the stack
I think the SIMD buffer types can provide that as well, particularly if we conditionally provide `unsafe` functions for operating on them with `#[target_feature(enable = "...")]` annotations for use in SIMD contexts, with a higher-level crate doing runtime detection of the desired SIMD feature combinations.
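A rough sketch of that pattern (hypothetical function names; the buffer here is just a byte array for brevity): an `unsafe` inner function compiled with the target feature enabled, plus a safe wrapper that performs runtime detection.

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn xor_in_place_avx2(a: &mut [u8; 64], b: &[u8; 64]) {
    // With AVX2 enabled for this function, the compiler can keep the data
    // in SIMD registers and vectorize the loop without spilling to the stack.
    for i in 0..64 {
        a[i] ^= b[i];
    }
}

#[cfg(target_arch = "x86_64")]
fn xor_in_place(a: &mut [u8; 64], b: &[u8; 64]) {
    if is_x86_feature_detected!("avx2") {
        // SAFETY: AVX2 support was verified at runtime just above.
        unsafe { xor_in_place_avx2(a, b) }
    } else {
        for i in 0..64 {
            a[i] ^= b[i];
        }
    }
}
```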
> Without SIMD types that can be used between crates, we wind up round-tripping data through byte arrays in the form of a bunch of intermediate `_mm_loadu_si128`/`_mm_storeu_si128` calls.
The compiler is allowed to elide such back-to-back load/store pairs, though it will indeed not be easy to design the API in a way which lets the compiler do so reliably.
> `ParBlocks` is worse in this regard, since in its current form as an associated type, there can only be one `ParBlocks` for a given type.
Can't the same approach be applied to `ParBlocks`? It could also be a generic parameter instead of an associated type.
The problem is that with runtime detection we get two runtime switches, one in each crate, so the compiler cannot keep data in registers: it sees branches on different memory locations and cannot merge them.
I don't think we can have a reliable solution without the ability to compile dependency crates several times with different feature flags. With such a feature we would be able to push runtime detection as high as possible without sacrificing runtime-switch capabilities at lower levels. But unfortunately there is not even a pre-RFC for such a feature, and I think the community and the lang team do not currently have enough interest in improving the situation in this area.
> Can't the same approach be applied to `ParBlocks`? It could also be a generic parameter instead of an associated type.
Yes, but only if there were a separate trait for parallel operations. It wouldn't make sense to e.g. convert the current `BlockCipher::ParBlocks` associated type into a generic parameter of `BlockCipher`.
But then the question remains: what makes for a better SIMD buffer, a `GenericArray<u8, _>`, or dedicated types which can wrap SIMD registers?
I don't think using a `GenericArray` for this makes sense, for a couple of reasons:
I opened https://github.com/RustCrypto/utils/pull/221, which contains a WIP prototype of a `simd-buffers` crate.
I think we can close this issue with `cipher` v0.4 being released. It uses a different approach than the one outlined here, but it works well enough.
I'm reopening this as we continue to get complaints about performance, and attempting to implement one-pass operation for either AES-GCM or ChaCha20Poly1305 does not yield the expected speedups:
Taking a step back from https://github.com/RustCrypto/traits/pull/354, I thought it'd be good to look at how and where ILP and SIMD parallelism is currently used across the project as a whole, and how that could be improved.
The only place we presently have any sort of parallelism abstraction at the trait level is `BlockCipher::ParBlocks`. Otherwise various crates leverage e.g. SIMD internally. Regarding `BlockCipher::ParBlocks` specifically, the only crate that leverages it is the `aes` crate.

The following crates have SIMD backends:
Ciphers

- `aes`
- `chacha20`

UHFs/"MACs"

- `polyval`
- `poly1305`

AEADs
In AEADs, we'd like to glue the above crates together in fairly fixed combinations in order to leverage ILP, passing SIMD buffers from ciphers to UHFs for authentication:

- `aes-gcm`/`aes-gcm-siv`: `aes` + `ghash`/`polyval`
- `chacha20poly1305`: `chacha20` + `poly1305`

(also `aes-siv` and `pmac`, but this is less of a priority)

In either of these cases there's a single specific buffer type I think it'd be nice for both the cipher implementation and the UHF to support in common:
- `aes-gcm`/`aes-gcm-siv`: "i128x8", i.e. `[__m128i; 8]` on x86/x86_64
- `chacha20poly1305`: "i256x4", i.e. `[__m256i; 4]` on x86/x86_64

Concrete proposal
My suggestion is to get rid of `BlockCipher::ParBlocks` and replace it with more general SIMD types and traits designed to work with them, namely:

- a `utils` crate, e.g. `simd-buffers`, which provides "i128x8" and "i256x4" SIMD buffer types backed by `__m128i`/`__m256i` on x86/x86_64 and otherwise providing a portable implementation. These types don't need to implement any sort of arithmetic, just provide wrappers for passing data between SIMD implementations.
- traits in `cipher` and `universal-hash` which operate on SIMD buffers.
- use of the above in `aes-gcm`, `aes-gcm-siv`, and `chacha20poly1305`.
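A rough sketch of what one of those buffer types could look like (hypothetical code, not the contents of the PR above): it carries no arithmetic of its own and exists purely to pass data between SIMD implementations without bouncing through byte arrays.

```rust
/// Eight 128-bit blocks, i.e. "i128x8", backed by `__m128i` on x86_64.
#[cfg(target_arch = "x86_64")]
#[repr(transparent)]
pub struct I128x8(pub [core::arch::x86_64::__m128i; 8]);

/// Portable fallback with the same size and layout on other targets.
#[cfg(not(target_arch = "x86_64"))]
#[repr(transparent)]
pub struct I128x8(pub [[u8; 16]; 8]);
```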
`cipher` API suggestion

I'd suggest adding traits to `cipher` which use the SIMD buffer types and which are useful for both block ciphers and stream ciphers. I also think it might make sense to use a generic parameter rather than an associated type to permit support for multiple buffer types (e.g. on newer CPUs, "i128x4" might be a better option for AES, but we can support both):
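A rough sketch of what such traits could look like (hypothetical signatures, not the actual proposal), with the buffer as a generic parameter:

```rust
/// Hypothetical trait: encrypt a buffer of blocks in place, where `Buffer`
/// is one of the SIMD buffer types (e.g. "i128x4" or "i128x8" on x86_64).
pub trait BlockEncryptPar<Buffer> {
    fn encrypt_par_blocks(&self, blocks: &mut Buffer);
}

/// The same idea for stream ciphers: XOR a buffer's worth of keystream
/// into the provided blocks.
pub trait StreamCipherPar<Buffer> {
    fn apply_keystream_par(&mut self, blocks: &mut Buffer);
}
```

`aes` could then implement `BlockEncryptPar` for both an "i128x4" and an "i128x8" buffer type, with the caller selecting between them based on runtime feature detection.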
`universal-hash` API suggestion

SIMD `ctr` support

Trying to move the end-user-facing `aes-ctr` types into `aes` has created a very annoying circular dependency between the `block-ciphers` and `stream-ciphers` repos. Furthermore, `ctr` is quite a bit more general now than what the CTR types in the `aes` crate provide, and `aes` doesn't actually provide the CTR "flavors" (`Ctr32BE`/`Ctr32LE`) needed by `aes-gcm` and `aes-gcm-siv`.

But really, it seems like the main benefit of the implementation in the `aes` crate is being able to use `_mm_xor_si128` to XOR an "i128x8" type.

If we had `BlockEncryptPar` and `StreamCipherPar` traits, the `ctr` crate could glue the two together, accepting a SIMD buffer as input, computing the next buffer of keystream output, and XORing the latter into the former. This would allow `ctr` to be generically SIMD-optimized, and also mean we only have one `ctr` implementation to worry about instead of a separate one in the AES crate.
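A rough sketch of how that glue could look (hypothetical names, reusing the `BlockEncryptPar` idea from above; counter encoding and the XOR primitive are elided behind helper traits):

```rust
/// Same hypothetical trait as sketched above.
pub trait BlockEncryptPar<Buffer> {
    fn encrypt_par_blocks(&self, blocks: &mut Buffer);
}

/// XOR one buffer into another; on x86/x86_64 an "i128x8" buffer could
/// implement this with `_mm_xor_si128`, elsewhere with plain byte XOR.
pub trait BufferXor {
    fn xor_assign(&mut self, other: &Self);
}

pub struct Ctr<C> {
    cipher: C,
    // counter state elided
}

impl<C> Ctr<C> {
    /// Generate the next counter blocks, encrypt them in parallel,
    /// then XOR the resulting keystream into the caller's data buffer.
    pub fn apply_keystream_par<B>(&mut self, data: &mut B)
    where
        C: BlockEncryptPar<B>,
        B: BufferXor,
    {
        let mut keystream = self.next_counter_blocks::<B>();
        self.cipher.encrypt_par_blocks(&mut keystream);
        data.xor_assign(&keystream);
    }

    fn next_counter_blocks<B>(&mut self) -> B {
        // Counter encoding (e.g. Ctr32BE vs Ctr32LE flavors) elided.
        unimplemented!()
    }
}
```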