There are two big optimizations we could do on both the chacha20 and salsa20 crates.
Avoid recomputing initial state
EDIT: both crates now have a `new` method to compute the initial state, and separate `apply_keystream` / `generate` methods to compute a block
[x] chacha20 crate
[x] salsa20 crate
RFC 8439 Section 3 describes caching the initial block state once computed as a performance optimization:
Each block of ChaCha20 involves 16 move operations and one increment
operation for loading the state, 80 each of XOR, addition and roll
operations for the rounds, 16 more add operations and 16 XOR
operations for protecting the plaintext. Section 2.3 describes the
ChaCha block function as "adding the original input words". This
implies that before starting the rounds on the ChaCha state, we copy
it aside, only to add it in later. This is correct, but we can save
a few operations if we instead copy the state and do the work on the
copy. This way, for the next block you don't need to recreate the
state, but only to increment the block counter. This saves
approximately 5.5% of the cycles.
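The optimization the RFC describes could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual code of either crate: the `CachedState` type and its `next_block` method are hypothetical names chosen for this example. The initial state is built once in `new`, each block works on a copy of it, and only the block counter changes between blocks.

```rust
/// ChaCha quarter round on four words of the working state (RFC 8439, Section 2.1).
fn quarter_round(s: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    s[a] = s[a].wrapping_add(s[b]);
    s[d] = (s[d] ^ s[a]).rotate_left(16);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_left(12);
    s[a] = s[a].wrapping_add(s[b]);
    s[d] = (s[d] ^ s[a]).rotate_left(8);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_left(7);
}

/// Hypothetical type for illustration only; not the `chacha20` crate's API.
pub struct CachedState {
    /// Initial block state, computed once and reused for every block.
    state: [u32; 16],
}

impl CachedState {
    /// Builds the initial state: constants, key, block counter, nonce.
    pub fn new(key: &[u8; 32], nonce: &[u8; 12], counter: u32) -> Self {
        let mut state = [0u32; 16];
        // "expand 32-byte k" as four little-endian words.
        state[0] = 0x6170_7865;
        state[1] = 0x3320_646e;
        state[2] = 0x7962_2d32;
        state[3] = 0x6b20_6574;
        for i in 0..8 {
            state[4 + i] =
                u32::from_le_bytes([key[4 * i], key[4 * i + 1], key[4 * i + 2], key[4 * i + 3]]);
        }
        state[12] = counter;
        for i in 0..3 {
            state[13 + i] = u32::from_le_bytes([
                nonce[4 * i],
                nonce[4 * i + 1],
                nonce[4 * i + 2],
                nonce[4 * i + 3],
            ]);
        }
        CachedState { state }
    }

    /// Computes one 64-byte keystream block, then advances the counter.
    pub fn next_block(&mut self) -> [u8; 64] {
        // Run the rounds on a copy so the cached state stays intact.
        let mut work = self.state;
        for _ in 0..10 {
            // Column rounds.
            quarter_round(&mut work, 0, 4, 8, 12);
            quarter_round(&mut work, 1, 5, 9, 13);
            quarter_round(&mut work, 2, 6, 10, 14);
            quarter_round(&mut work, 3, 7, 11, 15);
            // Diagonal rounds.
            quarter_round(&mut work, 0, 5, 10, 15);
            quarter_round(&mut work, 1, 6, 11, 12);
            quarter_round(&mut work, 2, 7, 8, 13);
            quarter_round(&mut work, 3, 4, 9, 14);
        }
        // "Adding the original input words", then serializing little-endian.
        let mut out = [0u8; 64];
        for i in 0..16 {
            let word = work[i].wrapping_add(self.state[i]);
            out[4 * i..4 * i + 4].copy_from_slice(&word.to_le_bytes());
        }
        // The only per-block change to the cached state.
        self.state[12] = self.state[12].wrapping_add(1);
        out
    }
}
```

Calling `next_block` repeatedly on one value yields consecutive keystream blocks without touching the key or nonce again. The sketch ignores counter overflow and key zeroization, which a real implementation would need to handle.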
SIMD support
Both ChaCha20 and Salsa20 are amenable to SIMD optimizations. We should add SIMD optimizations on x86/x86_64 at the very least.
x86/x86_64
  chacha20
  salsa20
Other CPU architectures
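To illustrate why ChaCha20 maps well onto SIMD, here is a rough, portable sketch of the row-wise layout such implementations commonly use. It is an assumption-laden model, not code from either crate: the `Row` type is a plain `[u32; 4]` standing in for one 128-bit register (for example an SSE2 `__m128i`), and all function names are hypothetical. A real implementation would replace the helpers with `core::arch` intrinsics. A scalar double round is included only so the two can be compared.

```rust
/// Four 32-bit lanes; stands in for one 128-bit SIMD register.
type Row = [u32; 4];

/// Lane-wise wrapping addition (one vector add in a SIMD version).
fn add(a: Row, b: Row) -> Row {
    [
        a[0].wrapping_add(b[0]),
        a[1].wrapping_add(b[1]),
        a[2].wrapping_add(b[2]),
        a[3].wrapping_add(b[3]),
    ]
}

/// Lane-wise XOR followed by a left rotation by `n` bits.
fn xor_rotl(a: Row, b: Row, n: u32) -> Row {
    [
        (a[0] ^ b[0]).rotate_left(n),
        (a[1] ^ b[1]).rotate_left(n),
        (a[2] ^ b[2]).rotate_left(n),
        (a[3] ^ b[3]).rotate_left(n),
    ]
}

/// Lane shuffle: lane i of the result takes lane (i + n) % 4 of the input.
fn shuffle(a: Row, n: usize) -> Row {
    [a[n % 4], a[(n + 1) % 4], a[(n + 2) % 4], a[(n + 3) % 4]]
}

/// One quarter round applied to all four lanes at once.
fn quarter_round_rows(r: &mut [Row; 4]) {
    r[0] = add(r[0], r[1]);
    r[3] = xor_rotl(r[3], r[0], 16);
    r[2] = add(r[2], r[3]);
    r[1] = xor_rotl(r[1], r[2], 12);
    r[0] = add(r[0], r[1]);
    r[3] = xor_rotl(r[3], r[0], 8);
    r[2] = add(r[2], r[3]);
    r[1] = xor_rotl(r[1], r[2], 7);
}

/// Column round plus diagonal round on the row-wise state.
pub fn double_round_rows(r: &mut [Row; 4]) {
    // With rows as registers, the four column quarter rounds run in parallel.
    quarter_round_rows(r);
    // Shuffle rows 1..3 so that the diagonals line up as columns.
    r[1] = shuffle(r[1], 1);
    r[2] = shuffle(r[2], 2);
    r[3] = shuffle(r[3], 3);
    quarter_round_rows(r);
    // Undo the shuffles to restore the original layout.
    r[1] = shuffle(r[1], 3);
    r[2] = shuffle(r[2], 2);
    r[3] = shuffle(r[3], 1);
}

/// Scalar quarter round, used only by the reference below.
fn qr(s: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    s[a] = s[a].wrapping_add(s[b]);
    s[d] = (s[d] ^ s[a]).rotate_left(16);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_left(12);
    s[a] = s[a].wrapping_add(s[b]);
    s[d] = (s[d] ^ s[a]).rotate_left(8);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_left(7);
}

/// Scalar double round (RFC 8439 ordering), for comparison.
pub fn double_round_scalar(s: &mut [u32; 16]) {
    qr(s, 0, 4, 8, 12);
    qr(s, 1, 5, 9, 13);
    qr(s, 2, 6, 10, 14);
    qr(s, 3, 7, 11, 15);
    qr(s, 0, 5, 10, 15);
    qr(s, 1, 6, 11, 12);
    qr(s, 2, 7, 8, 13);
    qr(s, 3, 4, 9, 14);
}
```

The sketch covers ChaCha20 only. Each helper corresponds to a single vector instruction or a short sequence of them, which is what makes the row-wise form attractive on x86/x86_64.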