RustCrypto / stream-ciphers

Collection of stream cipher algorithms
253 stars 49 forks source link

salsa20: performance optimizations (e.g. SIMD) #50

Open tarcieri opened 5 years ago

tarcieri commented 5 years ago

There are two big optimizations we could do on both the chacha20 and salsa20 crates.

Avoid recomputing initial state

EDIT: both crates now have a new method to compute the initial state, and separate apply_keystream / generate methods to compute a block

RFC 8439 Section 3 describes caching the initial block state once computed as a performance optimization:

   Each block of ChaCha20 involves 16 move operations and one increment
   operation for loading the state, 80 each of XOR, addition and roll
   operations for the rounds, 16 more add operations and 16 XOR
   operations for protecting the plaintext.  Section 2.3 describes the
   ChaCha block function as "adding the original input words".  This
   implies that before starting the rounds on the ChaCha state, we copy
   it aside, only to add it in later.  This is correct, but we can save
   a few operations if we instead copy the state and do the work on the
   copy.  This way, for the next block you don't need to recreate the
   state, but only to increment the block counter.  This saves
   approximately 5.5% of the cycles.

SIMD support

Both ChaCha20 and Salsa20 are amenable to SIMD optimizations. We should add SIMD optimizations on x86/x86_64 at the very least.

x86/x86_64

Other CPU architectures

tarcieri commented 4 years ago

Changed topic to salsa20 as chacha20 is now optimized on x86.

chacha20 could still use e.g. NEON acceleration.