salsa20: performance optimizations (e.g. SIMD)

There are two big optimizations we could do on both the chacha20 and salsa20 crates.

Avoid recomputing initial state

EDIT: both crates now have a new method to compute the initial state, and separate apply_keystream / generate methods to compute a block

[x] chacha20 crate
[x] salsa20 crate

RFC 8439 Section 3 describes caching the initial block state once computed as a performance optimization:

   Each block of ChaCha20 involves 16 move operations and one increment
   operation for loading the state, 80 each of XOR, addition and roll
   operations for the rounds, 16 more add operations and 16 XOR
   operations for protecting the plaintext.  Section 2.3 describes the
   ChaCha block function as "adding the original input words".  This
   implies that before starting the rounds on the ChaCha state, we copy
   it aside, only to add it in later.  This is correct, but we can save
   a few operations if we instead copy the state and do the work on the
   copy.  This way, for the next block you don't need to recreate the
   state, but only to increment the block counter.  This saves
   approximately 5.5% of the cycles.

SIMD support

Both ChaCha20 and Salsa20 are amenable to SIMD optimizations. We should add SIMD optimizations on x86/x86_64 at the very least.

`x86`/`x86_64`

chacha20
- [x] SSE2 (#61)
- [x] AVX2 (#83, #87)
[ ] salsa20
- [ ] SSE2
  - [ ] AVX2

Other CPU architectures

ARM?

RustCrypto / stream-ciphers

salsa20: performance optimizations (e.g. SIMD) #50

Avoid recomputing initial state

SIMD support

`x86`/`x86_64`

Other CPU architectures

RustCrypto / stream-ciphers

salsa20: performance optimizations (e.g. SIMD) #50

Avoid recomputing initial state

SIMD support

x86/x86_64

Other CPU architectures

`x86`/`x86_64`