RustCrypto / stream-ciphers

Collection of stream cipher algorithms

`chacha20` is missing `.zeroize()` for the SIMD backends #336

Closed nstilt1 closed 3 months ago

nstilt1 commented 8 months ago

I've noticed that #333 is missing zeroize for the SIMD backends, and that the zeroize crate seems to support SIMD registers. I can identify two ways of incorporating zeroize; both, however, would require the MSRV to be increased to 1.60.

Method 1

The first method is fairly easy, as it requires a relatively small amount of code, but it is a little inefficient. Basically, .zeroize() could be called on the SIMD results arrays, as well as the state arrays, after generating results.

Pros:

Cons:

Method 2

This would involve reimplementing some features that chacha20 previously had (persisting Core structs via autodetect.rs and backend.rs). The persisting Core structs can provide a few benefits:

Cons:

Here's a link to v0.8.1 for reference. I will need it if I will be adding the functionality back: https://github.com/RustCrypto/stream-ciphers/blob/338c078d731692fba3b8256e45de2c3e334d46d8/chacha20/src/backend.rs

nstilt1 commented 8 months ago

I've begun working on the second option... however... with the 1.60 MSRV, Rudra might not work on it since it is on nightly 1.58. I'll probably add the zeroize functionality last.

tarcieri commented 8 months ago

We're about to start making breaking changes to all of the crates in this repo. I think it's fine for your PRs to assume that, including an MSRV bump.

nstilt1 commented 8 months ago

I've made some progress with the second option, but even before adding zeroize, having a union for the x86 backends seems to cause avx2 performance to decrease by about 15%. I've got 4 branches where I tried this, and only 1 branch operates at less than 1.0 cpb, but it looks nasty and I don't endorse that branch. Also, I've adjusted the newer branches somewhat so that ChaChaCore<R, V> lives in autodetect.rs, so no need to worry about that aspect.

What the branch that has decent performance does:

Some drawbacks of the backend_union branches:

If you have any ideas for improving backend_union or backend_union_2, I'm open to suggestions. backend_union_3 and 4 just tried to remove Backend::results and pass a temporary variable into rounds, but the performance did not change. Otherwise, I've got a proposal:

Proposal

Mayhaps we could use pointers, and it might be better to have an unsafe write_ks_blocks(&mut self, dest_ptr: *mut u8, num_blocks: usize, results_buffer: &mut Self::Results). Another improvement to write_ks_blocks() would be if it were capable of generating more than 4 blocks using a while loop.

With a results_buffer parameter, the backends could reuse the same results_buffer and call .zeroize() on it when the methods are finished with it. In each SIMD backend, gen_ks_block and gen_par_ks_blocks both currently fill the same type of buffer.

cipher might benefit from taking advantage of this and changing a little. Also, I've taken a peek at inout_buf, and there's a slight chance that... just maybe... inner could pass a null_ptr to write_ks_blocks(), and when write_ks_blocks() receives a null_ptr, it could simply overwrite results_buffer instead of copying it to the pointer. Then cipher could use the pointer to results_buffer to xor it with the data. I don't know if inout_buf would work with this... but it would be kinda cool if it did. (EDIT): darn... I forgot that the avx2 implementation doesn't store the blocks sequentially. So that would not work.

The rng_inner() method could look something like this:

pub(crate) unsafe fn rng_inner<R>(state: &mut [u32; STATE_WORDS], dest_ptr: *mut u8, num_blocks: usize)
where
    R: Rounds,
{
    let mut backend = Backend::<R>::new(state);
    // N = number of 4-register result sets; replace with some generic buffer initialization?
    let mut results: [[__m256i; 4]; N] = [[_mm256_setzero_si256(); 4]; N];

    backend.write_ks_blocks(dest_ptr, num_blocks, &mut results);
    // replace this with a generic method?
    state[12] = _mm256_extract_epi32(backend.ctr[0], 0) as u32;

    #[cfg(feature = "zeroize")]
    {
        backend.zeroize();
        results.zeroize();
    }
}

I feel like this could be okay. I could go ahead and bench this with zeroize, but I have a feeling the performance won't be as bad as backend_union_2/3/4. I'll also see about updating /benches for those branches if you want to be able to compare them quantitatively.

nstilt1 commented 8 months ago

I benched a new branch (zeroize_simd) that essentially goes with the first option, just zeroizing after generation, and the fill_bytes() performance for avx2 ranged from 1.01 to 0.99 cpb, which beats the 3 neater backend_union_X branches. It does not beat the nasty backend_union branch. That method might look a little better with recursion. Will see what I can do.

nstilt1 commented 8 months ago

Had a bunch of benchmarks here, but TL;DR: option 2 is more desirable now that it is working.

nstilt1 commented 8 months ago

Alright. Sorry for my complaining. I just didn't like that the first attempt at Option 2 resulted in Cipher's 1.6 cpb performance, even though it used pretty much the exact same code as before.

I've been working on backend_union_update_state a little more. I was able to fix the RNG's get_word_pos() issues, but the Cipher is still failing seek tests. Will hopefully have it fixed by tomorrow.

newpavlov commented 8 months ago

I wrote about it previously in different issues, but I think that in the case of "flat" types (i.e. types which do not reference "outside" memory) we can use the following implementation:

impl Drop for Foo {
    fn drop(&mut self) {
        let n = core::mem::size_of::<Self>();
        unsafe {
            // cast to *mut u8 so `n` counts bytes rather than `Self`-sized elements
            core::ptr::write_bytes(self as *mut Self as *mut u8, 0, n);
            // blackbox `asm!` (or a compiler fence) so the writes are not optimized out
        }
    }
}

nstilt1 commented 8 months ago

Would that be suitable for a ChaChaCore struct that contains a union? I've added the ZeroizeOnDrop code, but judging by the looks of your suggested implementation... it would be a lot less code than having to determine which part of the union is being used.

And what of the ManuallyDrops in the union? Is it necessary to call ManuallyDrop::drop() on the union field that is in use? Or maybe even just calling it on any field since they should occupy the same memory?

newpavlov commented 8 months ago

ManuallyDrops are required by the current implementation of unions. IIRC we do not have any actual Drop impls on the variants.

nstilt1 commented 7 months ago

Your ZeroizeOnDrop implementation seems to be far superior to a regular implementation of ZeroizeOnDrop. I've gone ahead and cleaned up my working branch a little and got it to pass some tests, but I'm not sure if I would be able to add that ZeroizeOnDrop impl on my own. I could either try to merge that branch with #333, or I could make a separate PR.

nstilt1 commented 3 months ago

Even though I made code for this issue, it seems that it would be wasted effort to rework the backends, given that typical constructor methods produce a stack-allocated structure, such as any constructor that ends with:

Self {
...
}

With constructors returning that, running let mut private_struct = Box::new(SomeCryptoStruct::new(...)) would likely result in a stack-allocated structure being copied/moved onto the heap, rather than being constructed directly on the heap... meaning it might be pointless to make ChaChaCore own its temporary buffers in an attempt to be OCD about zeroizing data.

There is a way to ensure that all allocated data stays on the heap, using a type and macro such as:

#[cfg(feature = "alloc")]
type CfgBoxed<T> = Box<T>; // assumes `use alloc::boxed::Box` is in scope
#[cfg(not(feature = "alloc"))]
type CfgBoxed<T> = T;

/// Creates a new instance of a data structure that is conditionally on the heap, based on whether the `alloc` feature is enabled.
#[macro_export]
#[cfg(feature = "alloc")]
macro_rules! cfg_new_boxed {
    ($data:expr) => {
        // assumes `Box` is re-exported at the crate root
        $crate::Box::new($data)
    };
}

/// Creates a new instance of a data structure that is conditionally on the heap, based on whether the `alloc` feature is enabled.
#[macro_export]
#[cfg(not(feature = "alloc"))]
macro_rules! cfg_new_boxed {
    ($data:expr) => {
        $data
    };
}

// and then use it in a constructor like so
pub struct Test {
    a: u64,
    b: [u32; 16],
}

impl Test {
    pub fn new(value: &u64) -> CfgBoxed<Self> {
        // allocate first, then write the value, so the secret is never
        // fully assembled on the stack
        let mut result = cfg_new_boxed!(Self { a: 0, b: [0u32; 16] });
        result.a = *value;
        result
    }
}

While this could work, every crypto crate would "need" to implement these types of constructor methods... but it would kind of be a waste of time because a) literally every crypto crate would "need" to do something like this and b) a better solution would be proper stack bleaching, which would also perform better. It would be especially better if it were supported natively by LLVM and Rust, as that old RFC suggests.

Part of the reason I wanted to consider this route is that the eraser crate, as I understand it, runs functions on the heap. This route would probably be a little better than running functions on the heap, aside from the sheer number of crates that would "need" to be modified.

I'm fine if we close this issue—not all code is meant to make it to production. But if y'all somehow would still like code from the old branch I can see about working it into a new branch based on the current repo.

tarcieri commented 3 months ago

The most straightforward way to impl it would be to add zeroize-gated Drop impls on any relevant structs in chacha20::backends::* which take care of clearing out the intermediate state.

That wouldn't wipe all of the state that's left over on the stack, but that's not something we generally do for any of our cryptographic implementations.

a better solution would be proper stack bleaching

Yep

newpavlov commented 3 months ago

As I wrote above, I think a better solution would be to use the zeroize_flat_type function. Unfortunately, it was released in v1.8.0, which got yanked because of unrelated changes. Maybe we should release v1.7.1 with it?

tarcieri commented 3 months ago

Yeah, I've been meaning to redo the zeroize release with an optional simd feature which avoids the MSRV jump. Hopefully this weekend.

tarcieri commented 3 months ago

zeroize v1.8.1 is out with zeroize_flat_type: https://docs.rs/zeroize/1.8.1/zeroize/fn.zeroize_flat_type.html

newpavlov commented 3 months ago

I started to implement this, but in the process I reconsidered it and now agree with the @nstilt1 comment above.

The backends live only on the stack and are no different from any other data spilled onto the stack. As we discussed in the traits issue, we do not provide any guarantees for spilled data (though we try to minimize the amount of spillage where possible), and zeroization (especially with its current implementation in zeroize) can negatively impact performance, not only by doing unnecessary writes but also by inhibiting optimizations.

So I think we can close this issue as "not planned".