nstilt1 closed this 4 months ago
Alright, I think I've added just about all I can think of. The `#[cfg_attr...` in lib.rs and some impls in rng.rs are there because rand/src/std.rs has a

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct StdRng(ChaCha12Rng);
```

There are also some functions in rng.rs that don't use `#[inline]`. I'm not sure if they need it or not. Do you have any questions, suggestions, or feedback? Thanks for your time.
I'd suggest less unsafe code and more of a KISS implementation in an initial PR, and then circling back on some of it in a followup so it can be considered separately.
Alright, switching the safety on.

I was able to use generics to ensure that the method inputs impl'd `Into<{WrapperName}>`, and then called `.into()` internally. Now the inputs should zeroize on drop when the `zeroize` feature is enabled, and end users will not need to use `.into()` on the input parameters. I was also able to remove all of the `*_bytes()` methods I had implemented and added impls for `From<[u8; x]>`.
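The wrapper approach described above can be sketched roughly like this (illustrative only; `KeyWrapper`, `wipe`-style volatile clearing, and `set_key` are my stand-in names, with the volatile loop standing in for `zeroize::Zeroize` since the real impls live behind the `zeroize` feature):

```rust
// A wrapper that scrubs its contents on drop, plus the From impl that lets
// callers pass a plain array without writing `.into()` themselves.
struct KeyWrapper([u8; 32]);

impl From<[u8; 32]> for KeyWrapper {
    fn from(bytes: [u8; 32]) -> Self {
        KeyWrapper(bytes)
    }
}

impl Drop for KeyWrapper {
    fn drop(&mut self) {
        // volatile writes stand in for zeroize so the clear isn't optimized out
        for b in self.0.iter_mut() {
            unsafe { core::ptr::write_volatile(b, 0) };
        }
    }
}

// A method generic over `Into<KeyWrapper>` accepts a `[u8; 32]` directly,
// and the received copy is scrubbed when the wrapper drops.
fn set_key<K: Into<KeyWrapper>>(key: K) -> u8 {
    let key: KeyWrapper = key.into();
    key.0[0]
}
```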
The only final issue I ran into was finding a way to convert a `[u8; 12]` to a `[u8; 16]` (or a `u128` to a `[u8; 12]`) without producing an un-zeroizeable copy, so I added an impl for `From<&mut [u8; 12]>`. It could alternatively be done with an additional wrapper for `[u8; 12]`. Realistically, this impl is unnecessary, but I was using the `set_stream_bytes()` method in a test and it was somewhat more convenient for the test to use a `[u8; 12]`. The `From<&mut [u8; 12]>` impl is both unnecessary and less efficient, and could be removed if you want.

Are there any additional changes that need to be made?
I think that is all I want to change. There is some stuff in rng.rs, particularly the zeroizing structs, that looks a little bit... polluting. I could move them into a different file if needed, or it might be possible to use a generic struct or two. I probably won't be making any more changes until I receive feedback... although I might add a test to make sure that the inputs actually get zeroized on drop. I'm also not sure if the methods with alternate inputs really need a comment about the performance; the difference is probably minimal, and it won't even affect the rate of the output.

Thanks for your time, for this wickedly fast ChaCha20 cipher implementation, and to the other `rand` and `c2-chacha` authors and contributors.
I was able to fix the zeroize test, but the user would need to pass an `&mut [u8; 32]` (or worse, `(&mut [u8; 32]).into()`) for the original to be zeroized. It appears that we can only zeroize the copies we receive :/, but I feel like the way it is called now (which is the way it was before) is acceptable. E.g.:

```rust
let mut seed: [u8; 32] = /* arbitrary value */;
let mut rng = ChaCha20Rng::from_seed(seed);
seed.zeroize();
```
Some questions that may need consideration:

- Do we want to keep `rand_core` in lib.rs? `rand_core` was the only dependency for the random number generator (https://github.com/rust-random/rand/issues/872#issuecomment-575242181). Do we still want to reduce the dependencies for the RNG and/or bypass the `write_keystream_blocks()` method used in `fn generate()`, since the underlying code is almost entirely within `chacha20`?
- Do you have any feedback on the `AlteredState` trait, the `from_seed` method that I modified from `new()` in lib.rs, or some of the zeroizing code?
- Is there a better way to impl `ZeroizeOnDrop` for inputs rather than with wrappers and duplicated spaghetti code to determine if zeroize is enabled? It would be somewhat convenient if it could be done with something like this, within the method:
```rust
impl SeedableRng for $ChaChaXRng {
    type Seed = [u8; 32];

    #[inline]
    fn from_seed(seed: Self::Seed) -> Self {
        #[cfg(feature = "zeroize")]
        seed.zeroize_on_drop();
        let core = $ChaChaXCore::from_seed(seed);
        Self {
            rng: BlockRng::new(core),
        }
    }
}
```
It would just be a little neat if it could be done with a method call that made no wrappers, no copies, didn't change the value's type, and didn't require the variable to be mutable. It seems kind of impossible without some kind of change to Rust though. It might be easier to just make `SeedableRng` allow a `Seed` of type `&[u8; 32]` rather than doing a bunch of under-the-hood zeroizing to make sure that it zeroizes.
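To make the limitation concrete: `from_seed` receives the seed by value, so it can only scrub the copy it was handed, never the caller's original. A minimal sketch (my names; the volatile loop stands in for `zeroize::Zeroize::zeroize`):

```rust
// Stand-in for zeroize: volatile writes so the compiler can't elide the clear.
fn wipe(buf: &mut [u8]) {
    for b in buf.iter_mut() {
        unsafe { core::ptr::write_volatile(b, 0) };
    }
}

struct DemoRng {
    state: [u8; 32],
}

impl DemoRng {
    fn from_seed(mut seed: [u8; 32]) -> Self {
        let rng = DemoRng { state: seed }; // state initialized from the local copy
        wipe(&mut seed); // scrubs only this local copy; the caller's array is untouched
        rng
    }
}
```

The caller's own array still has to be zeroized at the call site, which is exactly the `seed.zeroize()` pattern shown above.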
I'll go ahead and supply the rationale for `AlteredState`. If we were to use any `unsafe` code to interpret a `[u8]` as a `[u32]`, it could mostly (if not completely, depending on the implementation) be contained within `AlteredState`. That is pretty much the only benefit, and it honestly doesn't have much of a purpose when there is only safe code. Since the methods of `AlteredState` are only used once and there is only safe code, it wouldn't hurt to just copy over the methods to where they are called.
@nstilt1 you seem to have checked in a `.DS_Store` file.

(You can configure an `excludesfile` in ~/.gitconfig to avoid this in the future)
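For reference, a global excludes setup looks roughly like this (the `~/.gitignore_global` path is just a conventional example name, not something the repo mandates):

```ini
# ~/.gitconfig
[core]
    excludesfile = ~/.gitignore_global
```

with `.DS_Store` listed on its own line in `~/.gitignore_global`. Running `git config --global core.excludesfile ~/.gitignore_global` writes the same entry.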
Now that we have the `fill_bytes` method at our fingertips, I will look at `.apply_keystream()` and see if I can reduce some of the overhead that `fill_bytes` has.
I am working on it in a separate branch. Will write some tests using the output of the current implementation.
Regarding the last commit, I noticed an "inconsistency" with the index used by `set_word_pos()` being 4 bits when the full index is 6 bits. I attempted to "fix" that, but then I realized that it was being done that way for alignment purposes. Will try to align it again with minimal changes.
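For readers following the 4-bit vs 6-bit distinction: one 64-byte ChaCha block holds 16 u32 words (a 4-bit intra-block index), while a 4-block buffer holds 64 words (a 6-bit index). A hypothetical illustration of splitting a word position either way (names and constants are mine, not the crate's):

```rust
const WORDS_PER_BLOCK: u128 = 16; // one 64-byte block = 16 u32 words -> 4-bit index
const WORDS_PER_BUFFER: u128 = 64; // four buffered blocks = 64 words -> 6-bit index

// Split a keystream word position into (block counter, 4-bit index).
fn split_block(word_pos: u128) -> (u128, u32) {
    (word_pos / WORDS_PER_BLOCK, (word_pos % WORDS_PER_BLOCK) as u32)
}

// Split the same position into (buffer counter, 6-bit index).
fn split_buffer(word_pos: u128) -> (u128, u32) {
    (word_pos / WORDS_PER_BUFFER, (word_pos % WORDS_PER_BUFFER) as u32)
}
```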
Something I probably should've mentioned much earlier, but it would be helpful if the implementation didn't use the `cipher` API so that `cipher` could be made an optional dependency (again, as it was in `chacha20` v0.8).

That would greatly reduce the number of crates needed for the `rand_core`/`rand_chacha`-replacement use case.
> Something I probably should've mentioned much earlier, but it would be helpful if the implementation didn't use the `cipher` API so that `cipher` could be made an optional dependency (again, as it was in `chacha20` v0.8). That would greatly reduce the number of crates needed for the `rand_core`/`rand_chacha`-replacement use case.
I can give that a try, and I'll take a gander at v0.8. I'll let you know if I run into any problems.
@tarcieri Do you want the code for the ChaCha cipher to be pretty much exactly the same as it was in v0.8.1, including the code in src/backend/ and src/backend.rs? I just don't want to un-fix anything that may have been fixed in the current version, and if you don't want it completely reverted, would we still need to add src/backends/autodetect.rs and to have the ChaCha `Core` struct in there?
Alternatively, the `process_with_backend()` method might be able to be converted to accept inputs that don't depend on `cipher`, or possibly to take no inputs and just be a generate method that calls `gen_par_ks_blocks()` where possible, though that might not be desirable if there are use cases where people wouldn't want the blocks generated in parallel. Also, as for the 32- vs 64-bit counter, it would be easier to get that functionality by reverting instead of reworking `process_with_backend()`, since the previous version already supports it.
In the meantime, I will be transforming `process_with_backend()` into a method called `generate()` that overwrites the 256-byte `ChaChaCore` buffer using the backend, then moving some things around based on v0.8.1 to separate the dependent code. Let me know if there is something else you would prefer.

Here's a link to v0.8.1 in case you don't have it handy: https://github.com/RustCrypto/stream-ciphers/blob/338c078d731692fba3b8256e45de2c3e334d46d8/chacha20/src/backend/autodetect.rs
Okay... so I was able to get the RNG to work using `generate()`, and I have 2 different versions in different branches. The `optional_cipher` branch fills a `[u32; 64]`, and the `unsafe_fill_bytes` branch passes a pointer to `generate()` so it can directly write the results to the input slices of `fill_bytes()`. Both of these have about the same performance at about 0.994-1.001 cpb, so either one could be used for the RNG. The primary benefit of the pointer is that it works on a `[u8; 256]` as well as a `[u32; 64]`, and it works on a `&[u8]`. And on paper, it should be faster for `fill_bytes()`... but it doesn't appear to be.
The drawback of both of these generate methods is that without the `process_with_backend()` method, all of the `StreamCipher`, `StreamCipherCoreWrapper`, `StreamCipherSeek`, etc. methods would need to be reimplemented using a macro, and the original performance of the cipher on avx2 was 0.93-0.96 cpb, while the current performance is about 1.01-1.2 cpb.
I think a valid solution for the cipher's performance would be to conditionally compile the `process_with_backend()` and `generate()` methods, along with the backend methods they use, so that it will hopefully be the same as before. Will give it a try, but it might look a bit messier than just having one method for both implementations to generate values. Will attempt to use macros so that each method is only written once in src/backends/.

Edit: It does not seem to be possible to write a macro that can make a function with varying arguments as well as a varying block of code in the function. Will just conditionally compile the `inner` functions and their associated methods for now.
Besides the currently failing checks, the main issues remaining are:

- ~~`StreamCipherCore`'s `remaining_blocks()` method returns an `Option<usize>`. `usize` is 4 bytes on 32-bit machines, and with a 64-bit counter this method could fail to produce an accurate value, or it could throw an error, or maybe it just won't compile. I don't knowingly use a 32-bit computer, so I don't know exactly what would happen.~~ `stream_core.rs` simply states to return `None` when the value is too large to fit in a `usize`.
- lib.rs, regarding the cipher-dependent code: I think `generate` and `process_with_backend` should probably remain in lib.rs due to the use of the cpuid types. I'm not entirely sure how to use those outside of lib.rs, where they seem to be defined. The Rounds structs/trait could also be moved into their own file like they were previously.
- ~~Something that could be done regarding alignment on 64-bit machines: a `[u32; 64]` buffer for 32-bit machines and a `[u64; 32]` buffer for 64-bit machines for the RNG. I don't know if that would help with performance any, especially if we were to use a pointer for `generate()`. But I feel like this might require minimal changes. I can take a look once all of the aforementioned issues are addressed or accepted.~~
- ~~`fill_bytes()` skips the buffer for most of the fill.~~
- ~~Something that could be interesting (although not necessary) would be a way for users to define their own ChaCha variants using exported macros, such as ChaCha22Rng, or perhaps using combinations with Legacy and IETF in the macro call. It wouldn't take much, although there would need to be a unique name input for the Rounds struct when it is defined, since it does not appear to be possible to concatenate names for structs in a macro. We could also gate the macros behind a feature.~~
> XChaCha since it uses an 8-byte Nonce internally

XChaCha is based on the IETF construction internally, unless you're referring to a different XChaCha.
> Something that could be interesting (although not necessary), would be a way for users to define their own ChaCha variants using exported macros, such as ChaCha22Rng

There are already far too many variants of ChaCha. We don't need any more.

ChaCha20 already used too much diffusion and is needlessly slow because of it. It would've been better standardized as ChaCha12. We certainly don't need any variants that run slower while providing no additional practical security value.
> XChaCha since it uses an 8-byte Nonce internally
>
> XChaCha is based on the IETF construction internally, unless you're referring to a different XChaCha.
In variants.rs, I have a `Variant` trait that specifies the starting index of the nonce that is fed into `ChaChaCore<R, V>::state: [u32; 16]`. Currently, the starting index of the nonce is 14 for `XChaChaVariant`, while `Ietf` uses 13 as the start index. `XChaCha` passes the original tests with the start index at 14 and fails tests at index 13, while the `Ietf` variant passes tests using 13 as the start index. There could be something wrong with the tests, and I also noticed that one of the test vector links for XChaCha returns a 404 and is not indexed by the Wayback Machine. Some further evidence involves the XChaCha initialization code, which uses `XChaChaNonce[16..]` of a 24-byte array for the nonce that it passes to `ChaChaCore<R, V>`.
https://github.com/RustCrypto/stream-ciphers/blob/3a53d4172a4aa6cf05cfbe455ea851d8072f05d9/chacha20/tests/mod.rs#L97C9-L97C87 The direct link in question: https://tools.ietf.org/id/draft-arciszewski-xchacha-03.html#rfc.appendix.A.3.2
I just think that if XChaCha were to use the IETF construction, it would pass a 12-byte nonce to `ChaChaCore<R, V>::state[13..16]` instead of an 8-byte nonce to `...[14..16]`. This isn't exactly a problem, but it is a minor discrepancy.
I will double-check this other link for the paper, but I briefly looked at it and it does specify that it uses the IETF variant with a 96-bit nonce rather than the 64-bit nonce. It should have some test vectors to compare with. https://datatracker.ietf.org/doc/html/draft-arciszewski-xchacha-03
Here's a working link:
https://datatracker.ietf.org/doc/html/draft-arciszewski-xchacha-03
Note that what we're building upon uses the IETF's ChaCha20 (96-bit nonce), not Bernstein's ChaCha20 (64-bit nonce).
I found the documentation that causes this. I'm sure you were already aware of this lol, but I'm not entirely sure why it doesn't use the full 12 bytes by extending the XChaCha nonce to 28 bytes (or something). Perhaps it came from XSalsa, and maybe to be compatible with a 64-bit counter... but I don't know why it uses a "96-bit nonce" if 32 of those bits are 0s. I don't know. It seems to be fairly secure the way it is, but those bits could be prime bits of real estate for extending the nonce just a little bit more.
> [2. AEAD_XChaCha20_Poly1305](https://datatracker.ietf.org/doc/html/draft-arciszewski-xchacha-03#section-2)
>
> XChaCha20-Poly1305 is a variant of the ChaCha20-Poly1305 AEAD construction as defined in [RFC7539](https://datatracker.ietf.org/doc/html/rfc7539) that uses a 192-bit nonce instead of a 96-bit nonce.
>
> The algorithm for XChaCha20-Poly1305 is as follows:
>
> 1. Calculate a subkey from the first 16 bytes of the nonce and the key, using HChaCha20 ([Section 2.2](https://datatracker.ietf.org/doc/html/draft-arciszewski-xchacha-03#section-2.2)).
> 2. Use the subkey and remaining 8 bytes of the nonce (prefixed with 4 NUL bytes) with AEAD_CHACHA20_POLY1305 from [RFC7539](https://datatracker.ietf.org/doc/html/rfc7539) as normal. The definition for XChaCha20 is given in [Section 2.3](https://datatracker.ietf.org/doc/html/draft-arciszewski-xchacha-03#section-2.3).
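The quoted nonce handling can be sketched as a plain byte-splitting step (illustrative only; `split_xnonce` is my name, and the HChaCha20 call that consumes the first half is not shown):

```rust
// Split a 24-byte XChaCha nonce per the draft: the first 16 bytes feed
// HChaCha20 (not shown), and the last 8 bytes become an IETF 96-bit nonce
// with a 4-NUL-byte prefix.
fn split_xnonce(xnonce: &[u8; 24]) -> ([u8; 16], [u8; 12]) {
    let mut hchacha_input = [0u8; 16];
    hchacha_input.copy_from_slice(&xnonce[..16]);

    let mut ietf_nonce = [0u8; 12]; // first 4 bytes stay zero (the NUL prefix)
    ietf_nonce[4..].copy_from_slice(&xnonce[16..]);
    (hchacha_input, ietf_nonce)
}
```

This is why only 64 of the 96 IETF nonce bits carry XChaCha nonce material: the remaining 32 bits are the fixed NUL prefix.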
I'll see about getting the legacy counter back up to 64 bits, and I'll make a pull request for `cipher` afterwards to change the `remaining_blocks()` return type to an `Option<u64>`, if that is acceptable. I don't know if any stream ciphers could have more than 2^64 remaining blocks, but that is the current 64-bit limit of `usize`. There are 131 reverse dependencies for `cipher`, so if changing that return type isn't acceptable, it might also be possible to conditionally compile the counter type based on a 32-bit or 64-bit `target_pointer_width`. Having a 32-bit counter on 32-bit systems would still behave differently though, which would probably be unacceptable if we wanted to offer consistency/compatibility across different platforms.
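For context on the `usize` concern: a 64-bit remaining-block count can still be reported through an `Option<usize>` by returning `None` whenever the value doesn't fit the platform's pointer width, which is the behavior `stream_core.rs` documents. A minimal sketch (the function name is mine):

```rust
// Convert a 64-bit remaining-block count to Option<usize>.
// On 64-bit targets this always succeeds; on 32-bit targets, values
// above u32::MAX map to None instead of silently truncating.
fn remaining_blocks(count: u64) -> Option<usize> {
    usize::try_from(count).ok()
}
```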
Regarding the last commit... I wasn't trying to be salty or anything... I'm a little OCD and my code was technically incorrect. I don't mind which version of XChaChaVariant that the library uses, but if my own code (with the nonce start index at 14) confused me, it could confuse the next new contributor. I'm okay with either version, but I suppose a comment could have sufficed.
Also, I recant the 32-bit `usize` issue. I didn't see the comment in `stream_core.rs` about it.
I've begun working on the counter in my `counter_support` branch, and I'm having some issues. I don't have much experience with SIMD, but I have absolutely zero experience with NEON, and I don't see any tooltips in VS Code for it. If I'm still stuck on this after another week or so, then I might need some help. Or if the 64-bit counter isn't a priority, that is okay with me as well.
It would probably be good to split the changes to bring back the `rand_core` feature and the changes to support a 64-bit counter into separate PRs.
(I should also note the 64-bit counter is something of a niche use case where it would be nice to avoid undue complexity elsewhere)
Okay. Some of the checks that are failing seem to have started failing after I split the library into the various features, and I don't know what to do about them, or whether the current features are the ones that you want present. I am unsure how to proceed with those failing checks, particularly `cargo build --target thumbv7em-none-eabi`.

Also, I've still got to replace the `impl_zeroize_on_drop` and `impl_zeroize_from` macros from rng.rs, since they are only used once.
Looks like a doctest failure. Surprised `minimal-versions` is the only thing catching it.
That last "fix" got it to pass on my end, but should those docs just go in chacha.rs? I am unfamiliar with the inner workings of Rust docs.
3f82000 is fine though we usually use a slightly different pattern for that: https://github.com/RustCrypto/formats/blob/961e756/pem-rfc7468/src/lib.rs#L23-L24
I believe it is mostly ready. Some possible things that are missing:

- `pub type ChaCha20LegacyCore = ChaChaCore<R20, Legacy>;`
- `XChaCha8Core`, `XChaCha12Core`, `XChaCha20Core`. Or we could expose the Rounds structs as well as the variants.rs structs.

`criterion-cycles-per-byte` is supposed to work on aarch64 with Linux, so I can boot up my Raspberry Pi 4 and try to bench it later today. Since the benchmarks will be different on aarch64-linux and aarch64-macOS, I'll see about putting them in:

```
benches/
  bench_results/
    aarch64-linux/
    aarch64-macOS/
    x86_64-linux/
```

Let me know if any other changes need to be made.
That should be all of the instructions for getting cpb counts on a Raspberry Pi 4b.
I still don't know why `fill_bytes()` is slower than `apply_keystream()` on x86_64. It uses basically the same underlying code, where `apply_keystream()` does some extra steps. I've tried using a `[u32; 64]` instead of a `*mut u8`, but the performance was roughly the same at about 0.99 cpb. I also benched its branch again just now to be sure, and it currently runs at 1.05 cpb.
I've just removed the duplicated `gen_par_ks_blocks()` in avx2.rs/neon.rs/sse2.rs so that those methods now just pass a pointer to the RNG's pointer version. Somehow, it was a hair faster on macOS. I figure it would be easier to maintain if there was as little duplicated code as possible. The only remaining duplicated code in those files would be the `inner` and `rng_inner` methods, but they seem small enough to be okay.
Classic big-endian vs little-endian error :(

Not sure how the RNG tests passed, unless they need to be in chacha20/tests.
There is now minimal duplicated code in the individual backend files. One thing I noticed regarding `generate()`, though, is that none of the pointer methods in the backends guarantees that it is writing to a destination with 256 bytes to spare; it is up to the programmer to determine that there are 256 bytes to spare before calling the method. The soft.rs function's inner code was also completely within an unsafe block. My questions regarding `generate()`:

1. Should soft.rs's `rng_gen_ks_blocks()` and `write_ks_block` be `unsafe fn`s?
2. Should `generate()` be an `unsafe fn` if all of the branches were to lead to `unsafe` function calls?
3. Is the first assert necessary?

```rust
assert!(!dest_ptr.is_null());
```

This is my first time using pointers in Rust, but if the pointers are all created internally without any code that explicitly creates a null pointer, then it might not be possible for a null pointer to get passed to `generate()`. ChatGPT suggested it, but I've noticed that ChatGPT consistently suggests unnecessary and unoptimized code.
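One common pattern for this situation (a sketch, not the PR's code; the function name and the placeholder byte pattern are mine) is to make the pointer-taking function an `unsafe fn`, document the caller's obligations in a `# Safety` section, and downgrade internal sanity checks to `debug_assert!` so they cost nothing in release builds:

```rust
/// Writes 256 bytes of (placeholder) output to `dest_ptr`.
///
/// # Safety
/// `dest_ptr` must be non-null, valid for writes of at least 256 bytes,
/// and the region must not overlap any live reference.
unsafe fn generate_into(dest_ptr: *mut u8) {
    debug_assert!(!dest_ptr.is_null()); // checked in debug builds, free in release
    for i in 0..256 {
        // placeholder byte pattern standing in for real keystream output
        dest_ptr.add(i).write(i as u8);
    }
}
```

Usage: `unsafe { generate_into(buf.as_mut_ptr()) }` on a `[u8; 256]`. Since the contract is in the safety docs rather than a runtime check, internally-constructed pointers pay no cost for a guarantee they already uphold.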
I suppose I could mark it unsafe and add safety notes similar to `cipher`'s safety notes. Will update the comment in variants.rs as well, but the variants.rs file will be more important once the 64-bit counter has been added. After that, I don't think I have anything left to change until I receive feedback. Feel free to take it from here; just trying to minimize the remaining work to be done.
hmm... looks like commits with backticks omit the text between them. I removed a nasty-looking bit op that discarded the last 8 bits, since I literally bit-shifted right by 8 immediately afterwards. But... I mean, bit operations aren't exactly necessary here; it just so happens that every number in this library is a power of 2... it's kind of tempting.
> looks like commits with backticks omit the text between them

That's likely your shell evaluating the contents of the backticks
I've seen this mentioned in multiple issues: https://github.com/rust-random/rand/issues/1261#issuecomment-1341891257

> The current design of BlockRngCore (and the proposed generic_const_exprs-based version) is not ideal. One issue for which I don't have a good solution is variability of block size dependent on runtime target feature detection. For example, ChaCha has different optimal block sizes dependent on whether hardware has SIMD extensions. Right now we simply use the largest supported block size, which is a bit suboptimal.
I saw some previous versions of this library with each backend having its own Core. An alternative solution that would (almost) be a cover-all would be to optimize `fill_bytes()` a little bit further by using a `[u32; 16]` buffer to minimize how much needs to be written to the buffer, how much needs to be copied from the buffer at the beginning of `fill_bytes()`, and how much would need to be generated for the buffer with the tail. Then instead of

```rust
// Calculate how many 256-byte chunks are remaining to write to.
let num_chunks = (dest_len - dest_pos) >> 8;
unsafe {
    let mut chunk_ptr = dest.as_mut_ptr();
    chunk_ptr = chunk_ptr.add(dest_pos);
    for _chunk in 0..num_chunks {
        self.core.generate(chunk_ptr);
        chunk_ptr = chunk_ptr.add(256);
    }
}
```

we could delegate this piece of `fill_bytes()` to the backend:

```rust
// Calculate how many 64-byte blocks are remaining to write to.
let num_blocks = (dest_len - dest_pos) >> 6;
unsafe {
    let mut write_ptr = dest.as_mut_ptr();
    write_ptr = write_ptr.add(dest_pos);
    self.core.generate(write_ptr, num_blocks, self.buffer.0.as_mut_ptr() as *mut u8);
}
```
Then in the backend, it would need to determine how many times it needs to run to write `num_blocks` blocks to the pointer. For the backends that can generate 4 blocks in parallel, it seems that we'll need to be a little careful with the counter, since both `gen_ks_block()` and `gen_par_ks_blocks()` seem to generate 4 blocks.
Benefits of this approach:

- `fill_bytes()` would be a little bit more optimized for both SIMD and non-SIMD machines
- it could be benched in stream-ciphers/benches

The main downsides of this approach are:

- it only helps `fill_bytes()`, not `next_u32()`, etc.
- `gen_ks_block()` and `gen_par_ks_blocks()` seem to generate 4 blocks every time they run, making this slightly tricky

It could be possible to also pass a pointer to the buffer as well. Maybe a signature like:

```rust
/// If dest_ptr is null, or if num_blocks is 0, it could only fill the buffer
unsafe fn generate(&mut self, dest_ptr: *mut u8, num_blocks: usize, buffer_ptr: *mut u8);
```

There are definitely some logistics to work out, but it might work.
I suppose I could bring that issue up in `rand`. In the `small_buffer` branch I modified `write_par_ks_blocks()` to take `num_blocks` (where `num_blocks <= 4`) as an input, along with some macros that can extract either 1 or 2 blocks to minimize code duplication.

I think this might bring some undue complexity though. Will look into trying variable buffer sizes before I commit to that; that would have required some significant changes to backends.
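A sketch of the `num_blocks <= 4` idea (illustrative only; the real backends produce the batch with SIMD rather than this placeholder fill, and the function name mirrors but is not the crate's code):

```rust
const BLOCK_SIZE: usize = 64; // one ChaCha block in bytes
const PAR_BLOCKS: usize = 4; // blocks generated per parallel batch

// Stand-in for a parallel keystream batch: 4 blocks are always produced,
// but only `num_blocks` of them are copied out to `dest`.
fn write_par_ks_blocks(dest: &mut [u8], num_blocks: usize) {
    assert!(num_blocks <= PAR_BLOCKS);
    assert!(dest.len() >= num_blocks * BLOCK_SIZE);
    let mut batch = [0u8; PAR_BLOCKS * BLOCK_SIZE];
    for (i, b) in batch.iter_mut().enumerate() {
        *b = (i / BLOCK_SIZE) as u8; // placeholder: block n is filled with byte n
    }
    dest[..num_blocks * BLOCK_SIZE].copy_from_slice(&batch[..num_blocks * BLOCK_SIZE]);
}
```

The counter subtlety mentioned above is that the batch always advances the keystream by 4 blocks, so if only 1 or 2 are consumed, the counter must be rewound (or advanced by `num_blocks` instead of 4) to keep seeking consistent.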
With this improvement, `fill_bytes()` is about as fast as `apply_keystream()`, if not a little bit faster. I believe the performance boost for avx2 is because `backend::new()` is only called once when `rng_inner` is called, and while it is running, it writes `num_blocks` 64-byte blocks to the dest_ptr instead of just writing 4 blocks to the dest_ptr.
Sorry for breaking away from KISS... I just noticed that there was a lot of duplicate code, and I was only able to remove it once I realized how pointers can be used. Would you like me to switch to a commit before I switched to pointers, or is it okay to try to fix the remaining broken stuff with the RNG? The main remaining issue that I notice is to use `cpufeatures` for the `BUFFER_SIZE` instead of the cfg statements I added, and possibly using `cpufeatures` for `generate` in rng.rs, if it should go in rng.rs, but I don't know how `cpufeatures` needs to be handled.
I went ahead and put master on the `pointers` branch and speed-ran a KISS implementation branch that behaves itself so far. It might fail a few checks at first, but it is simple, and I included most of the changes that I have recently made, such as some variable name changes, comments, and the LegacyCore and XChaChaCore. I tend to get carried away with things.
Will try to resolve failing checks later tonight, but I've got a backup branch.
Seems like you have some tough challenges regarding being able to abstract between `cipher` and `rand_core`. It's also a large PR, which makes it hard to review.

To ensure the soundness of the unsafe code, it might be good to run some static analyzers on it, e.g. `cargo miri` and/or Rudra.
Running `cargo miri test --features cipher,legacy,xchacha,rand_core` with `--cfg chacha20_force_soft`:

```console
$ cargo miri test --features cipher,legacy,xchacha,rand_core
Preparing a sysroot for Miri (target: x86_64-unknown-linux-gnu)... done
warning: some crates are on edition 2021 which defaults to `resolver = "2"`, but virtual workspaces default to `resolver = "1"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
   Compiling libc v0.2.149
   Compiling getrandom v0.2.10
   Compiling rand_core v0.6.4
   Compiling rand_chacha v0.3.1
   Compiling chacha20 v0.9.1 (.../stream-ciphers-rng/chacha20)
    Finished test [optimized + debuginfo] target(s) in 0.89s
     Running unittests src/lib.rs (.../stream-ciphers-rng/target/miri/x86_64-unknown-linux-gnu/debug/deps/chacha20-40bb95b3512b497d)
warning: Miri does not support optimizations. If you have enabled optimizations by selecting a Cargo profile (such as --release) which changes other profile settings such as whether debug assertions and overflow checks are enabled, those settings are still applied.

running 16 tests
test rng::tests::test_chacha_clone_streams ... ok
test rng::tests::test_chacha_construction ... ok
test rng::tests::test_chacha_multiple_blocks ... ok
test rng::tests::test_chacha_nonce ... ok
test rng::tests::test_chacha_true_bytes ... ok
test rng::tests::test_chacha_true_values_a ... ok
test rng::tests::test_chacha_true_values_b ... ok
test rng::tests::test_chacha_true_values_c ... ok
test rng::tests::test_chacha_word_pos_wrap_exact ... ok
test rng::tests::test_chacha_word_pos_wrap_excess ... ok
test rng::tests::test_chacha_word_pos_zero ... ok
test rng::tests::test_rng_output ... ok
test rng::tests::test_set_and_get_equivalence ... ok
test rng::tests::test_set_word_pos ... ok
test rng::tests::test_wrapping_add ... ok
test xchacha::hchacha20_tests::test_vector ... ok

test result: ok. 16 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running tests/mod.rs (/stream-ciphers-rng/target/miri/x86_64-unknown-linux-gnu/debug/deps/mod-5e9658f96257a96c)
warning: Miri does not support optimizations. If you have enabled optimizations by selecting a Cargo profile (such as --release) which changes other profile settings such as whether debug assertions and overflow checks are enabled, those settings are still applied.

running 12 tests
test chacha20_core ... ok
test chacha20_seek ... ok
test chacha20legacy_seek ... ok
test chacha20test::chacha20_encryption ... ok
test chacha20test::chacha20_keystream ... ok
test legacy::chacha20_legacy_core ... ok
test legacy::chacha20_legacy_seek ... ok
test legacy::chacha20_offsets ... ignored
test xchacha20::xchacha20_encryption ... ok
test xchacha20::xchacha20_keystream ... ok
test xchacha20::xchacha20_seek ... ok
test xchacha20_seek ... ok

test result: ok. 11 passed; 0 failed; 1 ignored; 0 measured; 0 filtered out

   Doc-tests chacha20

running 4 tests
test chacha20/src/lib.rs - (line 44) ... ok
test chacha20/src/rng.rs - rng::ChaCha20Rng (line 542) ... ok
test chacha20/src/rng.rs - rng::ChaCha8Rng (line 538) ... ok
test chacha20/src/rng.rs - rng::ChaCha12Rng (line 540) ... ok

test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.11s
```
Running `cargo miri test --features cipher,legacy,xchacha,rand_core` with `--cfg chacha20_force_avx2 -C target-feature=+avx2` returns an error due to using avx2. Miri doesn't seem to support avx2.

Running `cargo miri test --features cipher,legacy,xchacha,rand_core` with `--cfg chacha20_force_sse2 -C target-feature=+sse2` returns an error due to using sse2. Miri doesn't seem to support sse2.
I am currently working on setting up Rudra. Shouldn't be much longer before I've got results.
Rudra results are in. Not sure if I need to run it with each rustflag, but they were all the same.
```console
chacha20$ cargo rudra --features cipher,legacy,rand_core,xchacha,zeroize
2023-12-23 17:04:13.084630 |INFO | [rudra-progress] Running cargo rudra
2023-12-23 17:04:13.390622 |INFO | [rudra-progress] Running rudra for target lib:chacha20
2023-12-23 17:04:14.158353 |INFO | [rudra-progress] Rudra started
2023-12-23 17:04:14.158504 |INFO | [rudra-progress] SendSyncVariance analysis started
2023-12-23 17:04:14.158519 |INFO | [rudra-progress] SendSyncVariance analysis finished
2023-12-23 17:04:14.158522 |INFO | [rudra-progress] UnsafeDataflow analysis started
2023-12-23 17:04:14.170619 |INFO | [rudra-progress] UnsafeDataflow analysis finished
2023-12-23 17:04:14.170650 |INFO | [rudra-progress] Rudra finished
2023-12-23 17:04:14.399805 |WARN | [cargo_rudra] Target test:mod is not supported
2023-12-23 17:04:14.399829 |WARN | [cargo_rudra] Target bench:mod is not supported
2023-12-23 17:04:14.399834 |INFO | [rudra-progress] cargo rudra finished
```
FYI, #338 bumps the crate to v0.10.0-pre, but also includes some other breaking API changes.
Rebasing on it shouldn't be too hard, I hope!
I've made `set_block_pos` and `get_block_pos` as per some feedback in a recent rand issue, #1369. I feel like those methods should be on the RNG core rather than the RNG, but if they were to live on the RNG core, then the RNG's index won't be changed or reset when the block_pos is set. That shouldn't be an issue as long as it is documented, but if the index should be reset, then the method might need to belong to the RNG.

Some questions I have:

1. Should the `$ChaChaXRngCore` be a wrapper around `ChaChaCore` to prevent `cipher` methods from being available to the end user?
2. Should the public `set_block_pos` and `get_block_pos` methods be on the RNG and reset the index, or on the core and not reset the index?
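The wrapper idea in question 1 can be sketched as a plain newtype (all names here are hypothetical, not the crate's): the wrapper re-exports only the RNG-facing methods, so cipher-side methods on the inner core are unreachable through it.

```rust
// Hypothetical inner core with both cipher- and RNG-facing methods.
struct ChaChaCore {
    block_pos: u32,
}

impl ChaChaCore {
    fn apply_keystream_stub(&self) {} // cipher-side method we want to hide
    fn block_pos(&self) -> u32 {
        self.block_pos
    }
}

// The RNG-facing newtype: only block-pos accessors are exposed, so
// `apply_keystream_stub` is not part of the wrapper's public surface.
struct ChaChaRngCore(ChaChaCore);

impl ChaChaRngCore {
    fn get_block_pos(&self) -> u32 {
        self.0.block_pos()
    }
    fn set_block_pos(&mut self, pos: u32) {
        self.0.block_pos = pos;
    }
}
```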
I've now got a working version where the backends seem to be safely zeroizing when the `ChaChaCore` is dropped, but it depends on the PR for `zeroize`.
@nstilt1 can you rebase? This is looking mostly good and I'd like to get it merged.
Sure thing. I apologize for the delay. I uh... I was mis-using ChaCha in several ways. I should have it rebased by tomorrow.
*block_pos. I moved set/get_block_pos to the RNG rather than its core, where it wasn't able to change the RNG's index after trying to change the block pos
This looks very close now, though I'm not sure what's up with `rust-toolchain.toml.save` or the seemingly unrelated and empty `salsa20/benches/mod.rs`.
I borrowed some code from version 0.8.1 of `chacha20`, as well as `rand_chacha`, and was able to get it to compile and pass tests. The main issues in my code that I identified were:

- a `[u8; 12]` instead of a `u128` for set/get_stream(). It used to have a `u64` when it had a 64-bit counter. `set_stream()` takes the lower 96 bits of a `u128` input. It now has support for using a `[u8; 12]` with `set/get_stream_bytes()`
- a `ParBlock<LesserBlock>` as a temporary output buffer prior to copying it into `results` in `fn generate()`. It will probably leave a copy of itself in memory after use. It might be possible to change the `BlockRngResults` type to `ParBlock<LesserBlock>`
- `AsRef<[u32]> for BlockRngResults` and `AsMut<[u32]> for BlockRngResults`
- calling `set_word_pos()` with an input that wasn't divisible by 16; `get_word_pos()` would round the input down to the nearest multiple of 16, as in `rand_chacha`.

I've also added `ZeroizeOnDrop` for the `BlockRngResults`.

https://github.com/rust-random/rand/issues/934