SeedableRng::Seed and from_hashable

dhardy commented 6 years ago

Edit: PR no longer shows the code thanks to our rebasing of Rand history, but the commit is still available here.

Implement seeding via hash function, as discussed in #18.

This is a very rough implementation.

To discuss, regarding the hash function:

SeaHash is exposed for general usage, simply because it can be (if we're going to do this, we must stabilise it, so there's no worry that it will be replaced). This is a little odd, but seems sensible on the whole.
We could hide SeaHash and from_hashable behind a feature gate (hash or seahash) until we're more comfortable with the idea — that allows stabilising rand-core without this feature, in theory. But hopefully we don't need to do this.
More AsBytesFixed impls are needed, e.g. for u64
I'm unsure that AsBytesFixed is the best way of going about this. I'm considering moving the hashing into the trait implementation with trait Hash<H> { fn hash(state: &mut H, val: Self); } or similar. This allows specialised implementations of SeaHash.
How general do we make our SeaHash impl? I mean, do we allow pushing multiple items to it? This is functionality we don't need but requires very little extra code and possibly a small performance hit, but it does make the SeaHash impl we expose significantly more useful to other users. (We could also consider whether from_hashable should be more general, e.g. consume a SeaHash struct.)

Regarding SeedableRng:

Making Seed an associated type is a compromise; SeedSize associated constant doesn't appear to work; we do however restrict Seed to types supported by Finalize
Making Seed a u64 array instead of u8 array is certainly more convenient for us; any reason we shouldn't do this?
The new SeedableRng::from_seed impls for Isaac aren't 100% backwards compatible (old version allowed much larger seeds).

Lots to discuss, but I think this is moving forward. (@ticki if you get the time, I'd love your thoughts.)

dhardy commented 6 years ago

I realised SeedableRng::Seed isn't so simple... on BE using a [u64; N] array requires conversion from byte sequences and to [u32; N] arrays. So options:

keep using u64 arrays, require byte-swapping when reading byte-seq keys and when converting to u32 arrays (e.g. for ChaCha)
allow use of u32 arrays where preferred by the RNG, byte swap on reading and hash-function finalisation
use u8 arrays, byte-swap on hash-function finalisation, on writing large-integer seeds, and on reading in RNGs

burdges commented 6 years ago

Afaik all CSPRNGs seed with [u8; N], presumably all employing algorithms entered into crypto competitions anyways. Also, CSPRNGs should never be seeded with less. If you want to seed a PRNG with less, then either you should use a faster non-cryptographic PRNG, or else you'll do key crazy slow stretching algorithms, like Argon2.

Are there actually situations where performance suggest feeding short seed to a non-cryptographic PRNG? Appears no: If you create few PRNGs then obviously you do not care about the seed length. If you create enough PRNGs that consuming CSPRNG output to seed them all takes too long then consuming even small amounts still takes ages, so you should spawn another PRNG with the desired speed and quality to seed your PRNGs. We support this use case by exposing the ReseedableRng machinery used by thred_rng(). I therefore think this whole from_hashable business sounds misguided.

dhardy commented 6 years ago

@burdges are you also saying that from_rng should only allow seeding from a CSPRNG (CryptoRng)? I'm tentatively for that change but it does mean thread_rng needs to be a CSPRNG, though if we go with #53 I guess that's ok.

Slightly confused because I don't talk about key stretching here. Unless you mean you don't like my modifications to SeaHash? I'm reasonably convinced they're fit for purpose; alternatively we need to switch to another hash function with 128 bit or more output.

burdges commented 6 years ago

No. It's true user-space CSPRNGs would usually be seeded with from_rng, but they might be seeded from another source, like secret from disk or a key exchange.

I'm saying: Afaik, non-cryptographic PRNG should usually be seeded with from_rng, not a single u64, because Rngs are designed to produce the larger quantity of output needed to fill their seed/state quickly.

dhardy commented 6 years ago

I'm still confused about what you mean.

On the first account, because seeding a CSPRNG from a known secret from disk or key exchange can use from_seed, so doesn't need to use from_rng. What I was saying is that from_rng could require that the source RNG is a CSPRNG.

On the second account, the point of from_hashable is that when a user wants to seed a non-crypto PRNG with a fixed seed, we want to make it easy for them to convert whatever seed the user may enter ("my seed", 1, "obayuany-ofu", or whatever other nonsense the user enters) into a "nice" seed with even bit distribution. So from_hashable will never be used when the source is another RNG.

dhardy commented 6 years ago

I got a fair bit of work done tidying this up. Not convinced whether we should use SeaHash but it's mostly okay.

I also ported Google's HighwayHash to Rust. It's much more complex than SeaHash and has some oddities. Still need to test against upstream output and optimise a lot; not sure at the moment whether I wish to do this.

burdges commented 6 years ago

I'm saying this: PRNGs tend towards large internal states. Hash functions tend to be optimized for small outputs. PRNGs are optimized for larger outputs. You should expect seeding using a hash function to be slower than seeding using a PRNG. If from_hashable does not exist for either performance or quality then should we conclude that it's not meant for production code?

dhardy commented 6 years ago

Ah. How then would you propose we condense arbitrary-length user input, potentially with a lot of bias (e.g. text), into a reasonably short key with low bias and good bit propagation (avalanche)? Hash functions seem like a good solution to me, just so long as we pick one capable of at least 128 (preferably 256) bits of output.

We could do something like split the user input into state-sized blocks and XOR together, calling a PRNG state-transition function between each XOR. But that's just inventing our own (non-crypto) hash function.

dhardy commented 6 years ago

I found a way to move from_hashable out of rand-core, which I think is preferable for most people. It does unfortunately mean users have to use rand::FromHashable; now, but I guess we can live with that.

dhardy commented 6 years ago

Updated: rebased and squashed. I still have the old history locally, but don't see much value.

pitdicker commented 6 years ago

High time to get this merged :smile:. Moving from_hashable out of rand-core seems like a good idea!

How confident are you in the custom finalizers? And is there anything specific I should look at?

I would like it if the RNG tests would only depend on things in rand-core. Especially if they are to be split out into another crate. Does it make sense to add a test_stdng_construction test in rand instead?

dhardy commented 6 years ago

Actually, I think I will replace SeaHash with MetroHash; it's better known and reviewed and has similar performance, as well as native 128-bit output and 256-bits of state (IIRC), and a permutation function which should make it relatively easy to output any amount of state. But I didn't get that done yet.

pitdicker commented 6 years ago

If we go with 128-bit hashes, there is a lot more choice. MetroHash, CityHash, FarmHash, MurmurHash, SpookyHash, maybe more. I think that it is good to look at the performance, but that simplicity is more important. Maybe I am wrong, but it seems to me MetroHash, CityHash and FarmHash (and maybe xxHash) are part of a series of hashes that come and go. It would be nice if we could pick something that does not seem badly outdated over ~5 years. I have a little preference for Murmurhash3 there.

This repro contains a list of hashes with benchmarks: https://github.com/rurban/smhasher

128-bit MetroHash and MurmurHash3 seem about evenly matched for small inputs. I wonder what their performance on x86 is.

dhardy commented 6 years ago

That repo benchmarks hash functions by throughput on reasonably large data sizes. We don't care much about that. It also doesn't say much about security (other than some stuff about hash tables which I don't even think is correct).

I agree that simplicity is fairly important; it seems most hash functions innards aren't that complex, but the input/output functions can get complex to handle multiple input sizes, optimise special cases, and some other functionality.

Hmm, this conversation is split between two threads?

pitdicker commented 6 years ago

That repo benchmarks hash functions by throughput on reasonably large data sizes.

If you click on the name of the hash, there are also the results for hashing 1, 2, 3, 4, etc. bytes. And the last column is useful for us, because that shows if it is statistically ok. We don't really have security considerations here, right?

Hmm, this conversation is split between two threads?

Sorry :smile: Seemed to fit there. Link for others that may read this: https://github.com/dhardy/rand/issues/18#issuecomment-354618223

dhardy / rand

SeedableRng::Seed and from_hashable #62