Closed: @fare closed this issue 5 years ago.
If UUIDs are used at scale, or worse, in security-sensitive environments, it becomes important to have a cryptographically strong PRNG, which `Random.State` isn't.
What defines a cryptographically strong PRNG exactly? Maybe that's not strong enough, but note that `Random.State` passes the Diehard tests.
/cc @pqwy
(but if needed we could certainly abstract the random bits a bit further)
Sorry, the correct S word was secure, not strong.
@fare Which property do you expect this other RNG to provide?
I'd expect the RNG to create identifiers that are not only unique but unpredictable by an adversary who could otherwise deny service or worse by registering different content under the anticipated UUID, or create other kinds of havoc.
I also want to be able to consider the UUID as a blackbox and not have to worry whether the implementation does or doesn't have a common security property I may want later — "of course the library does it 100% right" and "no, you don't need to rewrite your own library" are the answers I'd love to hear from every library in the ecosystem I use.
I'll concur that using `Random` makes it possible for someone to observe a sequence of generated UUIDs and start predicting them, and that this no longer holds if you switch to a CSPRNG.
I have no idea if this is expected of UUIDs, though.
I also want to be able to consider the UUID as a blackbox [..]
You can. The question is what is the black box supposed to be doing.
There are many good reasons to use a (suitably seeded) CSPRNG when generating UUIDs: https://security.stackexchange.com/questions/120352/should-i-use-a-cryptograpically-secure-random-number-generator-when-i-generate-i
I believe `uuidm` should use some cryptographic library for its CSPRNG as well as for its MD5 and SHA1 support.
I can only see theoretical considerations on stack exchange, where people are rehashing what it means to use a CSPRNG. We established that.
(Note that on a full-fledged OS, using `Random.init` does not lead to collisions, because it uses the system RNG to seed, so UUIDs remain probably-unique.)
In my use, I would never rely on something like `uuidm` if I wanted cryptographically secure -- that is, strongly unpredictable -- objects. I would tap into an appropriate RNG directly, and treat the UUID part as formatting. But I might be misunderstanding the point of UUIDs, and it's a decision for @dbuenzli to make.
I can offer the fact that there is an appropriate OCaml library with both hashing and a CSPRNG around, and it's about to get much slimmer. But if `uuidm` is expected to run only on a full OS, it can avoid that dep by using the system RNG (either `/dev/urandom` or `getrandom(2)`, plus the Windows equivalent).
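For instance, on a Unix-like OS, pulling random bytes from the system RNG needs no extra dependency. A minimal sketch (the helper name is illustrative, not part of `uuidm`):

```ocaml
(* Hypothetical helper: read [n] bytes from the system RNG on a Unix-like
   OS. Not uuidm's actual code; just a sketch of the no-dependency option. *)
let urandom_bytes n =
  let ic = open_in_bin "/dev/urandom" in
  Fun.protect ~finally:(fun () -> close_in ic) @@ fun () ->
  Bytes.of_string (really_input_string ic n)
```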
Upon further reading of the API... it could definitely use an easier way to plug in other sources of randomness. Like `v4 : (unit -> int) -> t` or, better still, `v4 : (int -> bytes) -> t`.
I also want to be able to consider the UUID as a blackbox and not have to worry whether the implementation does or doesn't have a common security property I may want later — "of course the library does it 100% right"
The problem is that people tend to assume properties that may not hold because "hey, of course the library does it 100% right". In a context where there's no single true answer (e.g. some people only care about speed and unicity, others about sequence reproducibility, others may need to use a particular (CS)PRNG, etc.) it's better for a library to make you think about what you are doing rather than do 100% right what is wrong for you.
In any case this library is getting neither a system dependency nor a dependency on a cryptographic library, however slim -- and slick, I'm sure -- it might be. In this particular case it seems better to put things in the control of the client, which also leads to more explicit code if you are writing security-sensitive code.
Along the lines of what @pqwy suggested this could take the form of one or more of these sigs:

1. `val v4 : (unit -> int) -> t`
2. `val v4 : (int -> bytes) -> t`
3. `val v4 : bytes -> t`

The last one assumes 16 random bytes, copies them and overwrites the few bits that need to be. I would tend to go with 3., which puts the most freedom in the hands of the client on how to operate. This however assumes that overwriting 6 bits of the given 128 random bits doesn't bias anything (?). Otherwise we could take the first 122 bits of the given bytes and redistribute them (but that entails more bit fiddling).
The advantage of 2. is that it could save us an allocation; alternatively, in 3. we could skip the copy. All this depends a bit on what kind of interface we expect from CSPRNGs in the OCaml world (@pqwy ?).
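A minimal sketch of what option 3 could look like, assuming a UUID is represented as 16 bytes (the representation and names here are illustrative, not uuidm's actual internals):

```ocaml
(* Hypothetical sketch of option 3: the client supplies 16 random bytes,
   and the library copies them and overwrites the 6 version/variant bits
   required by RFC 4122. Not uuidm's actual API. *)
type t = bytes  (* 16 bytes *)

let v4 (random : bytes) : t =
  if Bytes.length random <> 16 then invalid_arg "v4: expected 16 bytes";
  let u = Bytes.copy random in
  (* Byte 6: high nibble becomes 0b0100 (version 4). *)
  Bytes.set_uint8 u 6 (0x40 lor (Bytes.get_uint8 u 6 land 0x0f));
  (* Byte 8: top two bits become 0b10 (RFC 4122 variant). *)
  Bytes.set_uint8 u 8 (0x80 lor (Bytes.get_uint8 u 8 land 0x3f));
  u
```

Only the 4 version bits of byte 6 and the 2 variant bits of byte 8 are overwritten; the remaining 122 bits pass through untouched.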
I will also make sure to comment on the current `v4` functions that they are unsuitable for use if the sequence of UUIDs needs to be unpredictable to an observer, which, if I understand correctly, is the only problem the current functions have.
Agreed that putting the client in control is a good design, especially if, for some reason, adding a dependency is bad.
NB: maybe for V3 and V5, the md5 or sha1 can also be provided by the client, then? No need for a slow implementation in OCaml and/or it can be provided as a separate library.
Or maybe there can be three distinct libraries:

- `uuidm-core` that includes no crypto; the user must provide their own.
- `uuidm-classic` that uses `uuidm-core` with the same quick-n-dirty-but-portable crypto as present.
- `uuidm-cryptokit` that uses `cryptokit` for its crypto.

NB: maybe for V3 and V5, the md5 or sha1 can also be provided by the client, then? No need for a slow implementation in OCaml and/or it can be provided as a separate library.
If it ain't broke don't fix it. For anything like this to happen you will have to come up with a concrete problematic use case and numbers.
This however assumes overwriting 6 bits of the given 128 random bits doesn't bias anything (?).
It does not bias the remaining 122 bits.
All this depends a bit on what kind of interface we expect from CSPRNG in the OCaml world (@pqwy ?).
Essentially `t -> int -> bytes` for some `t`, with caller-owned buffer. 2. and 3. are both ok.
I wouldn't think too hard about allocations of fixed-size strings. A 16-byte string is stored in 4 words, and blazingly fast to allocate on the minor heap.
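In module form, the shape described above might look like this (the signature and the toy instantiation are illustrative only, not an existing library's interface):

```ocaml
(* Sketch of the CSPRNG shape described above: a generator type plus a
   function returning a fresh, caller-owned buffer of n random bytes. *)
module type CSPRNG = sig
  type t
  val generate : t -> int -> bytes
end

(* Toy instantiation over Random.State, only to show how a v4 function
   could consume it; a real CSPRNG implementation would replace this. *)
module Insecure_rng : CSPRNG with type t = Random.State.t = struct
  type t = Random.State.t
  let generate st n =
    Bytes.init n (fun _ -> Char.chr (Random.State.int st 256))
end
```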
I will also make sure to comment on the current `v4` functions that they are unsuitable for use if the sequence of UUIDs needs to be unpredictable to an observer, which if I understand correctly is the only problem the current functions have.
In practical terms, if you observe 55 consecutive outputs of `Random`, you know the state. Each output `i` is fully determined by outputs `i - 31` and `i - 55`, 30 bits wide on all arches. `uuidm` takes 5 consecutive outputs, so 11 UUIDs fully refresh the state. Each UUID contains 24, 28, 30, 24 and 16 bits of its outputs. A group of 3 bytes in a UUID is therefore determined by similar byte groups in UUIDs 7 and 11 places back, up to approximately 18 bits of uncertainty (depending on which exact part of the UUID). So without any actual cleverness, you can recover 2 out of 55 outputs by doing a brute-force of about 2^18. Repeat that 28 times, and you can start exactly predicting the UUIDs.
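The core of the attack can be illustrated with a toy additive lagged-Fibonacci generator using the same lags (a simplification, not `Random`'s exact algorithm): once 55 consecutive outputs are observed, every later output follows deterministically.

```ocaml
let mask = (1 lsl 30) - 1

(* Toy generator with x_i = x_(i-31) + x_(i-55) mod 2^30, over a circular
   history of 55 entries. The attacker never sees the seed or the state. *)
let make_gen seed =
  let hist =
    Array.init 55 (fun i -> (seed * 2654435761 + i * 40503) land mask) in
  let pos = ref 0 in
  fun () ->
    let i = !pos in
    let x = (hist.((i + 24) mod 55) + hist.(i mod 55)) land mask in
    hist.(i mod 55) <- x;
    incr pos;
    x

(* Attacker: record a window of 55 consecutive outputs, then predict. *)
let () =
  let gen = make_gen 42 in
  let window = Array.init 55 (fun _ -> gen ()) in
  for _ = 1 to 10 do
    let predicted = (window.(24) + window.(0)) land mask in
    let actual = gen () in
    assert (predicted = actual);       (* prediction always succeeds *)
    Array.blit window 1 window 0 54;   (* slide the window forward *)
    window.(54) <- actual
  done
```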
So this is very doable.
But if you want to go the extra mile, you can hash a sequence of `k` outputs before you copy that into a UUID. What's observable is `h(x_0, ..., x_(k-1))`, `h(x_k, ..., x_(2k-1))`, ... This breaks correlation between individual byte groups; now you have to attack the entire UUID as a unit.
As far as I can tell, to recover these portions of the RNG state in less than brute-forcing `k * 30` bits, you need a full pre-image attack on the hash. Even SHA1, these days considered unsafe, only suffers from collision attacks.
So in addition to allowing externally-supplied `bytes`, I would make UUIDs using `Random` by doing SHA1 over, say, 10 consecutive ints (for a solid 300 bits of uncertainty) and then formatting the result. This alone should make it infeasible to predict future outputs.
EDIT: ... and about 37x slower, with 134x more allocation. You can do a double-MD5 for a 5x time hit instead. MD5's pre-image hasn't been broken either -- I can't guarantee that the NSA won't be able to predict your next UUID, but very, very few attackers will be.
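A hedged sketch of the hashing idea: digest a run of `Random` outputs and use the raw digest as the UUID's random bytes. The stdlib's `Digest` (MD5) stands in here for the SHA1 suggested above; the helper name is illustrative, not uuidm's API.

```ocaml
(* Hash 10 consecutive Random outputs (10 x 30 = 300 bits of state) and
   return the 16 raw digest bytes, enough to fill a v4 UUID. *)
let hashed_random_bytes state =
  let buf = Buffer.create 64 in
  for _ = 1 to 10 do
    Buffer.add_string buf (string_of_int (Random.State.bits state))
  done;
  Digest.string (Buffer.contents buf)  (* Digest.t is a 16-byte string *)
```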
Yes, the SHA1 solution is probably good enough (except for slow performance). Double-MD5 seems a bit dubious to me. You might as well use SHA1, or use an actual crypto library if you care about performance.
Double-MD5 seems a bit dubious to me.
Why?
Serious people have long stopped studying MD5, but the crackers at the NSA and other agencies probably haven't, and have quite possibly broken it in more ways than are known. If you care about performance, use a performant crypto library, or at least pick a fast hash such as blake2b (faster than MD5, and considered safe).
Notice that in the construction above, all you require is that an attacker cannot invert the hash. Inverting it would give them lots of information about a slice of the state.
MD5 had its collision resistance completely broken in the last 10 years. This makes it unsuitable for, say, signatures, where a hash stands as a proxy for a message that the signer approved of. But even if you had an effective second-preimage attack, having a different RNG state that hashes the same as the RNG state that was used helps you in no way whatsoever to reconstruct it.
MD5 is also woefully inadequate for storing passwords, because we can compute it comparatively quickly. Here is an example that does about 2e11 hashes per second. Even if your password consists of mashing the keyboard uniformly at random, it still contains only printable characters, so at about 6.5 bits per byte, it takes an average of 3h to recover an 8-character password with that equipment.
The same equipment takes an average of 113,144,318,612,534,871,342,139,615 years to recover the 150-bit segment of the RNG state used for a single UUID, if you use only 5 consecutive integers to create it.
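As a back-of-the-envelope check of that figure, assuming an average search covers half the 150-bit space at the 2e11 hashes per second quoted above:

```ocaml
(* Average brute-force cost: 2^149 guesses (half of 2^150) at 2e11 MD5/s,
   converted to years. *)
let years = (2. ** 149.) /. 2e11 /. (3600. *. 24. *. 365.)
```

This comes out to roughly 1.13e26 years, consistent with the number stated above.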
No-one has ever suggested that MD5's pre-image resistance is compromised. This would be a major result. (Here is a discussion, but it's about passwords.)
The claim I'm making is that any hash that cannot be effectively inverted is equivalent for the construction above: an attacker observing only a sequence of consecutive generated UUIDs cannot predict them.
Note that this is all largely a moot point, because reusing the same `Random.State` for almost anything else apart from generating UUIDs will give out enough information to reconstruct it, backwards and forwards. So all of this goes with a big fat warning to consider your threat model and use a proper CSPRNG when applicable, and is only intended to give a bit of a security margin to people who ignore it.
Serious people have long stopped studying MD5, but the crackers at the NSA and other agencies probably haven't, and have quite possibly broken it in more ways than are known.
That's sort of an argument from herbal medicine. On one hand, inverting MD5 is a cardinal result. On the other hand, those same crackers have by now implanted not only our computers but our kidneys too, so we really shouldn't make them angry.
(Judging by their past track record, this also implies that every kid with an IRC client has remote execution capability on my kidneys.)
Thanks for the analysis @pqwy.
So this is very doable.
But then again you might not care.
I'd rather keep the `v4` functions as they are now for those who want performance. Also some people may rely on the reproducibility of `v4_gen`.
¯\_(ツ)_/¯
Cost-benefit-wise I think you should probably hash the outputs. Reproducibility is the same.
Cost-benefit-wise I think you should probably hash the outputs. Reproducibility is the same.
Well if I start hashing the outputs now the reproducibility will change...
I added something in 526fda7dc76548b483e57f9ecc1af10ec0550094
If UUIDs are used at scale, or worse, in security-sensitive environments, it becomes important to have a cryptographically strong PRNG, which `Random.State` isn't. Shouldn't the `uuidm` library use a cryptographically strong PRNG when generating a UUIDv4?