Closed: @fare closed this issue 5 years ago.
If UUIDs are used at scale, or worse, in security-sensitive environments, it becomes important to have a cryptographically strong PRNG, which `Random.State` isn't.
What defines a cryptographically strong PRNG exactly? Maybe that's not strong enough, but note that `Random.State` passes the Diehard tests.
/cc @pqwy
(but if needed we could certainly abstract the random bits a bit further)
Sorry, the correct S word was secure, not strong.
@fare Which property do you expect this other RNG to provide?
I'd expect the RNG to create identifiers that are not only unique but unpredictable by an adversary who could otherwise deny service or worse by registering different content under the anticipated UUID, or create other kinds of havoc.
I also want to be able to consider the UUID as a blackbox and not have to worry whether the implementation does or doesn't have a common security property I may want later — "of course the library does it 100% right" and "no, you don't need to rewrite your own library" are the answers I'd love to hear from every library in the ecosystem I use.
I'll concur that using `Random` makes it possible for someone to observe a sequence of generated UUIDs and start predicting them, and that this no longer holds if you switch to a CSPRNG.
I have no idea if this is expected of UUIDs, though.
I also want to be able to consider the UUID as a blackbox [..]
You can. The question is what is the black box supposed to be doing.
There are many good reasons to use a (suitably seeded) CSPRNG when generating UUIDs: https://security.stackexchange.com/questions/120352/should-i-use-a-cryptograpically-secure-random-number-generator-when-i-generate-i
I believe `uuidm` should use some cryptographic library for its CSPRNG as well as for its MD5 and SHA1 support.
I can only see theoretical considerations on stack exchange, where people are rehashing what it means to use a CSPRNG. We established that.
(Note that on a full-fledged OS, using `Random.init` does not lead to collisions, because it uses the system RNG to seed, so UUIDs remain probably-unique.)
In my use, I would never rely on something like `uuidm` if I wanted cryptographically secure -- that is, strongly unpredictable -- objects. I would tap into an appropriate RNG directly, and treat the UUID part as formatting. But I might be misunderstanding the point of UUIDs, and it's a decision for @dbuenzli to make.
I can offer the fact that there is an appropriate OCaml library with both hashing and a CSPRNG around, and it's about to get much slimmer. But if `uuidm` is expected to run only on a full OS, it can avoid that dep by using the system RNG (either `/dev/urandom` or `getrandom(2)`, plus the Windows equivalent).
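For instance, on a Unix-like OS, pulling random bytes from the system RNG needs no extra dependency. A minimal sketch (the helper name is illustrative, not part of `uuidm`):

```ocaml
(* Hypothetical helper: read [n] bytes from the system RNG on a Unix-like
   OS. Not uuidm's actual code; just a sketch of the no-dependency option. *)
let urandom_bytes n =
  let ic = open_in_bin "/dev/urandom" in
  Fun.protect ~finally:(fun () -> close_in ic) @@ fun () ->
  Bytes.of_string (really_input_string ic n)
```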
Upon further reading of the API... it could definitely use an easier way to plug in other sources of randomness. Like `v4 : (unit -> int) -> t` or, better still, `v4 : (int -> bytes) -> t`.
I also want to be able to consider the UUID as a blackbox and not have to worry whether the implementation does or doesn't have a common security property I may want later — "of course the library does it 100% right"
The problem is that people tend to assume properties that may not hold because "hey, of course the library does it 100% right". In a context where there's no single true answer (e.g. some people only care about speed and unicity, others about sequence reproducibility, others may need to use a particular (CS)PRNG, etc.) it's better for a library to make you think about what you are doing rather than do 100% right what is wrong for you.
In any case this library is getting neither a system dependency nor a dependency on a cryptographic library, however slim -- and slick, I'm sure -- it might be. In this particular case it seems better to put things in the control of the client, which also leads to more explicit code if you are writing security-sensitive code.
Along the lines of what @pqwy suggested this could take the form of one or more of these sigs:

1. `val v4 : (unit -> int) -> t`
2. `val v4 : (int -> bytes) -> t`
3. `val v4 : bytes -> t`

The last one assumes 16 random bytes, copies them and overwrites the few bits that need to be. I would tend to go with 3., which puts the most freedom in the hands of the client on how to operate. This however assumes that overwriting 6 bits of the given 128 random bits doesn't bias anything (?). Otherwise we could take the first 122 bits of the given bytes and redistribute them (but that entails more bit fiddling).
The advantage of 2. is that it could save us an allocation; alternatively, in 3. we could skip the copy. All this depends a bit on what kind of interface we expect from CSPRNGs in the OCaml world (@pqwy ?).
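A minimal sketch of what option 3 could look like, assuming a UUID is represented as 16 bytes (the representation and names here are illustrative, not uuidm's actual internals):

```ocaml
(* Hypothetical sketch of option 3: the client supplies 16 random bytes,
   and the library copies them and overwrites the 6 version/variant bits
   required by RFC 4122. Not uuidm's actual API. *)
type t = bytes  (* 16 bytes *)

let v4 (random : bytes) : t =
  if Bytes.length random <> 16 then invalid_arg "v4: expected 16 bytes";
  let u = Bytes.copy random in
  (* Byte 6: high nibble becomes 0b0100 (version 4). *)
  Bytes.set_uint8 u 6 (0x40 lor (Bytes.get_uint8 u 6 land 0x0f));
  (* Byte 8: top two bits become 0b10 (RFC 4122 variant). *)
  Bytes.set_uint8 u 8 (0x80 lor (Bytes.get_uint8 u 8 land 0x3f));
  u
```

Only the 4 version bits of byte 6 and the 2 variant bits of byte 8 are overwritten; the remaining 122 bits pass through untouched.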
I will also make sure to comment on the current `v4` functions that they are unsuitable for use if the sequence of UUIDs needs to be unpredictable to an observer, which, if I understand correctly, is the only problem the current functions have.
Agreed that putting the client in control is a good design, especially if, for some reason, adding a dependency is bad.
NB: maybe for V3 and V5, the md5 or sha1 can also be provided by the client, then? No need for a slow implementation in OCaml and/or it can be provided as a separate library.
Or maybe there can be three distinct libraries:

- `uuidm-core` that includes no crypto; the user must provide their own.
- `uuidm-classic` that uses `uuidm-core` with the same quick-n-dirty-but-portable crypto as present.
- `uuidm-cryptokit` that uses `cryptokit` for its crypto.

NB: maybe for V3 and V5, the md5 or sha1 can also be provided by the client, then? No need for a slow implementation in OCaml and/or it can be provided as a separate library.
If it ain't broke don't fix it. For anything like this to happen you will have to come up with a concrete problematic use case and numbers.
This however assumes overwriting 6 bits of the given 128 random bits doesn't bias anything (?).
It does not bias the remaining 122 bits.
All this depends a bit on what kind of interface we expect from CSPRNG in the OCaml world (@pqwy ?).
Essentially `t -> int -> bytes` for some `t`, with caller-owned buffer. 2. and 3. are both ok.
I wouldn't think too hard about allocations of fixed-size strings. A 16-byte string is stored in 4 words, and blazingly fast to allocate on the minor heap.
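In module form, the shape described above might look like this (the signature and the toy instantiation are illustrative only, not an existing library's interface):

```ocaml
(* Sketch of the CSPRNG shape described above: a generator type plus a
   function returning a fresh, caller-owned buffer of n random bytes. *)
module type CSPRNG = sig
  type t
  val generate : t -> int -> bytes
end

(* Toy instantiation over Random.State, only to show how a v4 function
   could consume it; a real CSPRNG implementation would replace this. *)
module Insecure_rng : CSPRNG with type t = Random.State.t = struct
  type t = Random.State.t
  let generate st n =
    Bytes.init n (fun _ -> Char.chr (Random.State.int st 256))
end
```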
I will also make sure to comment on the current `v4` functions that they are unsuitable for use if the sequence of UUIDs needs to be unpredictable to an observer, which if I understand correctly is the only problem the current functions have.
In practical terms, if you observe 55 consecutive outputs of `Random`, you know the state. Each output `i` is fully determined by outputs `i - 31` and `i - 55`, 30 bits wide on all arches. `uuidm` takes 5 consecutive outputs, so 11 UUIDs fully refresh the state. Each UUID contains 24, 28, 30, 24 and 16 bits of its outputs. A group of 3 bytes in a UUID is therefore determined by similar byte groups in UUIDs 7 and 11 places back, up to approximately 18 bits of uncertainty (depending on which exact part of the UUID). So without any actual cleverness, you can recover 2 out of 55 outputs by doing a brute-force of about 2^18. Repeat that 28 times, and you can start exactly predicting the UUIDs.
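The core of the attack can be illustrated with a toy additive lagged-Fibonacci generator using the same lags (a simplification, not `Random`'s exact algorithm): once 55 consecutive outputs are observed, every later output follows deterministically.

```ocaml
let mask = (1 lsl 30) - 1

(* Toy generator with x_i = x_(i-31) + x_(i-55) mod 2^30, over a circular
   history of 55 entries. The attacker never sees the seed or the state. *)
let make_gen seed =
  let hist =
    Array.init 55 (fun i -> (seed * 2654435761 + i * 40503) land mask) in
  let pos = ref 0 in
  fun () ->
    let i = !pos in
    let x = (hist.((i + 24) mod 55) + hist.(i mod 55)) land mask in
    hist.(i mod 55) <- x;
    incr pos;
    x

(* Attacker: record a window of 55 consecutive outputs, then predict. *)
let () =
  let gen = make_gen 42 in
  let window = Array.init 55 (fun _ -> gen ()) in
  for _ = 1 to 10 do
    let predicted = (window.(24) + window.(0)) land mask in
    let actual = gen () in
    assert (predicted = actual);       (* prediction always succeeds *)
    Array.blit window 1 window 0 54;   (* slide the window forward *)
    window.(54) <- actual
  done
```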
So this is very doable.
But if you want to go the extra mile, you can hash a sequence of `k` outputs before you copy that into a UUID. What's observable is `h(x_0, ..., x_(k-1))`, `h(x_k, ..., x_(2k-1))`, ... This breaks correlation between individual byte groups; now you have to attack the entire UUID as a unit.
As far as I can tell, to recover these portions of the RNG state in less than brute-forcing `k * 30` bits, you need a full pre-image attack on the hash. Even SHA1, these days considered unsafe, only suffers from collision attacks.
So in addition to allowing externally-supplied `bytes`, I would make UUIDs using `Random` by doing SHA1 over, say, 10 consecutive ints (for a solid 300 bits of uncertainty) and then formatting the result. This alone should make it infeasible to predict future outputs.
EDIT: ... and about 37x slower, with 134x more allocation. You can do a double-MD5 for a 5x time hit instead. MD5's pre-image hasn't been broken either -- I can't guarantee that the NSA won't be able to predict your next UUID, but very, very few attackers will be.
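A hedged sketch of the hashing idea: digest a run of `Random` outputs and use the raw digest as the UUID's random bytes. The stdlib's `Digest` (MD5) stands in here for the SHA1 suggested above; the helper name is illustrative, not uuidm's API.

```ocaml
(* Hash 10 consecutive Random outputs (10 x 30 = 300 bits of state) and
   return the 16 raw digest bytes, enough to fill a v4 UUID. *)
let hashed_random_bytes state =
  let buf = Buffer.create 64 in
  for _ = 1 to 10 do
    Buffer.add_string buf (string_of_int (Random.State.bits state))
  done;
  Digest.string (Buffer.contents buf)  (* Digest.t is a 16-byte string *)
```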
Yes, the SHA1 solution is probably good enough (except for slow performance). Double-MD5 seems a bit dubious to me. You might as well use SHA1, or use an actual crypto library if you care about performance.
Double-MD5 seems a bit dubious to me.
Why?
Serious people have long stopped studying MD5, but the crackers at the NSA and other agencies probably haven't, and have quite possibly broken it in more ways than are known. If you care about performance, use a performant crypto library, or at least pick a fast hash such as blake2b (faster than MD5, and considered safe).
Notice that in the construction above, all you require is that an attacker cannot invert the hash. Inverting it would give them lots of information about a slice of the state.
MD5 had its collision resistance completely broken in the last 10 years. This makes it unsuitable for, say, signatures, where a hash stands as a proxy for a message that the signer approved of. But even if you had an effective second-preimage attack, having a different RNG state that hashes the same as the RNG state that was used helps you in no way whatsoever to reconstruct it.
MD5 is also woefully inadequate for storing passwords, because we can compute it comparatively quickly. Here is an example that does about 2e11 hashes per second. Even if your password consists of mashing the keyboard uniformly at random, it still contains only printable characters, so at about 6.5 bits per byte, it takes an average of 3h to recover an 8-character password with that equipment.
The same equipment takes an average of 113,144,318,612,534,871,342,139,615 years to recover the 150-bit segment of the RNG state used for a single UUID, if you use only 5 consecutive integers to create it.
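As a back-of-the-envelope check of that figure, assuming an average search covers half the 150-bit space at the 2e11 hashes per second quoted above:

```ocaml
(* Average brute-force cost: 2^149 guesses (half of 2^150) at 2e11 MD5/s,
   converted to years. *)
let years = (2. ** 149.) /. 2e11 /. (3600. *. 24. *. 365.)
```

This comes out to roughly 1.13e26 years, consistent with the number stated above.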
No-one has ever suggested that MD5's pre-image resistance is compromised. This would be a major result. (Here is a discussion, but it's about passwords.)
The claim I'm making is that any hash that cannot be effectively inverted is equivalent for the construction above: an attacker observing only a sequence of consecutive generated UUIDs cannot predict them.
Note that this is all largely a moot point, because reusing the same `Random.State` for almost anything else apart from generating UUIDs will give out enough information to reconstruct it, backwards and forwards. So all of this goes with a big fat warning to consider your threat model and use a proper CSPRNG when applicable, and is only intended to give a bit of a security margin to people who ignore it.
Serious people have long stopped studying MD5, but the crackers at the NSA and other agencies probably haven't, and have quite possibly broken it in more ways than are known.
That's sort of an argument from herbal medicine. On one hand, inverting MD5 is a cardinal result. On the other hand, those same crackers have by now implanted not only our computers but our kidneys too, so we really shouldn't make them angry.
(Judging by their past track record, this also implies that every kid with an IRC client has remote execution capability on my kidneys.)
Thanks for the analysis @pqwy.
So this is very doable.
But then again you might not care.
I'd rather keep the `v4` functions as they are now for those who want performance. Also some people may rely on the reproducibility of `v4_gen`.
¯\_(ツ)_/¯
Cost-benefit-wise I think you should probably hash the outputs. Reproducibility is the same.
Cost-benefit-wise I think you should probably hash the outputs. Reproducibility is the same.
Well if I start hashing the outputs now the reproducibility will change...
I added something in 526fda7dc76548b483e57f9ecc1af10ec0550094
If UUIDs are used at scale, or worse, in security-sensitive environments, it becomes important to have a cryptographically strong PRNG, which `Random.State` isn't. Shouldn't the `uuidm` library use a cryptographically strong PRNG when generating a UUIDv4?