Open jdukes opened 8 years ago
Is there a less disruptive fix than "switch to SipHash" ?
The idea with SipHash being to pick a random seed for the PRF on startup? We may need hash persistence across precompiles. @vtjnash?
Is this only a problem with hashing of strings? Or should the language also protect numeric types against similar attacs?
It's really easy to construct colissions numbers that give similar hashes, since the hash algorithm is invertible.
@ivarne The majority of the risk comes from strings (http headers, cookies, and parameters) for the attack scenario I called out. However, as @simonster pointed out, number type collisions may actually be easier to find (I didn't look in to that). If numtype keys are broken and aren't also fixed then you'll run in to a problem later on with JSON.
In JSON, aren't all keys strings? Doesn't mean the concern shouldn't be addressed for numbers too, of course.
To support precompilation/serialization of global Dict{String,T}
objects, we could store the seed in the Dict
. However, this would be discard the security gains of Sip if the module subsequently stored untrusted data in the global dict.
Alternatively, we could automatically rehash!
dictionaries of string/symbol keys when deserializing. And maybe provide a FastDict
type that uses a simpler/faster hash (e.g. MurmurHash) with a fixed seed, for cases where security is not a concern (e.g. untrusted data is never hashed) and performance/serialization is paramount.
@kmsquire, all keys in JSON are strings. The applications for numeric-key hashes are very different, and the algorithms are very different; I don't think we should worry about that here. Even Python only uses SipHash for strings and byte arrays.
fwiw, I doubt serializing a global non-empty Dict
works now.
@yuyichao, the manual says: Dictionary and set types, or in general anything that depends on the output of a hash(key) method, are a trickier case. In the common case where the keys are numbers, strings, symbols, ranges, Expr, or compositions of these types (via arrays, tuples, sets, pairs, etc.) they are safe to precompile. And I believe this is correct (and is relied upon by a number of modules).
Dicts are serialized by writing out the keys and values compactly, and re-inserting on deserialize, so we already rehash in this case.
@stevengj oh yeah, you're right on that.
See also the discussion and links here: https://discourse.julialang.org/t/use-of-murmurhash3-for-hashing-strings/9818
A cryptographically secure randomized hash seed seems like a good minimum thing to implement here.
(Once the seed is known, even a crypto hash function can't protect you, since hash tables use only the low-order bits and hence are vulnerable to brute forcing collisions.)
Using a cryptographically secure hash function for hash
itself seems to mainly defend against the circumstance where the value of hashkey(x) mod n
is visible to the attacker, but the seed is not, which seems like a much lower-tier concern for us, particularly since users in sensitive applications can swap in their own hash functions with no performance penalty (by wrapping the keys in a custom struct
with its own hash
method). Disclaimer: I'm not a cryptographer.
Julia uses the
hash
function in theDict
object when keys are strings. Under the hood this uses MurmurHash3. While for most input MurmurHash3 results in an optimal distribution, an attacker can craft input that will explicitly collide.Any system that takes untrusted input from users and passes it in to a Julia
Dict
becomes vulnerable to a Denial of Service commonly known as HashDoS. (Here's a bit of extra info on the issue identified in most major web languages at the end of 2011). The Julia web stack and JSON.jl are specifically vulnerable as they necessarily push untrusted input (headers, cookies, parameters, JSON) in toDict
objects.