JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.73k stars 5.48k forks source link

Julia is vulnerable to HashDoS #16172

Open jdukes opened 8 years ago

jdukes commented 8 years ago

Julia uses the hash function in the Dict object when keys are strings. Under the hood this uses MurmurHash3. While for most input MurmurHash3 results in an optimal distribution, an attacker can craft input that will explicitly collide.

Any system that takes untrusted input from users and passes it in to a Julia Dict becomes vulnerable to a Denial of Service commonly known as HashDoS. (Here's a bit of extra info on the issue identified in most major web languages at the end of 2011). The Julia web stack and JSON.jl are specifically vulnerable as they necessarily push untrusted input (headers, cookies, parameters, JSON) in to Dict objects.

tkelman commented 8 years ago

Is there a less disruptive fix than "switch to SipHash" ?

Keno commented 8 years ago

The idea with SipHash being to pick a random seed for the PRF on startup? We may need hash persistence across precompiles. @vtjnash?

ivarne commented 8 years ago

Is this only a problem with hashing of strings? Or should the language also protect numeric types against similar attacs?

simonster commented 8 years ago

It's really easy to construct colissions numbers that give similar hashes, since the hash algorithm is invertible.

jdukes commented 8 years ago

@ivarne The majority of the risk comes from strings (http headers, cookies, and parameters) for the attack scenario I called out. However, as @simonster pointed out, number type collisions may actually be easier to find (I didn't look in to that). If numtype keys are broken and aren't also fixed then you'll run in to a problem later on with JSON.

kmsquire commented 8 years ago

In JSON, aren't all keys strings? Doesn't mean the concern shouldn't be addressed for numbers too, of course.

stevengj commented 8 years ago

To support precompilation/serialization of global Dict{String,T} objects, we could store the seed in the Dict. However, this would be discard the security gains of Sip if the module subsequently stored untrusted data in the global dict.

Alternatively, we could automatically rehash! dictionaries of string/symbol keys when deserializing. And maybe provide a FastDict type that uses a simpler/faster hash (e.g. MurmurHash) with a fixed seed, for cases where security is not a concern (e.g. untrusted data is never hashed) and performance/serialization is paramount.

stevengj commented 8 years ago

@kmsquire, all keys in JSON are strings. The applications for numeric-key hashes are very different, and the algorithms are very different; I don't think we should worry about that here. Even Python only uses SipHash for strings and byte arrays.

yuyichao commented 8 years ago

fwiw, I doubt serializing a global non-empty Dict works now.

stevengj commented 8 years ago

@yuyichao, the manual says: Dictionary and set types, or in general anything that depends on the output of a hash(key) method, are a trickier case. In the common case where the keys are numbers, strings, symbols, ranges, Expr, or compositions of these types (via arrays, tuples, sets, pairs, etc.) they are safe to precompile. And I believe this is correct (and is relied upon by a number of modules).

JeffBezanson commented 8 years ago

Dicts are serialized by writing out the keys and values compactly, and re-inserting on deserialize, so we already rehash in this case.

jdukes commented 8 years ago

@stevengj oh yeah, you're right on that.

stevengj commented 6 years ago

See also the discussion and links here: https://discourse.julialang.org/t/use-of-murmurhash3-for-hashing-strings/9818

A cryptographically secure randomized hash seed seems like a good minimum thing to implement here.

(Once the seed is known, even a crypto hash function can't protect you, since hash tables use only the low-order bits and hence are vulnerable to brute forcing collisions.)

Using a cryptographically secure hash function for hash itself seems to mainly defend against the circumstance where the value of hashkey(x) mod n is visible to the attacker, but the seed is not, which seems like a much lower-tier concern for us, particularly since users in sensitive applications can swap in their own hash functions with no performance penalty (by wrapping the keys in a custom struct with its own hash method). Disclaimer: I'm not a cryptographer.