Add support for SipHash128 with all-zero key

briansmith commented 4 months ago

I referenced your work on the ASCII identifier MD5 collision in https://github.com/rust-lang/rust/issues/10389#issuecomment-2181169783. It would be cool if we could extend your tools to also work on SipHash128 with an all-zero key being used as a hash function. Is that something that you would accept? Any tips for contributing it?

briansmith commented 4 months ago

See also discussion in https://github.com/typst/comemo/issues/3.

Note that I'm suggesting adding support for the weaker SipHash 1-3 with 128-bit output, not the default stronger SipHash 2-4 with the default 64-bit output.

cr-marcstevens commented 4 months ago

Does rust currently uses 64-bit TypeIds ?

cr-marcstevens commented 4 months ago

What would happen if there was a collision?

bjorn3 commented 4 months ago

Does rust currently uses 64-bit TypeIds ?

It used to, but has been changed to 128-bit a while back.

What would happen if there was a collision?

You can transmute one type to another if they have the same TypeId hash even if doing so would violate memory safety.

cr-marcstevens commented 4 months ago

Only transmutation by choice? But no bad side consequences that could creep up on people?

briansmith commented 4 months ago

Does rust currently uses 64-bit TypeIds ?

Rust originally used the original SipHash 2-4 with 64-bit output for TypeId. I believe somebody demonstrated a collision so it switched to SipHash 2-4 128-bit output. Then it switched to SipHash 1-3 with 128-bit output.

What would happen if there was a collision?

As @bjorn3 said, various "safe" APIs in Rust, basically anything that uses TypeId or Any in Rust, would become unsound; i.e. there could be memory-safety issues even in code that doesn't use unsafe if they use those APIs with types that have collisions in their TypeIds.

See https://github.com/rust-lang/rust/issues/10389#issuecomment-147919570 for an example from back when the shorter TypeIds were used.

briansmith commented 4 months ago

Only transmutation by choice? But no bad side consequences that could creep up on people?

The Any and TypeId APIs are easily reachable from anything that "downcasts", e.g. https://doc.rust-lang.org/stable/std/io/struct.Error.html#method.downcast. Those don't even mention Any or TypeId.

The SipHash-based hasher is also used for some equality comparisons within the compiler, so potentially when compiling one type could be silently substituted for another. So in theory just compiling two crates (or a single crate) that defines types with typeid collision could result in a miscompilation.

There is effort underway to further use SipHash for other uses, e.g. https://github.com/rust-lang/cargo/issues/13171#issuecomment-2181266802. I find the discussion about that a bit confusing still.

cr-marcstevens commented 4 months ago

But a collision of course must follow the apropriate format of how types are converted to a message input to siphash. Is there a (simple) description of this format?

briansmith commented 4 months ago

See also https://github.com/rust-lang/rust/issues/41215 which seems to imply that it is used in the incremental compilation cache.

See also the use in the mold linker: https://github.com/rui314/mold/commit/fa8e95a289f911c0c47409d5848c993cb50c8862. My understanding is that when Siphash(k, x) = Siphash(k, y) all references to the the object code y will be replaced to instead refer to the object code x (i.e. x and y will be "folded" together as identical code). IIUC from the twitter discussion, mold may use a random key.

bjorn3 commented 4 months ago

Is there a (simple) description of this format?

It depends on the rustc version and isn't directly written down in the rustc source code but created using code generation while compiling rustc (to be precise HashStable is derived using a proc macro). The format seems to roughly be the following:

Take the index of the TyKind variant and hash it as 64-bit(?) little endian number (counting from zero). Then hash all the fields. For simple enums recurse the same way. For DefId, look up the DefPathHash (which should have collisions detected already in the absence of dylibs) and hash the resulting 128-bit hash stored in the DefPathHash. For any Ty you can hash the inner TyKind. For RawList hash the length of the list as 64-bit little endian number followed by hashing all elements in order. For other types you will have to look at the StableHash implementation or just ignore the respective TyKind variant if the rest of the variants give you enough freedom in picking a hash input with the properties you need for a hash collision.

briansmith commented 4 months ago

How hard would it be to create a program that takes a source code representation of a $ty and produces the bytes that are fed into SipHasher? I am not familiar with the compiler enough to know if it is feasible to extract a small bit of code from the compiler or re-implement to produce such a thing.

bjorn3 commented 4 months ago

If you ignore DefPathHash it should be reasonably straight forward if a bit tedious to write code that acts the exact same way. For DefPathHash (which is used for extern types as well as closures, generators, and functions definitions) the answer is less straight forward. The DefPathHash consists of two halves. The first is the StableCrateId, a 64-bit hash of the crate name, if the crate is a library or not, all -Cmetadata arguments (after sorting and deduplicating) and the rustc version string. This is generated at https://github.com/rust-lang/rust/blob/cb8a7ea0ed866295e0f65725cea6662bea51971a/compiler/rustc_span/src/def_id.rs#L146-L189 Rustc also checks that no two crates with the same StableCrateId are loaded at compile time, which ensures it is unique at runtime too if you ignore dylibs. The other half of the DefPathHash is a hash which identifies a definition within a single crate. Rustc enforces this uniqueness too. It is created across https://github.com/rust-lang/rust/blob/cb8a7ea0ed866295e0f65725cea6662bea51971a/compiler/rustc_hir/src/definitions.rs (search for DefPathHash::new)

briansmith commented 4 months ago

Another motivating example: https://github.com/rust-lang/cargo/issues/6529#issuecomment-2181473376

cr-marcstevens / hashclash

Add support for SipHash128 with all-zero key #40