Open briansmith opened 4 months ago
See also discussion in https://github.com/typst/comemo/issues/3.
Note that I'm suggesting adding support for the weaker SipHash 1-3 with 128-bit output, not the default stronger SipHash 2-4 with the default 64-bit output.
Does rust currently uses 64-bit TypeIds ?
What would happen if there was a collision?
Does rust currently uses 64-bit TypeIds ?
It used to, but has been changed to 128-bit a while back.
What would happen if there was a collision?
You can transmute one type to another if they have the same TypeId
hash even if doing so would violate memory safety.
Only transmutation by choice? But no bad side consequences that could creep up on people?
Does rust currently uses 64-bit TypeIds ?
TypeId
. I believe somebody demonstrated a collision so it switched to SipHash 2-4 128-bit output. Then it switched to SipHash 1-3 with 128-bit output.What would happen if there was a collision?
As @bjorn3 said, various "safe" APIs in Rust, basically anything that uses TypeId
or Any
in Rust, would become unsound; i.e. there could be memory-safety issues even in code that doesn't use unsafe
if they use those APIs with types that have collisions in their TypeIds.
See https://github.com/rust-lang/rust/issues/10389#issuecomment-147919570 for an example from back when the shorter TypeIds were used.
Only transmutation by choice? But no bad side consequences that could creep up on people?
The Any
and TypeId
APIs are easily reachable from anything that "downcasts", e.g. https://doc.rust-lang.org/stable/std/io/struct.Error.html#method.downcast. Those don't even mention Any
or TypeId
.
The SipHash-based hasher is also used for some equality comparisons within the compiler, so potentially when compiling one type could be silently substituted for another. So in theory just compiling two crates (or a single crate) that defines types with typeid collision could result in a miscompilation.
There is effort underway to further use SipHash for other uses, e.g. https://github.com/rust-lang/cargo/issues/13171#issuecomment-2181266802. I find the discussion about that a bit confusing still.
But a collision of course must follow the apropriate format of how types are converted to a message input to siphash. Is there a (simple) description of this format?
See also https://github.com/rust-lang/rust/issues/41215 which seems to imply that it is used in the incremental compilation cache.
See also the use in the mold linker: https://github.com/rui314/mold/commit/fa8e95a289f911c0c47409d5848c993cb50c8862. My understanding is that when Siphash(k, x) = Siphash(k, y) all references to the the object code y will be replaced to instead refer to the object code x (i.e. x and y will be "folded" together as identical code). IIUC from the twitter discussion, mold may use a random key.
Is there a (simple) description of this format?
It depends on the rustc version and isn't directly written down in the rustc source code but created using code generation while compiling rustc (to be precise HashStable
is derived using a proc macro). The format seems to roughly be the following:
Take the index of the TyKind variant and hash it as 64-bit(?) little endian number (counting from zero). Then hash all the fields. For simple enums recurse the same way. For DefId
, look up the DefPathHash
(which should have collisions detected already in the absence of dylibs) and hash the resulting 128-bit hash stored in the DefPathHash
. For any Ty
you can hash the inner TyKind
. For RawList
hash the length of the list as 64-bit little endian number followed by hashing all elements in order. For other types you will have to look at the StableHash
implementation or just ignore the respective TyKind
variant if the rest of the variants give you enough freedom in picking a hash input with the properties you need for a hash collision.
How hard would it be to create a program that takes a source code representation of a $ty
and produces the bytes that are fed into SipHasher? I am not familiar with the compiler enough to know if it is feasible to extract a small bit of code from the compiler or re-implement to produce such a thing.
If you ignore DefPathHash
it should be reasonably straight forward if a bit tedious to write code that acts the exact same way. For DefPathHash
(which is used for extern types as well as closures, generators, and functions definitions) the answer is less straight forward. The DefPathHash
consists of two halves. The first is the StableCrateId
, a 64-bit hash of the crate name, if the crate is a library or not, all -Cmetadata
arguments (after sorting and deduplicating) and the rustc version string. This is generated at https://github.com/rust-lang/rust/blob/cb8a7ea0ed866295e0f65725cea6662bea51971a/compiler/rustc_span/src/def_id.rs#L146-L189 Rustc also checks that no two crates with the same StableCrateId
are loaded at compile time, which ensures it is unique at runtime too if you ignore dylibs. The other half of the DefPathHash
is a hash which identifies a definition within a single crate. Rustc enforces this uniqueness too. It is created across https://github.com/rust-lang/rust/blob/cb8a7ea0ed866295e0f65725cea6662bea51971a/compiler/rustc_hir/src/definitions.rs (search for DefPathHash::new
)
Another motivating example: https://github.com/rust-lang/cargo/issues/6529#issuecomment-2181473376
I referenced your work on the ASCII identifier MD5 collision in https://github.com/rust-lang/rust/issues/10389#issuecomment-2181169783. It would be cool if we could extend your tools to also work on SipHash128 with an all-zero key being used as a hash function. Is that something that you would accept? Any tips for contributing it?