Open RyanGlScott opened 3 months ago
it is very unlikely that you'll get the same disambiguator values as output if you make even a small change to the input
Cargo computes disambiguators based on various metadata, including Cargo dependencies, enabled features, and various other factors. However, they don't depend on the actual source of the crate being compiled. If you're seeing changes to the disambiguator value when only the source code has changed, then it's possible mir-json is using something else for its disambiguators (maybe the "stable version hash", which is much stricter) and could be switched to use the Cargo disambiguator instead.
or if you run mir-json on the same input, but on different machines
Compiling for a different target can actually compile entirely different code due to conditional compilation. Maybe we should compile tests for a specific target, regardless of the host platform, to ensure this is stable.
This could be achieved by performing a stable, lexicographic sort on all of the crate names in use and assigning them integers in ascending order of the crate names
The main issue I see with this approach is that when two crates in the dependency graph have the same name, it doesn't provide a way to deterministically assign them different disambiguator. The purpose of the disambiguator is to distinguish between different crates of the same name, such as when a crate depends (transitively) on both foo
v1 and foo
v2.
Another issue is that we need to assign disambiguators consistently across different mir-json invocations. In cargo crux-test
, each of the dependencies is a separate mir-json invocation, and the outputs from all invocations are linked together at the end, so the paths (including disambiguators) produced by the different invocations all need to match up.
If the Cargo-provided disambiguators aren't stable enough, we could potentially implement a renumbering like this as a postprocessing pass after linking. We would use the normal disambiguators during the build, but at the end, replace all disambiguators in the linked json output with a counter value for each crate name. rustc assigns CrateNum
s to the main crate and its dependencies in a deterministic way, so we could export the list of CrateNum
, name, and disambiguator for each crate, then for each name, the lowest CrateNum
with that name gets disambiguator 0, the next-lowest gets 1, and so on. This won't be 100% consistent across different targets (which can have different depedencies entirely), but it should be pretty close, especially for crates that depend only on the standard library.
If the Cargo-provided disambiguators aren't stable enough, we could potentially implement a renumbering like this as a postprocessing pass after linking.
Yes, I think this is what I had in mind.
This won't be 100% consistent across different targets (which can have different depedencies entirely)
Fair enough. I think it would much more likely to be portable than it is currently, especially for testing purposes.
We shouldn't have tests that have per-target conditional compilation :-) or if we do, they should be getting built for all targets on all test runs. So I don't think that part is a problem. I guess at some point we might have a test that depends on some library that's got target-dependent code in it, but I think we can worry about that when/if it happens.
Meanwhile we can definitely create new disambiguators that depend on whatever we want; it's just a form of alpha-conversion after all. The trick is to figure out how to do it such that small perturbations don't renumber the entire output. rustc/cargo doesn't really have any reason to care about this (and for cargo, arguably having these things be unstable is a feature that helps prevent accidental crosslinking or version skew) so it's not surprising that the native disambiguators don't really meet our needs.
Currently,
mir-json
produces crate names with disambiguators (e.g., the7570fc8a
intest/7570fc8a
). The process by whichmir-json
determines what disambiguator values to use is rather hard to predict, because (as I understand it) it is influenced byrustc
command-line arguments, operating system information, architecture information, and other idiosyncrasies. As such, it is very unlikely that you'll get the same disambiguator values as output if you make even a small change to the input (or if you runmir-json
on the same input, but on different machines).This is very annoying from a testing perspective, as:
mir-json
, the resulting diff is incredibly noisy due to every disambiguator value changing.crux-mir
andcrux-mir-comp
test suites. In order to achieve this, we have to replace all of the disambiguator values post facto with a constant value (see the code here in thecrux-mir
test suite).This has lead @sauclovian-g and I to wonder: can we solve this problem at the
mir-json
level? Instead of generating raw disambiguator values, it would be nice ifmir-json
had an option that replaced each disambiguator with deterministic numbers. This could be achieved by performing a stable, lexicographic sort on all of the crate names in use and assigning them integers in ascending order of the crate names. For example, if you had cratestest/<disambiguator-for-test>
andunicode-xid/<disambiguator-for-unicode-xid>
, then this option would replace them with something liketest/00000000
andunicode-xid/00000001
. Importantly, these would be the values that are chosen each time, regardless of what the specific values of<disambiguator-for-test>
and<disambiguator-for-unicode-xid>
are.