Add option for producing deterministic disambiguator values

RyanGlScott commented 3 months ago

Currently, mir-json produces crate names with disambiguators (e.g., the 7570fc8a in test/7570fc8a). The process by which mir-json determines what disambiguator values to use is rather hard to predict, because (as I understand it) it is influenced by rustc command-line arguments, operating system information, architecture information, and other idiosyncrasies. As such, it is very unlikely that you'll get the same disambiguator values as output if you make even a small change to the input (or if you run mir-json on the same input, but on different machines).

This is very annoying from a testing perspective, as:

Every time you need to regenerate MIR JSON files using a new version of mir-json, the resulting diff is incredibly noisy due to every disambiguator value changing.
Most of the time, you want your test cases to be insensitive to the particular disambiguator values in use, as is the case for the golden tests in the crux-mir and crux-mir-comp test suites. In order to achieve this, we have to replace all of the disambiguator values post facto with a constant value (see the code here in the crux-mir test suite).
If you have any test cases that do rely on specific disambiguator values (as is the case in SAW here), you need to remember to update that disambiguator value each time it is regenerated. This is very tiresome.

This has lead @sauclovian-g and I to wonder: can we solve this problem at the mir-json level? Instead of generating raw disambiguator values, it would be nice if mir-json had an option that replaced each disambiguator with deterministic numbers. This could be achieved by performing a stable, lexicographic sort on all of the crate names in use and assigning them integers in ascending order of the crate names. For example, if you had crates test/<disambiguator-for-test> and unicode-xid/<disambiguator-for-unicode-xid> , then this option would replace them with something like test/00000000 and unicode-xid/00000001. Importantly, these would be the values that are chosen each time, regardless of what the specific values of <disambiguator-for-test> and <disambiguator-for-unicode-xid> are.

spernsteiner commented 3 months ago

it is very unlikely that you'll get the same disambiguator values as output if you make even a small change to the input

Cargo computes disambiguators based on various metadata, including Cargo dependencies, enabled features, and various other factors. However, they don't depend on the actual source of the crate being compiled. If you're seeing changes to the disambiguator value when only the source code has changed, then it's possible mir-json is using something else for its disambiguators (maybe the "stable version hash", which is much stricter) and could be switched to use the Cargo disambiguator instead.

or if you run mir-json on the same input, but on different machines

Compiling for a different target can actually compile entirely different code due to conditional compilation. Maybe we should compile tests for a specific target, regardless of the host platform, to ensure this is stable.

This could be achieved by performing a stable, lexicographic sort on all of the crate names in use and assigning them integers in ascending order of the crate names

The main issue I see with this approach is that when two crates in the dependency graph have the same name, it doesn't provide a way to deterministically assign them different disambiguator. The purpose of the disambiguator is to distinguish between different crates of the same name, such as when a crate depends (transitively) on both foo v1 and foo v2.

Another issue is that we need to assign disambiguators consistently across different mir-json invocations. In cargo crux-test, each of the dependencies is a separate mir-json invocation, and the outputs from all invocations are linked together at the end, so the paths (including disambiguators) produced by the different invocations all need to match up.

If the Cargo-provided disambiguators aren't stable enough, we could potentially implement a renumbering like this as a postprocessing pass after linking. We would use the normal disambiguators during the build, but at the end, replace all disambiguators in the linked json output with a counter value for each crate name. rustc assigns CrateNums to the main crate and its dependencies in a deterministic way, so we could export the list of CrateNum, name, and disambiguator for each crate, then for each name, the lowest CrateNum with that name gets disambiguator 0, the next-lowest gets 1, and so on. This won't be 100% consistent across different targets (which can have different depedencies entirely), but it should be pretty close, especially for crates that depend only on the standard library.

RyanGlScott commented 3 months ago

If the Cargo-provided disambiguators aren't stable enough, we could potentially implement a renumbering like this as a postprocessing pass after linking.

Yes, I think this is what I had in mind.

This won't be 100% consistent across different targets (which can have different depedencies entirely)

Fair enough. I think it would much more likely to be portable than it is currently, especially for testing purposes.

sauclovian-g commented 3 months ago

We shouldn't have tests that have per-target conditional compilation :-) or if we do, they should be getting built for all targets on all test runs. So I don't think that part is a problem. I guess at some point we might have a test that depends on some library that's got target-dependent code in it, but I think we can worry about that when/if it happens.

Meanwhile we can definitely create new disambiguators that depend on whatever we want; it's just a form of alpha-conversion after all. The trick is to figure out how to do it such that small perturbations don't renumber the entire output. rustc/cargo doesn't really have any reason to care about this (and for cargo, arguably having these things be unstable is a feature that helps prevent accidental crosslinking or version skew) so it's not surprising that the native disambiguators don't really meet our needs.

GaloisInc / mir-json

Add option for producing deterministic disambiguator values #74