API proposal: Compilation.GetDeterministicKey

jaredpar commented 2 years ago

Background and Motivation

The Compilation type is fully deterministic meaning that given the same inputs (SyntaxTree, CompilationOptions, etc ...) it will produce the same output. By that it means it will produce byte for byte equivalent binaries and the same diagnostics. This is extremely valuable in infrastructure because it allows for caching and opens the door for efficient distributed processing.

At the moment this is hard to leverage because consumers can't determine if two Compilation instances are equivalent for the purposes of determinism. Customers have to resort to hand written comparisons which requires a fairly intimate knowledge of the compiler (for example knowing what does and does not impact determinism). Such solutions are not version tolerant; every time the compiler adds a new property that impacts determinism the solution must be updated. Even when proper equality checks are in place this does not help distributed computing where equality must be decided across different processes.

The motivation here is to provide an API that returns a string based key for a given Compilation such that for two equivalent Compilation instances the key will also be equivalent. This will allow customers who want to leverage deterministic caching and communication to have a simple, and transmittable, way to full describe the Compilation they are processing.

Proposed API

namespace Microsoft.CodeAnalysis
{
    public class Compilation
    {
+        public static string GetDeterministicKey(
+            CompilationOptions compilationOptions,
+            ImmutableArray<SyntaxTree> syntaxTrees,
+            ImmutableArray<MetadataReference> references,
+            ImmutableArray<byte> publicKey,
+            ImmutableArray<AdditionalText> additionalTexts = default,
+            ImmutableArray<DiagnosticAnalyzer> analyzers = default,
+            ImmutableArray<ISourceGenerator> generators = default,
+            ImmutableArray<KeyValuePair<string, string>> pathMap = default,
+            EmitOptions? emitOptions = null,
+            DeterministicKeyOptions options = DeterministicKeyOptions.Default)
    }

    internal enum DeterministicKeyOptions
    {
+       /// <summary>
+       /// The default is to include all inputs to the compilation which impact the output of the 
+       /// compilation: binaries or diagnostics.
+       /// </summary>
+       Default = 0b0,

+       /// <summary>
+       /// Ignore all file paths, but still include file names, in the deterministic key.
+       /// </summary>
+       IgnorePaths = 0b0001,

+       /// <summary>
+       /// Ignore the versions of the tools contributing to the build: compiler version
+       /// runtime version, framework, os, etc ...
+       /// </summary>
+       IgnoreToolVersions = 0b0010,
    }
}

The return of GetDeterministicKey is an opaque string that full represents the content of the Compilation contents. Two Compilation which produce different output, diagnostics or binaries, will have different strings returned for this function.

The return of GetDeterministicKey can, and by default will, change between versions of the compiler. That is true of both the content of the string as well as the underlying format. The content must change because part of the input to compilation is the version of the compiler. The format will change as desired by the implementation. Consumers should not take any dependency on the format of this string other than it being an effective hash of the Compilation it came from.

The string returned here will be human readable and visually diffable. It will not be a minimal representation though. The content can, and is expected to be, compressed further with a hashing function such as SHA-256.

For example here is the proposed return for the following net5.0 program:

System.Console.WriteLine("Hello World");

{
  "compilation": {
    "toolsVersions": {
      "compilerVersion": "4.1.0-dev",
      "runtimeVersion": "6.0.0-rc.1.21451.13+d7619cd4b165c13430484e381042f055d4c5a9c7",
      "framework": ".NET 6.0.0-rc.1.21451.13",
      "os": "Microsoft Windows 10.0.19043"
    },
    "options": {
      "outputKind": "ConsoleApplication",
      "moduleName": null,
      "scriptClassName": "Script",
      "mainTypeName": null,
      "cryptoPublicKey": "",
      "cryptoKeyFile": null,
      "delaySign": null,
      "publicSign": false,
      "checkOverflow": false,
      "platform": "AnyCpu",
      "optimizationLevel": "Debug",
      "generalDiagnosticOption": "Default",
      "warningLevel": 4,
      "deterministic": true,
      "debugPlusMode": false,
      "referencesSupersedeLowerVersions": false,
      "reportSuppressedDiagnostics": false,
      "nullableContextOptions": "Disable",
      "specificDiagnosticOptions": [],
      "localtime": null,
      "unsafe": false,
      "topLevelBinderFlags": "None"
    }
  },
  "additionalTexts": [],
  "analyzers": [],
  "generators": [],
  "emitOptions": {},
  "references": [ 
     {
        "name": "netstandard",
        "version": "2.1.0.0",
        "publicKey": "02400480009400062000240052534131040010104b86c4cb78549b34bab61a3b180e23bfeb5b3ec3907441536a7e3cbd97f5f4cff857155a8928eaa29ebfd11cfbbad3ba7efea7bda3226c6a8d37a4cd33f714486b6ebc225985a638471e6ef571cc92a4613c0b8fa65d61ccee0cbe5f36330c9a01f4183559f1bef24cc2917c6d913e3a541333a1d5d9bed22b38cb",
        "mvid": "6ca937db-7ee1-4d79-b8c0-b6ac19aa6c81",
        "properties": {
          "kind": "Assembly",
          "embedInteropTypes": false
        }
      },
     // rest omitted for brevity
  ]
}

The full output can be seen here

Draft implementation of the PR can be seen here

Usage Examples

Output caching

Consider that build caching on outputs could be implemented by the compiler server or IDE. Computing this key is very cheap. The only extra work involved compared to actually compiling code is allocating a string. All of the work like generating checksums for source files is already done as a part of the compilation process. This is a very cheap way to generate a key for the output of compilation.

LSIF

IDE & Compiler communication Consider that today when running Visual Studio today both the C# IDE and Compiler are running "compilation" servers. The IDE is more focused on providing services such as completion, semantic coloring, etc ... and the Compiler is focused on compiling.

Under the hood though they are doing much of the same work: loading references, parsing files, binding, etc ... This is all duplicated work that eats up CPU cycles and RAM.

Given the presence of this API it opens the door for build to actually leverage the C# IDE for emit. Consider that the compiler server can use the output of GetDeterministicKey to efficiently, and correctly, communicate with the IDE server about the different Compilation they are each processing. It would be reasonable for the server to first check if the IDE can satisfy emit for a Compilation before attempting to process it directly.

Note: this scenario does require more work as the Compilation objects created in the IDE and Compiler differ in subtle ways. Having this function though would be motivation to clean these up such that they are equal.

Alternative Designs

Expose Builder

The underlying implementation of this API is done in a builder style type. An alternate approach is to expose this builder directly, put the options there and ask consumers to use that.

public sealed class CSharpDeterministicKeyBuilder
{
        public string GetDeterministicKey(
            ImmutableArray<AdditionalText> additionalTexts = default,
            ImmutableArray<DiagnosticAnalyzer> analyzers = default,
            ImmutableArray<ISourceGenerator> generators = default,
            EmitOptions? emitOptions = null,
            DeterministicKeyOptions options = DeterministicKeyOptions.Default)
}

I prefer the proposed approach because I think it's more discoverable (it's there when developers dot off of Compilation) and defining a type that just has a single function doesn't seem worth it.

Specify key format

The string format of the key is listed as opaque here. The reason for that is that is the desire to envolve the implementation without fear of breaking consumers. It's possible we discover elements that impact determinism that we want to encode, or decide to remove others or just change the format altogether to one that is easier to encode. Didn't want to have to consider compat here.

I don't think this de-values the API. The intent isn't to give consumers new information about a Compilation, it's to take existing information and encode it in a specific way.

That all being said we could decide to specify that the data is encoded in JSON, provide a schema and let consumers parse it as they desire.

Make it a Merkle tree

The JSON output is encoded such that every object in the tree is fully descriptive of the contents. That the object produced for two EmitObjects will be the same if the EmitObjects are equal and different if they are not.

The JSON output is effectively a tree of content based nodes. It very much resembles a Merkle Tree except that it's missing content checksums on the nodes. That could be added. I did not though because it was extra computation and allocations and was not strictly needed for this implementation.

Risks

Determinism is hard and nuanced in a few places. In order to correctly implement this API I had to walk through every type involved in Compilation, examine the properties and make decisions about whether it did or did not impact determinism. For all the items that impact determinism I had to find a string encoding approach that represented the content of that value.

For example for files and references it's not enough to encode just names. Instead the API has to encode that content in the return. Generally that is done by including a crytographic checksum of the content (much as Git does).

My biggest fear with this API is that I missed options. My implementation has explicit references to every option I did not encode and justification for exclusion. I will be encouraging reviewers to push back on this as well as search for items that I may have missed

Work Remaining

[x] Testing: the implementation is ready but I have neglected to fully test it. I want to get general buy off on the implementation approach (using a JSON) encoding scheme before finishing the testing. Otherwise I risk having to spend many hours re-baselining tests to a new format.

CyrusNajmabadi commented 2 years ago

Path handling will definitely be interesting here. Perhaps instead of having IgnorePaths there should be a common path root option that would then relativise all paths against. This would mean that that a file rename would cause a different key. However, if you moved the project to a different folder, then the key would stay the same.

jaredpar commented 2 years ago

The issue with common path root is that there isn't always a common root. Consider the scenario where a compilation is asked to compile the following source files:

c:\users\jaredpar\.nuget\packages\SourceDistroPackage\v1.0.0\Util.cs
d:\code\project\app.cs

There is no common root as they exist on completely different drives.

The solution to this problem though is to use /pathmap. That is our officially supported way for normalizing out the path differences. For example the above could be fixed by using the following: /pathmap:c:\users\jaredpar\.nuget=q:\nuget;d:\code=q:\code. That causes the compiler to act as if the source arguments looked like

q:\nuget\packages\SourceDistroPackage\v1.0.0\Util.cs
q:\code\project\app.cs

The intent being that every compilation could use /pathmap to normalize their builds such that paths don't factor into the output. The intent is that the key here represents the data after we apply /pathmap to the inputs. That is not mentioned in the description though (I will add it). Hence that is intended to continue being the method of normalizing out paths.

As you can imagine, or just see from what I wrote, this is non-trivial to get right. It also requires that all users building the code agree on the final path story. The intent of IgnorePaths was to just eliminate that by removing full paths from the output entirely. Normalizing everything to just a file name.

It seemed like there were scenarios which would benefit from that simplicity. Consider for example if you were writing a service that provided completion APIs. The path where the code was compiled on is not really relevant there. Hence you'd likely want to communicate and do caching without considering paths (more hits).

CyrusNajmabadi commented 2 years ago

The intent of IgnorePaths was to just eliminate that by removing full paths from the output entirely. Normalizing everything to just a file name.

Can you dive into details on the corner cases of this? for example, what if two files are in different paths, but have the same filename?

CyrusNajmabadi commented 2 years ago

The intent being that every compilation could use /pathmap to normalize their builds such that paths don't factor into the output.

Gotcha. I very much like that :)

jaredpar commented 2 years ago

The intent of IgnorePaths was to just eliminate that by removing full paths from the output entirely. Normalizing everything to just a file name.

Can you dive into details on the corner cases of this? for example, what if two files are in different paths, but have the same filename?

It doesn't really create any corner cases for the implementation here. The primary identification of files, for purposes of determinism, is about their content. That is always provided as a checksum in the file entry. For example:

"syntaxTrees": [
  {
    "fileName": "Utils.cs",
    "text": {
      "checksum": "1b565cf6f2d814a4dc37ce578eda05fe0614f3d",
      "checksumAlgorithm": "Sha1",
      "encoding": "Unicode (UTF-8)"
    },
    "parseOptions": {
      "languageVersion": "Preview",
      "specifiedLanguageVersion": "Preview"
    }
  }
]

Hence there is no risk that this option changes the identity of the file. It's all about what tolerance the consumer does or does not have for file paths.

Overall file paths are a bit of a gray area in determinism. They concretely matter in that the same compilation on two different paths will produce two different outputs. That is more because the paths get embedded into the PDB for debugging purposes, handling of #line directives, etc ... This is primarily about diagnostics though, not behavior. The vast majority of compilations will not observe a difference in their compilations based on where the code was compiled. Yet because they concretely matter build systems have to expend a lot of effort normalizing the paths; either by always building in the same root or doing the work to find the right /pathmap setting.

My motivation for IgnorePaths is that in most cases it functionally doesn't matter. Also by the time you are at the Roslyn API level you often lack the information necessary to create /pathmap correctly. Or at least you lack the information the SDK uses to do automatic /pathmap generation. Even with that information the experts often spend considerable time discussing what the correct approach is.

Essentially IgnorePaths is an easy button let customers at the API level decide if paths do or don't matter for the purposes of identifying identical compilations.

RikkiGibson commented 2 years ago

Paths also matter for CallerInfo attributes, right?

jaredpar commented 2 years ago

Correct. That is why I was careful to put "vast majority" 😄

That is the most direct way in which paths can impact behavior. There are a few others but they start to get very corner case.

To be clear though: I'm not attempting to brush this off entirely. Paths can impact behavior. But it's generally in the margins and I think there are significant sets of customers who would prefer they be able to ignore them for the purpose of identifying equal compilations.

This is not a default though because the default is strict determinism. But it seems like a desirable tweak vs. teaching everyone about /pathmap and the rigors of getting that correct.

RikkiGibson commented 2 years ago

A few other random thoughts:

It would be useful to see a few examples of how the "references" elements are encoded in the issue description. e.g. does it just include the MVIDs of things, or is there more to it.
It feels like when build and IDE adopt this API we will want to understand what it means when there are cache misses. Perhaps the SDK or some people's projects will end up passing "design-time only" values for things. We might want to look into the ability to diagnose issues like "this similar compilation doesn't match this one because of some property that got set by component X". It feels like when weird differences like this exist, the perf/memory usage hit could be significant, so ideally it would be hard to quietly regress this.
There are also major complexities around the lifetimes of the compiler and IDE servers. Consider the case where we run a build and then launch VS. If we strictly have a solution where the build attempts to delegate to the IDE server, then the IDE won't be able to benefit from the work that occurred for the initial build. Are we ok with that?

rainersigwald commented 2 years ago

The content must change because part of the input to compilation is the version of the compiler.

I don't see "compiler version" in the JSON outputs. Oversight or intentional?

The string returned here will be human readable and visually diffable. It will not be a minimal representation though. The content can, and is expected to be, compressed further with a hashing function such as SHA-256.

Can you expand on the reasoning here a bit? I'm not against it, but returning a SHA-like byte array or something feels like it could be implemented more efficiently, and this being cheap is an explicit goal. That said, having stared at horrible byte/int hashes and tried to divine what was actually different I can certainly appreciate the plain-text-ness.

CyrusNajmabadi commented 2 years ago

It doesn't really create any corner cases for the implementation here. The primary identification of files, for purposes of determinism, is about their content. That is always provided as a checksum in the file entry. For example

My concern is also thigns like 'sorting' of those items. Say you ran things once and got:

"syntaxTrees": [
  {
    "fileName": "Utils.cs",
    "text": {
      "checksum": "1b565cf6f2d814a4dc37ce578eda05fe0614f3d",
      "checksumAlgorithm": "Sha1",
      "encoding": "Unicode (UTF-8)"
    },
    "parseOptions": {
      "languageVersion": "Preview",
      "specifiedLanguageVersion": "Preview"
    }
  },
  {
    "fileName": "Utils.cs",
    "text": {
      "checksum": "somethingelse",
      "checksumAlgorithm": "Sha1",
      "encoding": "Unicode (UTF-8)"
    },
    "parseOptions": {
      "languageVersion": "Preview",
      "specifiedLanguageVersion": "Preview"
    }
  }
]

But then you ran it again, and those items swapped location. i think we'd need to be clear that that won't happen. or, that if it does it does represent a change to the deterministic key.

--

note: i have no problem wit this. it's more that i'm used to people screwing up thigns with filenames. so i'd almost prefer just not having them at all versus having their names, but having people then do confusing things with same-named files in different paths.

jaredpar commented 2 years ago

@rainersigwald

I don't see "compiler version" in the JSON outputs. Oversight or intentional?

The gist I have in the issue is out of date with the latest that does have it. Will update that in just a sec here.

Can you expand on the reasoning here a bit? I'm not against it, but returning a SHA-like byte array or something feels like it could be implemented more efficiently, and this being cheap is an explicit goal. That said, having stared at horrible byte/int hashes and tried to divine what was actually different I can certainly appreciate the plain-text-ness

The reason for this is explicitly for human consumption / debugging purposes. I work in the compiler team and I personally get extremely discouraged when I have two compilations which I thought were equal turn out to be in fact different. It can take a lot of effort to track down why that is happening and often involves digging through binlog files, disassembling, diffing, etc ... It sucks.

Anyone using this API is at some point likely going to go through the same exercise: why are these two compilations different? By returning just a SHA we give them no help. But returning the diffable human readable form makes it incredibly simple to see what is different.

I'm very open to ideas like making it two distinct APIs here.

jaredpar commented 2 years ago

@RikkiGibson

It would be useful to see a few examples of how the "references" elements are encoded in the issue description. e.g. does it just include the MVIDs of things, or is there more to it.

The gist in the issue has a full output of the call and that includes reference elements. I can add one to the inline sample so that the general structure is there and then omit the rest for brevity.

We might want to look into the ability to diagnose issues like "this similar compilation doesn't match this one because of some property that got set by component X"

This is an action I'm very interested in. Historically the IDE and Compiler have differed on build items in tiny ways. To the point that most engineers would argue that the IDE and Compiler have the same information until they dig into very low levels and discover the tiny differences. This API would make it very easy to determine both if they are equivalent and if they're not what are the differences. No need for deep debugging, just a simple text diff. I expect this will be very informative to us and allow us to quickly address the gaps.

There are also major complexities around the lifetimes of the compiler and IDE servers. Consider the case where we run a build and then launch VS. If we strictly have a solution where the build attempts to delegate to the IDE server, then the IDE won't be able to benefit from the work that occurred for the initial build. Are we ok with that?

I agree there are a lot of challenges getting to a point where we could effectively put the Compiler + IDE servers into a single process and remove loads of duplicate work. I don't believe this API alone will come close to solving it. But this API does provide one of the necessary primitives for communicating compilations across processes: the ability to know if each process agrees on the code being compiled.

RikkiGibson commented 2 years ago

We discussed this in the incremental build workstream today. Main thoughts were:

Does this fit in any way with the checksums that are given by the Workspaces layer? It seems like different components can depend on getting checksums for specific attributes of different workspace elements, such as "what are this project's references", etc. @shyamnamboodiripad. Basically the idea is we should be careful to not inadvertently introduce something that is identical-yet-subtly-different-than the existing systems of checksums. In my view, knowingly introducing it is ok as long as we know why we're doing it, blindly introducing it is not 😉
It seems like one of the big long term benefits of this API could be the sharing of physical compilations between the build and IDE. Is there any ability to leverage this or similar API for new forms of incrementality, such as incremental binding of method bodies.

jaredpar commented 2 years ago

It seems like one of the big long term benefits of this API could be the sharing of physical compilations between the build and IDE

That is certainly one application I've thought of. Fundamentally it means that different processes, even across machines, can now efficiently determine if they are holding identical Compilation objects. In the case they're not it provides an efficient way to determine what exactly is different (at least to a human).

Once you can establish you have the same Compilation then it opens up all manner of possibilities like cached diagnostics, compilations, etc ...

CyrusNajmabadi commented 2 years ago

Does this fit in any way with the checksums that are given by the Workspaces layer?

This is a great question, and i definitely think there's overlap here. Where i see us having a challenge is that we want to be able to produce these checksums with as little extra work as possible. So, for example, we want to generate the checksum prior to creating a compilation (as compilations practively parse files today, which adds a lot of cost). If the compilation layer could produce a compiler more cheaply, and could build a deterministickey/checksum on it with zero overhead (beyond what the ide would pay) i think we would switch immediately.

THis would also solve issues that @jasonmalinowski and i realized recently. Namely that some of our checksums are used to make semantic-compilation decisions, but we roll some things into our checksum that aren't relevant. For example, if any project property changes, the checksum changes (regardless of it it would affect teh compiler). THis is a bug on our end. But we'd love to just address the bug by saying: the compiler will take care of this for us :)

RikkiGibson commented 2 years ago

@CyrusNajmabadi are you saying that making the compiler responsible for the checksum of a Compilation is enough to be helpful here, and that compiler for example doesn't need to provide any new API for finer-grained checksums of things within the compilation?

So, for example, we want to generate the checksum prior to creating a compilation (as compilations practively parse files today, which adds a lot of cost).

I hadn't thought about this. That is fairly severe. Are you aware of the roadblocks to making lazy parsing happen in the compilation?

Since the API is basically "give me a snapshot of the compilation options that could affect scenario XYZ that I care about", it would be ideal to just solve the proactive parsing issue and expose the API on Compilation, but if we are unable to do this, it could be necessary to provide API which takes directly the "parts" of a compilation instead. List<SourceText>, list of references, the different kinds of options, etc.

CyrusNajmabadi commented 2 years ago

@CyrusNajmabadi are you saying that making the compiler responsible for the checksum of a Compilation is enough to be helpful here, and that compiler for example doesn't need to provide any new API for finer-grained checksums of things within the compilation?

i believe so. we might also like things like doc-level checksum. but it's less important.

I hadn't thought about this. That is fairly severe. Are you aware of the roadblocks to making lazy parsing happen in the compilation?

I cna't think of anything mandatory. Today we proactively parse so that we can keep the decl-table up to date. I think not only could we be lazy on the decl table, but that that might be a positive (given that we could potentially more aggressively update a compilation without needing that proactive parse).

The potential downside (and @jasonmalinowski might be terrified here) is that we often risk long chains of objects held alive by laziness. In other words, we often represent a tree as a lazy evaluation of an edit against a previous tree. Because you pull immediately, it realizes the full latest trees (incrementally produced). If we're fully lazy here, we have a strong danger of really long chains of things kept alive that then have to do linear-chains of resolution once the lazy is evaluated.

CyrusNajmabadi commented 2 years ago

it could be necessary to provide API which takes directly the "parts" of a compilation instead. List, list of references, the different kinds of options, etc.

I was thinking this as well. And i am very amenable to this. It seems like having both forms could be nice. The 'i have hte pieces, give me the result based on that' and the 'i made the real compilation-layer value, get me its key'. both seem sensible.

RikkiGibson commented 2 years ago

It sounds like it would be best to expose a version of this API outside of Compilation at least when we are trying it out and gauging the impact. Don't want to tie the decision on how to cache and reuse compilations with the decision on how to manage lifetimes of lazy-edited trees.

jaredpar commented 2 years ago

The Compilation is only necessary in that it's a holder of all the elements we need for the hash and lets us easily pivot to C# and VB specific behavior. There is nothing that fundamentally requires Compilation. Fundamentally what we need to do the work is:

Compilation Options
References
Source and additional files
Analyzers and Generators
Emit Options

Does that line up with what the IDE has at the time they want to do the hashing?

CyrusNajmabadi commented 2 years ago

I believe it does. Though @jasonmalinowski would be good for verification . It would also be interesting to see how we tend to look at checksums/determinism here for those particular subpoints. For example, here's what we do to get our 'checksum' for a PEReference:

        private static Checksum CreatePortableExecutableReferenceChecksum(PortableExecutableReference reference, CancellationToken cancellationToken)
        {
            using var stream = SerializableBytes.CreateWritableStream();

            using (var writer = new ObjectWriter(stream, leaveOpen: true, cancellationToken))
            {
                WritePortableExecutableReferencePropertiesTo(reference, writer, cancellationToken);
                WriteMvidsTo(TryGetMetadata(reference), writer, cancellationToken);
            }

            stream.Position = 0;
            return Checksum.Create(stream);
        }

Very specifically we presume from a small amount of data read in that we know what uniquely defined the PERef. This does mean that if two refs had hte same mvids and properties, we'd consider them the same.

I'm not sure if that aligns with the compiler's view, and how much it would read it to make its determination. FOr example, i could envision the compiler reading the entire PE file in and basing it entirely on the entire contents.

jaredpar commented 2 years ago

I'm not sure if that aligns with the compiler's view, and how much it would read it to make its determination. FOr example, i could envision the compiler reading the entire PE file in and basing it entirely on the entire contents.

That aligns with our view. If two DLLs have different content and the same MVID that is basically violating the underlying principles of build. We don't spend time looking for this in our key code today.

CyrusNajmabadi commented 2 years ago

Good to know! We should probably dive on all of these to align, or at least understand if there is any divergenceI don't have time now, but I'm happy to help by eow

dotnet / roslyn