dafny-lang / dafny

Dafny is a verification-aware programming language
https://dafny.org

Investigate and fix low-level performance issues in Dafny #2358

Open cpitclaudel opened 2 years ago

cpitclaudel commented 2 years ago

Here are some WIP notes that I started writing up about Dafny performance during my last two on-calls. They are not complete, but hopefully they're a start as we consider where to focus optimization efforts, and I won't have much time to work on them more in the near future:

Step 0: Get an idea of where time is spent

We can test the cache's impact by turning on caching and verifying two snapshots:

$ time Dafny -compile:0 Compiler.dfy # Single run as a reference
…
real    0m20.923s
$ cp Compiler.dfy Compiler.v0.dfy # Create one snapshot
$ time Dafny -verifySnapshots:3 -compile:0 Compiler.dfy
…
real    0m39.471s

$ cp Compiler.dfy Compiler.v1.dfy # Create a second snapshot
$ time Dafny -verifySnapshots:3 -compile:0 Compiler.dfy
…
real    1m7.594s

Here, running with the cache, even on a single snapshot, doubles (!) our running time, and verifying the same file twice with caching enabled takes a bit more than three times as long as verifying the file just once. Ideally, all three of these benchmarks would take about the same time: turning on caching should not slow us down measurably, and reverifying should be nearly instantaneous with a warm cache. Not a great start for the cache.[^caching-incomplete]

[^caching-incomplete]: In fact, the cached run reports some errors (!!), partly because of what appears to be an incompleteness bug in the caching that I'm separately looking into.
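For reference, the slowdown factors implied by these timings are easy to compute (a quick sanity check, not an extra benchmark):

```python
# Wall-clock times from the three runs above, in seconds.
single = 20.923          # plain run, no caching
one_snapshot = 39.471    # caching on, one snapshot
two_snapshots = 67.594   # caching on, two snapshots (1m7.594s)

# Caching a single snapshot roughly doubles the running time...
print(f"one snapshot vs. plain:  {one_snapshot / single:.2f}x")
# ...and two snapshots cost a bit over three times a plain run.
print(f"two snapshots vs. plain: {two_snapshots / single:.2f}x")
```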

Here is another simple way to measure how long Dafny spends in everything except calls to Z3:

$ time Dafny -compile:0 -noVerify Compiler.dfy
…
real    0m6.893s

$ cp Compiler.dfy Compiler.v0.dfy # Create one snapshot
$ time Dafny -compile:0 -verifySnapshots:3 -noVerify Compiler.dfy
…
real    0m9.612s

$ cp Compiler.dfy Compiler.v1.dfy # Create a second snapshot
$ time Dafny -compile:0 -verifySnapshots:3 -noVerify Compiler.dfy
…
real    0m20.582s

$ cp Compiler.dfy Compiler.v2.dfy # Create a third snapshot
$ time Dafny -compile:0 -verifySnapshots:3 -noVerify Compiler.dfy
…
real    0m29.502s

Here we see that caching incurs a roughly 40% overhead (!) when running on a single example, but that thankfully the cost seems to scale (roughly) linearly with the number of snapshots. We also see that the parsing, resolving, and typechecking stages alone, with no SMT solver involved, take about 7 s — that's a long time!

Yet another way to measure these things, if we worry about the time spent generating Z3 formulae (which -noVerify skips), is to run with a “no-op” verifier:

$ time Dafny -compile:0 -proverOpt:SOLVER=noop Compiler.dfy
…
real    0m11.804s

$ cp Compiler.dfy Compiler.v0.dfy # Create one snapshot
$ time Dafny -compile:0 -verifySnapshots:3 -proverOpt:SOLVER=noop Compiler.dfy
…

real    0m19.183s

So, more than half of the time that we spend on this example is spent before we send anything to Z3. Let's set the caching issues aside for now and focus on vanilla Dafny, and let's add a few more measurements into the mix. The flags we'll use are -compile:0, -noResolve, -dafnyVerify:0, and -noVerify (-noTypecheck is also supported, but it's effectively a synonym of -noResolve).

Additionally, since we're going to start measuring subtle effects, let's rerun each measurement a few times. I'm using multitime for this, which is a touch more convenient than a for loop[^windows], along with Boogie c33084fc Bump the patch version to 2.13.4 (#558) and Dafny 6c94b3420 chore: Exclude `*.pdb` files from the release packages (#2017).

[^windows]: There are no prebuilt multitime packages for Windows, but it builds fine in Msys2. By default it fails if the command returns a non-zero exit code, so I had to patch that part out to run some of the benchmarks below.

# Do nothing
$ multitime -q -s0 -n10 Dafny
            Mean                Std.Dev.    Min         Median      Max
real        0.187+/-0.0114      0.011       0.168       0.185       0.210
user        0.002+/-0.0045      0.004       0.000       0.000       0.015
sys         0.010+/-0.0069      0.007       0.000       0.015       0.015

# Parse only
$ multitime -q -s0 -n10 Dafny -compile:0 -noResolve Compiler.dfy
            Mean                Std.Dev.    Min         Median      Max
real        0.363+/-0.0104      0.010       0.357       0.359       0.392
user        0.000+/-0.0000      0.000       0.000       0.000       0.000
sys         0.010+/-0.0069      0.007       0.000       0.015       0.015

# Parse, resolve, typecheck, but don't translate
$ multitime -q -s0 -n10 Dafny -compile:0 -dafnyVerify:0 Compiler.dfy
            Mean                Std.Dev.    Min         Median      Max
real        0.961+/-0.0116      0.012       0.940       0.962       0.984
user        0.004+/-0.0069      0.007       0.000       0.000       0.015
sys         0.009+/-0.0074      0.007       0.000       0.015       0.015

# Parse, resolve, typecheck, translate, but don't generate VCs
$ multitime -q -s0 -n10 Dafny -compile:0 -noVerify Compiler.dfy
            Mean                Std.Dev.    Min         Median      Max
real        6.632+/-0.2381      0.238       6.226       6.672       6.985
user        0.003+/-0.0060      0.006       0.000       0.000       0.015
sys         0.012+/-0.0060      0.006       0.000       0.015       0.015

# Parse, resolve, typecheck, generate VCs, but don't verify them
$ multitime -q -s0 -n5 Dafny -compile:0 -proverOpt:SOLVER=noop Compiler.dfy
            Mean                Std.Dev.    Min         Median      Max
real        12.182+/-0.6601     0.366       11.563      12.361      12.587
user         0.000+/-0.0000     0.000       0.000       0.000       0.000
sys          0.012+/-0.0108     0.006       0.000       0.015       0.015

# A complete run
$ multitime -q -s0 -n5 Dafny -compile:0 Compiler.dfy
            Mean                Std.Dev.    Min         Median      Max
real        20.707+/-2.2891     1.269       18.480      20.895      22.247
user         0.000+/-0.0000     0.000       0.000       0.000       0.000
sys          0.009+/-0.0133     0.007       0.000       0.015       0.015

Bottom line: on this example, Dafny spends roughly 1 s parsing, typechecking, and resolving; 5 s translating to Boogie and resolving the result; 5 s generating VCs; and 8 s verifying the code.
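This breakdown follows from subtracting successive means in the tables above; as a quick sanity check:

```python
# Mean wall-clock times (seconds) from the multitime runs above.
parse_resolve = 0.961   # -dafnyVerify:0
translate = 6.632       # -noVerify
generate_vcs = 12.182   # -proverOpt:SOLVER=noop
full = 20.707           # complete run

print(f"parsing/resolving/typechecking: {parse_resolve:.1f} s")
print(f"translation to Boogie:          {translate - parse_resolve:.1f} s")
print(f"VC generation:                  {generate_vcs - translate:.1f} s")
print(f"verification (Z3):              {full - generate_vcs:.1f} s")
```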

Step 0.5: Why is Dafny so slow to start?

The Dafny test suite has roughly 1200 files. That means that we pay Dafny's startup cost at least 1200 times. So, let's see why Dafny is slow to start. Here's how long it takes for Dafny to do nothing:

$ multitime -q -s0 -n10 Binaries/net6.0/Dafny
            Mean                Std.Dev.    Min         Median      Max
real        0.191+/-0.0073      0.007       0.185       0.187       0.203
user        0.000+/-0.0000      0.000       0.000       0.000       0.000
sys         0.012+/-0.0060      0.006       0.000       0.015       0.015
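At roughly 0.19 s per invocation, that startup cost adds up across the test suite; a back-of-the-envelope estimate (mine, not from the original runs):

```python
files = 1200       # approximate size of the Dafny test suite
startup_s = 0.19   # mean "do nothing" time measured above, in seconds

total = files * startup_s
print(f"~{total:.0f} s (~{total / 60:.1f} min) of pure startup per full test run")
```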

Now, where is Dafny spending all that time? This is a good point to start using a proper profiler.

Step 1: Setting up a profiler

Most .Net profilers are proprietary. Ideally we'd like casual Dafny contributors to be able to reproduce these results, so let's use something free: dotnet-trace

$ dotnet tool install --global dotnet-trace

On Windows the usual tool to view the traces generated by dotnet-trace is PerfView, but since it's not available on macOS or Linux, let's use --format Chromium to get traces readable by Chromium's performance profiler instead (--format Speedscope is also nice):

$ dotnet-trace collect --format Chromium -- Binaries/net6.0/Dafny.exe

The result is a file called Dafny.exe_20220415_121413.chromium.json. You can open it in Chromium by opening the developer tools, navigating to the Performance panel, and clicking the up arrow (“Load profile”). This profile contains two threads; that's because the first thing that Dafny does is to spawn a fresh thread with a larger stack:

public static int Main(string[] args) {
  int ret = 0;
  var thread = new System.Threading.Thread(
    new System.Threading.ThreadStart(() => { ret = ThreadMain(args); }),
      0x10000000); // 256MB stack size to prevent stack overflow
  thread.Start();
  thread.Join();
  return ret;
}

An easy win

Let's focus on the second thread, where the interesting work happens. (Note that the timings shown are wrong: it looks like dotnet-trace exports seconds where Chromium expects milliseconds, so all times are off by a factor of 1000.)

firstpic
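As an aside, the factor-1000 unit mismatch can be corrected before loading the trace. Here is a minimal sketch (not part of the original investigation), assuming the exported JSON follows the standard Trace Event format, i.e. a top-level traceEvents array whose events carry ts and, optionally, dur fields:

```python
import json
import sys

def rescale_trace(path_in, path_out, factor=1000.0):
    """Multiply each event's ts/dur by `factor` so Chromium displays sane times."""
    with open(path_in) as f:
        trace = json.load(f)
    for event in trace.get("traceEvents", []):
        for field in ("ts", "dur"):
            if field in event:
                event[field] *= factor
    with open(path_out, "w") as f:
        json.dump(trace, f)

if __name__ == "__main__" and len(sys.argv) == 3:
    rescale_trace(sys.argv[1], sys.argv[2])
```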

What this profile shows is that Dafny spends roughly 70% of ThreadMain in Boogie, allocating a MemoryCache. This cache is not needed yet, though: we only make use of it if we actually turn on verification caching. It may be worth investigating why this MemoryCache class is so slow to start (since, in the long run, we'd like to have caching on by default), but for now let's confirm that the profile measured the right thing by delaying the initialization of that cache:

--- a/Source/ExecutionEngine/VerificationResultCache.cs
+++ b/Source/ExecutionEngine/VerificationResultCache.cs
@@ -675,9 +675,10 @@ static internal class Priority
   }

-  public sealed class VerificationResultCache
-  {
-    private readonly MemoryCache Cache = new MemoryCache("VerificationResultCache");
+  public sealed class VerificationResultCache {
+    private MemoryCache cache = null;
+    private MemoryCache Cache =>
+      cache ??= new MemoryCache("VerificationResultCache");

Let's build again and measure:

$ time multitime -q -s0 -n10 Binaries/net6.0/Dafny
            Mean                Std.Dev.    Min         Median      Max
real        0.125+/-0.0021      0.002       0.123       0.125       0.131
user        0.002+/-0.0045      0.004       0.000       0.000       0.015
sys         0.009+/-0.0102      0.010       0.000       0.007       0.031

Not bad! From 0.187 to 0.125, that's roughly a third of Dafny's startup time shaved off (aka a 1.5x speedup). After that the profile is quite a bit flatter, so let's leave it at that.

Profiling a typical run

Let's reuse our previous example:

$ dotnet build -c Release Source/Dafny.sln
$ dotnet-trace collect -o compiler+noVerify --format Chromium -- \
    Binaries/net6.0/Dafny -compile:0 -noVerify Compiler.dfy

Unfortunately, the result of this benchmarking run is not very helpful. Dafny and Boogie both started using async pervasively not so long ago, and as a result execution is spread (on my machine) across 8 threads. Part of this is true concurrency, but not all of it: Dafny typechecks and verifies Boogie modules in parallel, but even before that phase there is quite a bit of jumping between threads that is simply an artifact of using async/await.

badconcurrency

There isn't much that C# as a language lets us do in this case, as far as I can tell. All our code is being converted to a collection of classes implementing state machines, and I could not find a way to turn that off and force the whole program to run synchronously on a single thread, as if there were no async calls.

So, instead, I just went ahead and removed every async and await from the code. That took about 20 minutes, and the non-SMT parts of Dafny and Boogie work just as well. The result is a much more readable trace:

goodconcurrency

There is still one issue when we zoom out: the execution of long-running tasks seems to be broken up into many relatively thin slivers:

enumerableissues

This is the result of the complicated control flow introduced by .NET's enumerables: execution ping-pongs between the enumerable's code and the per-element code. (Another source of this sort of graph is a limit on the stack-trace depth that the profiler records; the solution is the same.) The fix, in these cases, is to use a “left-heavy” plot that sorts and merges stack traces, ignoring time: this gives a better overall sense of where time is spent.

For Dafny, we get roughly this (click to zoom in):

leftheavy

What we see here is that we spend a bit less than one second (in turquoise) in resolution, then about two seconds (in blue and lavender) in translation to Boogie, and then a bit under six seconds in Boogie resolution and type checking. That's a significant overhead over the non-profiled run, so we'll have to be careful to double-check how well any potential optimization performs on an uninstrumented run.

Step 1.5: Can we optimize individual operations?

Unfortunately, no part of this trace highlights one specific low-level operation that takes all the time; instead we have a number of reasonable steps (resolution, type checking, translation to Boogie, and a new round of resolution and type checking), each taking some time, with a few minor hotspots in each. We can spend some time optimizing, but (1) since these hotspots are at best a few percent of each run, even a 50% speedup on each of them will only yield a few percent worth of runtime improvement, and (2) speedups of a few percent are basically impossible to measure.[^impossible]

[^impossible]: It's not just a matter of precision, it's a matter of confounding factors: changing a piece of code can have unexpected effects on other code in the same binary, such as changing the layout of the code, or its alignment, or cause functions to be reordered, etc. These unrelated changes often affect performance in measurable ways, and that noise makes it extremely hard to quantify the performance impact of a small change.

Really, it seems that a lot of Dafny's performance issues are due to an accumulation of small inefficiencies. This is a common pattern in mature software (if there were one giant algorithmic bottleneck in Dafny, someone would have found it by now). I had a look at improving Type.GetScope() and Type.NormalizeExpand(), but both of these are such a small part of the total runtime that improvements are hard to measure. So, let's move on to step 2.

Step 2: Identify redundant work

Instead of optimizing an operation to run faster, let us see whether we could save time by reducing the amount of work that we perform. Are there functions that we call repeatedly with the same arguments? Passes that perform unnecessary work?
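One lightweight way to surface this kind of redundancy is to instrument suspect functions and count repeated invocations with identical arguments. Here is a hypothetical sketch in Python (in Dafny itself one would add equivalent counters to the C# methods of interest; `resolve` below is a stand-in, not the real resolver):

```python
from collections import Counter
from functools import wraps

def count_calls(fn):
    """Record every argument tuple; pairs seen more than once suggest redundant work."""
    seen = Counter()

    @wraps(fn)
    def wrapper(*args):
        seen[args] += 1
        return fn(*args)

    # Report only the argument tuples that were passed more than once.
    wrapper.redundant = lambda: {a: n for a, n in seen.items() if n > 1}
    return wrapper

@count_calls
def resolve(module_name):  # hypothetical stand-in for e.g. a Resolve method
    return f"resolved {module_name}"

for m in ["A", "B", "A", "A", "C", "B"]:
    resolve(m)

print(resolve.redundant())  # {('A',): 3, ('B',): 2}
```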

To find unnecessary calls, it's best to start with a very small example, so as to be able to understand what Dafny is doing. Let's take these three programs:

WrappedInt.dfy

  module WrappedInt {
    type T = int
    function method of_int(i: int) : T { i }
    function method to_int(t: T) : int { t }
  }

WrappedBool.dfy

  module WrappedBool {
    type T = bool
    function method of_bool(i: bool) : T { i }
    function method to_bool(t: T) : bool { t }
  }

Main.dfy

  include "WrappedInt.dfy"
  include "WrappedBool.dfy"

  module IntBag {
    import WrappedInt
    import WrappedBool

    datatype Bag = Empty | IBag(valid: WrappedBool.T, count: WrappedInt.T)
    function method size(b: Bag) : WrappedInt.T {
      WrappedInt.of_int
        (match b
           case Empty => 0
           case IBag(v, c) =>
             if WrappedBool.to_bool(v) then WrappedInt.to_int(c) else 0)
    }
  }

  module BoolBag {
    import WrappedInt
    import WrappedBool

    datatype Bag = Empty | BBag(count: WrappedInt.T, data: seq<WrappedBool.T>)
    function method size(b: Bag) : WrappedInt.T {
      WrappedInt.of_int
        (match b
           case Empty => 0
           case BBag(c, data) =>
             var i := WrappedInt.to_int(c);
             if i == 0 then |data| else i)
    }
  }

This is pretty redundant code, but it attempts to model the sort of module inclusion that we might find in a typical large Dafny program. Here is a benchmark:

$ multitime -q -s0 -n10 Binaries/net6.0/Dafny -compile:0 -noVerify Main.dfy
            Mean                Std.Dev.    Min         Median      Max
real        1.024+/-0.0101      0.010       0.999       1.028       1.035
user        0.002+/-0.0045      0.004       0.000       0.000       0.015
sys         0.010+/-0.0069      0.007       0.000       0.015       0.015

When running Dafny on this example, we find the following:

Dafny re-resolves previously resolved modules

Dafny does not use the same signature to compile a module and to verify it. As a result, it effectively creates two copies of each module: a regular copy and a copy called _Compile that is used only by the compiler.

To create _Compile modules, Dafny clones the original module, swaps in a different signature, and re-resolves the resulting module. For the program above, this is made evident by the fact that the Resolve method is called 10 times, not 5. This is our first sign of repeated work:

0.0. Dafny resolves every (non-abstract) module twice.

The translator runs once per verifiable module

To verify a file, Dafny currently collects all modules in that file (it calls them verifiable modules, as opposed to included ones) and creates one separate Boogie file for each of them. As part of creating the Translator object, Dafny reads and resolves Dafny's Boogie prelude (DafnyPrelude.bpl) and adds additional built-ins to the resulting module. Already, this hints at one source of repeated work:

0.1. Dafny's prelude is read from disk, parsed, resolved, typechecked, and translated to SMT once per module in the file being checked.

0.2. Dafny's built-ins are created, typechecked, and translated to SMT once per module in the file being checked.

For small programs, this can be significant: for module IntBag above, Dafny generates a Boogie program with 813 top-level declarations. Of these, 507 are from DafnyPrelude.bpl, 211 are from Dafny's _System module (built-ins), and only 95 are specific to Main.dfy.

The exported parts of all modules (and then some) are retranslated once per verifiable module

Part of the translation process performed for each module consists of exporting (to Boogie) all declarations visible in the current scope, from all other modules. Thankfully we don't generate Boogie implementations for anything but the current module, but when a module is used in multiple places this is still a lot of redundant work.

1.0. Library modules are translated to Boogie, resolved, type-checked, and translated to SMT once per verifiable module that uses them

For IntBag above, we start with 507 + 211 = 718 definitions, then add 53 for IntBag itself, 21 for WrappedInt, and 21 for WrappedBool. Hence in total we are doing 53 units of work specific to IntBag and 760 units of work that are shared across at least two modules.
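These counts can be tallied explicitly (a quick sanity check on the numbers above):

```python
prelude = 507      # declarations from DafnyPrelude.bpl
builtins = 211     # Dafny's _System built-ins
int_bag = 53       # declarations specific to module IntBag
wrapped_int = 21   # WrappedInt
wrapped_bool = 21  # WrappedBool

total = prelude + builtins + int_bag + wrapped_int + wrapped_bool
shared = prelude + builtins + wrapped_int + wrapped_bool
print(f"total declarations: {total}")                             # 813
print(f"shared across modules: {shared} ({shared / total:.0%})")  # 760 (93%)
```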

Both this problem and the previous one apply only to multi-module files… except when you consider that when running a language server (while editing in an IDE), each new round of verification reprocesses the (unchanged) prelude as well as all (most likely unchanged) dependencies.

In addition to resolving and typechecking, Boogie performs work that isn't relevant for Dafny

Boogie has support for reasoning about (a certain form of) concurrency. Dafny programs do not make use of that feature, but it still costs time to scan through the code and look for potential usages of the relevant attributes.

2.0. Boogie runs the Civl checker on Dafny's output

There is a similar problem that is trickier to address: Boogie includes some costly passes to handle certain syntactic constructs of which Dafny creates very few instances. For example, on some benchmarks Boogie spends 5% of its time traversing terms to look for lambda expressions, but there are a grand total of two places in Dafny where Boogie-level lambda expressions are created. Hence, a more general version of the above may be this:

2.1. Boogie completely traverses Dafny's output to look for things that may appear only a few times in it.

keyboardDrummer commented 2 years ago

Excellent write-up and nice findings :)

keyboardDrummer commented 2 years ago

@cpitclaudel could you comment on what triggered you to make this analysis? What performance issues have you run into when using Dafny?

cpitclaudel commented 2 years ago

could you comment on what triggered you to make this analysis?

The work on the compiler-bootstrap codebase: in that repo typechecking and translating to Boogie takes a lot longer than verifying with Z3, especially once caching is on. As a result the interactive experience is not good: every change leads to 10-20 seconds of waiting for results to come back.

What performance issues have you run into when using Dafny?

I'm confused about this part of the question. The ones described above, right?

keyboardDrummer commented 2 years ago

I'm confused about this part of the question. The ones described above, right?

You answered it, thanks!