cpitclaudel opened 2 years ago
Excellent write-up and nice findings :)
@cpitclaudel could you comment on what triggered you to make this analysis? What performance issues have you run into when using Dafny?

> could you comment on what triggered you to make this analysis?
The work on the compiler-bootstrap codebase: in that repo, typechecking and translating to Boogie take a lot longer than verifying with Z3, especially once caching is on. As a result the interactive experience is not good: every change leads to 10-20 seconds of waiting for results to come back.
> What performance issues have you run into when using Dafny?
I'm confused about this part of the question. The ones described above, right?
> I'm confused about this part of the question. The ones described above, right?
You answered it, thanks!
Here are some WIP notes that I started writing up about Dafny performance during my last two on-calls. They are not complete, but hopefully they're a start as we consider where to focus optimization efforts, and I won't have much time to work on them more in the near future:
Step 0: Get an idea of where time is spent
We can test the cache's impact by turning on caching and verifying two snapshots:
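Roughly, the measurements take the following shape (a sketch: `Example.dfy` and the snapshot layout are illustrative; `-verifySnapshots:1` is the legacy CLI flag that enables basic verification-result caching):

```sh
# Baseline: verify once, no caching.
time dafny -compile:0 Example.dfy

# Caching on, one snapshot: Dafny picks up Example.v0.dfy.
time dafny -compile:0 -verifySnapshots:1 Example.dfy

# Caching on, two snapshots: with an identical Example.v1.dfy added,
# the second round should be answered entirely from the cache.
time dafny -compile:0 -verifySnapshots:1 Example.dfy
```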
Here, running with the cache, even on a single snapshot, doubles (!) our running time. Verifying the same file twice with caching enabled takes a bit less than three times as long as verifying the file just once. Ideally, we'd like all three of these benchmarks to be just as fast: turning on caching should not slow us down measurably, and reverifying should be instantaneous with a cache on. Not a great start for the cache.[^caching-incomplete]
[^caching-incomplete]: In fact, the cached run reports some errors (!!), partly because of what appears to be an incompleteness bug in the caching that I'm separately looking into.
Here is another simple way to measure how long Dafny spends in everything except calls to Z3:
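A sketch of such a measurement, assuming the legacy `-noVerify` flag (which stops before anything is sent to the solver):

```sh
# Everything except the SMT solver, without and then with the cache:
time dafny -compile:0 -noVerify Example.dfy
time dafny -compile:0 -noVerify -verifySnapshots:1 Example.dfy
```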
Here we see that caching incurs a 50% overhead (!) when running on a single example, but that thankfully it seems to scale (roughly) linearly. We also see that just the parsing, resolving, and typechecking stages, with no SMT solver involved, take about 7 s — that's a long time!
Yet another way to measure these things, if we worry about the time spent generating Z3 formulae (which `-noVerify` doesn't do), is to run with a “no-op” verifier:
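One way to get such a “no-op” verifier is to point Boogie at a stub solver (a sketch: the stub below conveys the idea, though Boogie's SMT-LIB dialogue may require answering a few more queries than just `(check-sat)`):

```sh
# A stub "solver" that answers unsat to every query: Dafny and Boogie still
# generate and print all the Z3 formulae, but no actual solving happens.
cat > noop-z3 <<'EOF'
#!/bin/sh
while read -r line; do
  case "$line" in
    *check-sat*) echo unsat ;;
  esac
done
EOF
chmod +x noop-z3

# PROVER_PATH is a Boogie prover option; Dafny forwards -proverOpt to Boogie.
time dafny -compile:0 -proverOpt:PROVER_PATH=./noop-z3 Example.dfy
```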
So, more than half of the time that we spend on this example is spent before we send anything to Z3. Let's set the caching issues aside for now and focus on vanilla Dafny, and let's add a few more measurements into the mix. The flags we'll use are `-compile:0`, `-noResolve`, `-dafnyVerify:0`, and `-noVerify` (`-noTypecheck` is also supported, but it's effectively a synonym of `-noResolve`).
Additionally, since we're going to start measuring subtle effects, let's rerun each measurement a few times. I'm using `multitime` for this, which is a touch more convenient than a `for` loop[^windows], along with Boogie c33084fc ("Bump the patch version to 2.13.4 (#558)") and Dafny 6c94b3420 ("chore: Exclude `*.pdb` files from the release packages (#2017)").
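Concretely, each configuration is rerun a few times with `multitime` (a sketch; `Example.dfy` stands for the file being benchmarked, and the comments reflect the breakdown below):

```sh
multitime -n 5 dafny -compile:0 -noResolve     Example.dfy  # parse only
multitime -n 5 dafny -compile:0 -dafnyVerify:0 Example.dfy  # + resolve and typecheck
multitime -n 5 dafny -compile:0 -noVerify      Example.dfy  # + translate to Boogie and resolve the result
multitime -n 5 dafny -compile:0                Example.dfy  # + generate VCs and verify with Z3
```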
[^windows]: There are no prebuilt `multitime` packages for Windows, but it builds fine in Msys2. By default it fails if the command returns a non-zero exit code, so I had to patch that part out to run some of the benchmarks below.

Bottom line: on this example, Dafny spends roughly 1 s parsing, typechecking, and resolving; 5 s translating to Boogie and resolving the result; 5 s generating VCs; and 8 s verifying the code.
Step 0.5: Why is Dafny so slow to start?
The Dafny test suite has roughly 1200 files. That means that we pay Dafny's startup cost at least 1200 times. So, let's see why Dafny is slow to start. Here's how long it takes for Dafny to do nothing:
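Measuring that is as simple as timing Dafny on an empty file (a sketch):

```sh
# Startup cost only: nothing to parse, nothing to verify.
touch Empty.dfy
multitime -n 10 dafny -compile:0 -noVerify Empty.dfy
```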
Now, where is Dafny spending all that time? This is a good point to start using a proper profiler.
Step 1: Setting up a profiler
Most .Net profilers are proprietary. Ideally we'd like casual Dafny contributors to be able to reproduce these results, so let's use something free: `dotnet-trace`.
On Windows the usual tool to view the traces generated by `dotnet-trace` is PerfView, but since it's not available on macOS or Linux, let's use `--format Chromium` to get traces readable by Chromium's performance profiler instead (`--format Speedscope` is also nice):
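A typical invocation looks like this (a sketch; recent versions of `dotnet-trace` can launch the process to trace directly after `--`):

```sh
# Trace a full Dafny run and export it in Chromium's trace format.
dotnet-trace collect --format Chromium -- dafny -compile:0 Example.dfy
```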
The result is a file called `Dafny.exe_20220415_121413.chromium.json`. You can open it in Chromium by opening the developer tools, navigating to the Performance panel, and clicking the up arrow (“Load profile”). In this profile there are two threads; that's because the first thing that Dafny does is to allocate a larger stack and fork:

An easy win
Let's focus on the second thread, where the interesting work happens (note that the timings are wrong: it looks like `dotnet-trace` exports seconds where Chromium expects milliseconds, so all times are off by a factor of 1000):

What we see here is that Dafny spends roughly 70% of `ThreadMain` in Boogie, allocating a `MemoryCache`. This cache is not needed yet, though: we only make use of it if we actually turn on verification caching. It may be worth investigating why this `MemoryCache` class is so slow to start (since, in the long run, we'd like to have caching on by default), but for now let's confirm that this profile measured the right thing by delaying initialization of that cache.
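The change amounts to something like the following (a C# sketch, not the actual patch; the class and member names here are hypothetical):

```csharp
using System;
using System.Runtime.Caching;

public static class VerificationResultCache
{
    // Before the fix, the MemoryCache was allocated eagerly during startup,
    // accounting for roughly 70% of ThreadMain in the profile above.
    // Wrapping it in Lazy<T> defers the allocation until the cache is first
    // used, i.e. until verification caching is actually enabled.
    private static readonly Lazy<MemoryCache> cache =
        new Lazy<MemoryCache>(() => new MemoryCache("VerificationResults"));

    public static MemoryCache Cache => cache.Value;
}
```

Let's build again and measure: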
Not bad! From 0.187 s to 0.125 s, that's roughly 30% of Dafny's execution time shaved off (aka a 1.5x speedup). After that the profile is quite a bit flatter, so let's leave it at that.
Profiling a typical run
Let's reuse our previous example:
Unfortunately, the result of this benchmarking run is not very helpful. Dafny and Boogie both started using `async` pervasively not so long ago, and the result is that execution is spread (on my machine) across 8 threads. Part of this is true concurrency, but not all of it is (Dafny typechecks and verifies Boogie modules in parallel, but even before that phase there is quite a bit of jumping between threads that's simply an artifact of using `async`/`await`).

There isn't much that C# as a language lets us do in this case, as far as I can tell. All our code is being converted to a collection of classes implementing state machines, and I could not find a way to turn that off and force the whole program to run synchronously on a single thread, as if there were no `async` calls.

So, instead, I just went ahead and removed every `async` and `await` from the code. That took about 20 minutes, and the non-SMT parts of Dafny and Boogie work just as well. The result is a much more readable trace:

There is still one issue when we zoom out: the execution of long-running tasks seems to be broken up into many relatively thin slivers:
This is the result of the complicated control flow introduced by dotnet's enumerables: execution ping-pongs between the enumerable's code and the per-element code (another source for this sort of graph can be issues with the depth of the stack traces that the profiler is recording, and the solution is the same). The solution, in these cases, is to use a “left-heavy plot” that sorts stack traces, ignoring time — this gives a better overall sense of where time is spent:
For Dafny, we get roughly this (click to zoom in):
What we see here is that we spend a bit less than one second (in turquoise) in resolution, then about two seconds (in blue and lavender) in translation to Boogie, and then a bit under six seconds in Boogie resolution and type checking. That's a significant overhead over the non-profiled run, so we'll have to be careful to double-check how well any potential optimization performs on an uninstrumented run.
Step 1: Can we optimize individual operations?
Unfortunately, no part of this trace highlights one specific low-level operation that would take all the time; instead we have a number of reasonable steps (resolution, type checking, translation to boogie, and a new round of resolution and type checking) each taking some time, with a few minor hotspots in each of them. We can spend some time optimizing, but (1) since these hotspots are at best a few percent of each run, even a 50% speedup on each of them will only yield a few percent worth of runtime improvement, and (2) speedups of a few percent are basically impossible to measure[^impossible].
[^impossible]: It's not just a matter of precision, it's a matter of confounding factors: changing a piece of code can have unexpected effects on other code in the same binary, such as changing the layout of the code, or its alignment, or cause functions to be reordered, etc. These unrelated changes often affect performance in measurable ways, and that noise makes it extremely hard to quantify the performance impact of a small change.
Really, it seems that a lot of Dafny's performance issues are due to small inefficiencies. This is a common pattern in mature software (if there were one giant algorithmic bottleneck in Dafny, someone would have found it by now). I had a look at improving `Type.GetScope()` and `Type.NormalizeExpand()`, but both of these are such a small part of the total runtime that improvements are hard to measure. So, let's move to step 2.

Step 2: Identify redundant work
Instead of optimizing an operation to run faster, let us see whether we could save time by reducing the amount of work that we perform. Are there functions that we call repeatedly with the same arguments? Passes that perform unnecessary work?
To find unnecessary calls, it's best to start with a very small example, so as to be able to understand what Dafny is doing. Let's take these three programs:
- `WrappedInt.dfy`
- `WrappedBool.dfy`
- `Main.dfy`
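Their exact contents aren't essential; roughly, they look like this (a sketch: the module names `IntWrapper`, `BoolWrapper`, and `IntBag` come from the measurements below, but the bodies are illustrative):

```dafny
// WrappedInt.dfy: a small library module wrapping int.
module IntWrapper {
  datatype WrappedInt = WrapInt(value: int)
}

// WrappedBool.dfy: the same, for bool.
module BoolWrapper {
  datatype WrappedBool = WrapBool(value: bool)
}

// Main.dfy: a client module importing both libraries.
include "WrappedInt.dfy"
include "WrappedBool.dfy"

module IntBag {
  import opened IntWrapper
  import opened BoolWrapper

  datatype IntBag = IntBag(contents: seq<WrappedInt>, nonEmpty: WrappedBool)
}
```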
This is pretty redundant code, but it attempts to model the sort of module inclusion that we might find in a typical large Dafny program. Here is a benchmark:
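For instance, timing just the non-SMT part of the pipeline (a sketch):

```sh
multitime -n 5 dafny -compile:0 -noVerify Main.dfy
```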
When running Dafny on this example, we find the following:
Dafny re-resolves previously resolved modules
Dafny does not use the same signature to compile a module and to verify it. As a result, it really creates two copies of each module: a regular copy and a copy called `_Compile` that is used only in the compiler.

To create `_Compile` modules, Dafny clones the original module, then swaps in a different signature and re-resolves the resulting module. For the program above, this is made evident by the fact that the `Resolve` method is called 10 times, not 5. This is our first sign of repeated work:

0.0. Dafny resolves every (non-abstract) module twice.
The translator runs once per verifiable module
To verify a file, Dafny currently collects all modules in that file (it calls them verifiable modules, as opposed to included ones) and creates one separate Boogie file for each of them. As part of creating the `Translator` object, Dafny reads and resolves Dafny's Boogie prelude (`DafnyPrelude.bpl`) and adds additional built-ins to the resulting module. Already, this hints at one source of repeated work:

0.0. Dafny's prelude is read from disk, parsed, resolved, typechecked, and translated to SMT once per module in the file being checked.
0.1. Dafny's built-ins are created, typechecked, and translated to SMT once per module in the file being checked.

For small programs, this can be significant: for module `IntBag` above Dafny generates a Boogie program with 813 top-level declarations. Of these, 507 are from `DafnyPrelude.bpl`; 211 are from Dafny's `_System` module (built-ins); and only 95 are specific to `Main.dfy`.
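These counts can be reproduced by dumping the generated Boogie program (a sketch; `-print` writes Dafny's Boogie output to a file, and the `grep` pattern is only a rough approximation of “top-level declaration”):

```sh
dafny -compile:0 -noVerify -print:IntBag.bpl Main.dfy
grep -c -E '^(type|const|function|axiom|procedure|implementation)' IntBag.bpl
```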
The exported parts of all modules (and then some) are retranslated once per verifiable module
Part of the translation process that is performed for each module consists in exporting (to Boogie) all declarations visible in the current scope, from all other modules. Thankfully we don't generate Boogie `implementations` for anything but the current module, but when a module is used in multiple places this is still a lot of redundant work.

1.0. Library modules are translated to Boogie, resolved, type-checked, and translated to SMT once per verifiable module that uses them.

For `IntBag` above, we start with 507 + 211 = 718 definitions, then add 53 for `IntBag` itself, 21 for `IntWrapper`, and 21 for `BoolWrapper`. Hence in total we are doing 53 units of work specific to `IntBag` and 760 units of work that are at least shared across two modules.

Both this problem and the previous one apply only to multi-module files… except when you consider that when running a language server (while editing in an IDE), each new round of verification reprocesses the (unchanged) prelude as well as all (most likely unchanged) dependencies.
In addition to resolving and typechecking, Boogie performs work that isn't relevant for Dafny
Boogie has support for reasoning about (a certain form of) concurrency. Dafny programs do not make use of that feature, but it still costs time to scan through the code and look for potential usages of the relevant attributes.
2.0. Boogie runs the Civl checker on Dafny's output
There is a similar problem that is trickier to address: Boogie includes some costly passes to handle certain syntactic constructs that Dafny creates very few instances of. For example, on some benchmarks Boogie spends 5% of its time traversing terms to look for lambda expressions, but there is a grand total of two places in Dafny where Boogie-level lambda expressions are created. Hence, a more general example of the above may be this:
2.1. Boogie completely traverses Dafny's output to look for things that may appear only a few times in it.