Add verification loggers and report generator

robin-aws commented 2 years ago

This change adds a best practice we recommend for all Dafny projects: measuring the cost of each individual verification task and setting an upper bound in CI, to guard against future verification instability (see https://github.com/dafny-lang/dafny/issues/1582).

This is done with two steps:

Add the /verificationLogger:csv parameter when invoking dafny, which outputs a CSV file with verification results.
Invoke the dafny-reportgenerator tool with a configured maximum cost, which will fail the build if any task's cost crosses that threshold.

There are more details on the latter tool here: https://github.com/dafny-lang/dafny-reportgenerator

The good news is that (at least after #36), this codebase is in good shape! :) All tasks take less than 5 seconds across a few runs, so I've set 10 seconds as the upper bound for now. I would prefer to bound the resource count instead since that is a more predictable metric, but Dafny doesn't record the resource count when splitting happens until the upcoming 3.5 release.

Note that because the /verificationLogger options cause extra output, I've also changed the lit configuration to no longer assert the exact output of dafny when verifying everything. This will make the build more stable by not depending on the exact console output, as all of the code here is always expected to verify successfully.

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.

cpitclaudel commented 2 years ago

Invoke the dafny-reportgenerator tool with a configured maximum cost, which will fail the build if any task's cost crosses that threshold.

Naive question: why is the bounding done by analyzing the CSV after the fact, rather than just passing timeLimit? Is that because /timeLimit doesn't work at the task level? Or because /timeLimit sometimes just doesn't work?

keyboardDrummer commented 2 years ago

Invoke the dafny-reportgenerator tool with a configured maximum cost, which will fail the build if any task's cost crosses that threshold.

Naive question: why is the bounding done by analyzing the CSV after the fact, rather than just passing timeLimit? Is that because /timeLimit doesn't work at the task level? Or because /timeLimit sometimes just doesn't work?

I have a related question. If the report-generator is something customers should always use for CI, then isn’t it better to have it be part of the dafny CLI ?

robin-aws commented 2 years ago

Invoke the dafny-reportgenerator tool with a configured maximum cost, which will fail the build if any task's cost crosses that threshold.

Naive question: why is the bounding done by analyzing the CSV after the fact, rather than just passing timeLimit? Is that because /timeLimit doesn't work at the task level? Or because /timeLimit sometimes just doesn't work?

I tried to address that in the README for the tool here (https://github.com/dafny-lang/dafny-reportgenerator): "This is better than setting a more aggressive verification cost bound through options like /timeLimit directly, as it allows users to know that their code is still correct, but still blocks code changes that are too expensive to verify and hence likely to break in the future."

robin-aws commented 2 years ago

I have a related question. If the report-generator is something customers should always use for CI, then isn’t it better to have it be part of the dafny CLI ?

Fair question, and especially with the limited functionality the tool is providing so far you could imagine it just being yet more options on the dafny CLI. But we expect that it will gain more functionality over time, and the kind of features that can be strongly decoupled from the core functionality of the dafny tool. It's also very useful to record the results of verification tasks once and then analyze in multiple ways after the fact, whereas if you had to invoke dafny repeatedly you could get different non-deterministic results (at least in terms of duration).

For precedent, the .NET platform also has a similar architecture, where data is written out as a result of options like --logger (for test results) and --collect (for things like testing coverage), and other tools are used to report on that data in various ways.

keyboardDrummer commented 2 years ago

But we expect that it will gain more functionality over time, and the kind of features that can be strongly decoupled from the core functionality of the dafny tool.

What would those features be? When I think of a feature like detecting proofs with a high verification time variability, that seems like something useful to have in the Boogie CLI.

It's also very useful to record the results of verification tasks once and then analyze in multiple ways after the fact, whereas if you had to invoke dafny repeatedly you could get different non-deterministic results (at least in terms of duration).

What do you mean by analyse in multiple ways? Do you mean take the measurements from different proof runs and apply different statistics to them? I imagine in the general case it would be fine to have Boogie apply a common statistic after which you wouldn't want to do any further analysis, which you'd use in your CI.

For precedent, the .NET platform also has a similar architecture, where data is written out as a result of options like --logger (for test results) and --collect (for things like testing coverage), and other tools are used to report on that data in various ways.

I'm all for other tools using Dafny's output, but I think that any Dafny related features customers should use in their CI should be in the Dafny CLI.

dafny-lang / libraries

Add verification loggers and report generator #38