flyvy-verifier / flyvy

An experimental framework for temporal verification based on first-order linear-time temporal logic. Our goal is to express transition systems in first-order logic and verify temporal correctness properties, including safety and liveness.

Benchmarking infrastructure improvements #153

Open · tchajed opened 1 year ago

tchajed commented 1 year ago

As suggested by @edenfrenkel:

Saving outputs: Currently the stdout of each run is discarded, but it might be interesting to look at and could be used to extract relevant statistics. Even for verify, the output associated with a failing run might be of interest. For this we could have
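
A minimal sketch of what saving per-run output could look like, using only `std::process` and `std::fs`; the `run_and_save` helper, its arguments, and the file naming are illustrative, not flyvy's actual benchmarking API:

```rust
use std::fs;
use std::path::Path;
use std::process::Command;

/// Run one benchmark command and save its stdout/stderr next to the results,
/// instead of discarding them. `name` and `output_dir` are placeholders.
fn run_and_save(name: &str, cmd: &mut Command, output_dir: &Path) -> std::io::Result<()> {
    let out = cmd.output()?; // captures both stdout and stderr
    fs::create_dir_all(output_dir)?;
    fs::write(output_dir.join(format!("{name}.stdout")), &out.stdout)?;
    fs::write(output_dir.join(format!("{name}.stderr")), &out.stderr)?;
    // Recording the exit status as well makes it easy to spot failing verify runs.
    fs::write(
        output_dir.join(format!("{name}.status")),
        format!("{}\n", out.status),
    )?;
    Ok(())
}
```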

More flexibility in configuring benchmarks: Currently this uses all examples in the examples folder and runs the benchmark with the same arguments for all of them. We might instead want to run a certain benchmark on only a subset of these files, with file-specific arguments, or even with multiple argument configurations per file. This could be done in several ways: a config file that lists the benchmark configurations to run on each file, or a specification inside the file itself of which benchmark configurations to run on it, similar to the way snapshot tests work. (Not all benchmarks must use these, just those where it's necessary.)
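
As a sketch of the config-file option, a hypothetical schema could map each example file to a list of argument configurations. This assumes serde and the toml crate; none of the struct or field names below are flyvy's:

```rust
use std::collections::BTreeMap;

use serde::Deserialize;

/// One benchmark run of a single example file
/// (schema is hypothetical, not flyvy's current format).
#[derive(Debug, Deserialize)]
struct BenchmarkConfig {
    /// Extra command-line arguments for this run.
    args: Vec<String>,
}

/// A whole suite: each example file maps to the configurations to run on it;
/// a file may list several.
#[derive(Debug, Deserialize)]
struct BenchmarkSuite {
    benchmarks: BTreeMap<String, Vec<BenchmarkConfig>>,
}

/// Parse a suite description from a TOML string.
fn parse_suite(text: &str) -> Result<BenchmarkSuite, toml::de::Error> {
    toml::from_str(text)
}
```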

tchajed commented 1 year ago

At least capturing and saving the stdout and stderr is a good idea to implement first, and it would support other extensions (like recording custom statistics). I'm not so sure about storing the output or adding logging, since this infrastructure was meant only to run and record a bunch of benchmarks; if you want to debug a particular run, that should be doable by running it directly. What might be useful is a way to just print the command-line invocation for every benchmark in the suite.
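
A sketch of printing the invocation for every benchmark in the suite; the `Benchmark` struct and its `command` method are stand-ins for whatever the benchmarking library actually exposes:

```rust
use std::process::Command;

/// Stand-in for a configured benchmark: a program plus its arguments.
struct Benchmark {
    program: String,
    args: Vec<String>,
}

impl Benchmark {
    fn command(&self) -> Command {
        let mut cmd = Command::new(&self.program);
        cmd.args(&self.args);
        cmd
    }
}

/// Print each benchmark's command line without running it, so a single run
/// can be reproduced and debugged directly.
fn print_invocations(suite: &[Benchmark]) {
    for bench in suite {
        // Command's Debug output quotes the program and all of its arguments.
        println!("{:?}", bench.command());
    }
}
```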

I definitely intend for benchmarks to be configurable. The library already supports this, and as a first cut you could manually specify a set of benchmarks, especially for the inference code, which needs sort configurations specific to each example.
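
For example, a manually specified first cut could be a hand-written table pairing each example with its example-specific arguments; the file names and flags below are placeholders rather than flyvy's actual examples or CLI options:

```rust
/// A hand-written benchmark list, as a first cut before any config-file
/// support exists. Everything here is illustrative: the example files and
/// the hypothetical sort-configuration flag are not flyvy's real CLI.
fn manual_benchmarks() -> Vec<(&'static str, Vec<&'static str>)> {
    vec![
        // (example file, extra arguments specific to that example)
        ("examples/example_a.fly", vec!["--hypothetical-sort-size", "node=2"]),
        ("examples/example_b.fly", vec!["--hypothetical-sort-size", "node=3"]),
    ]
}
```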