flyvy-verifier / flyvy

An experimental framework for temporal verification based on first-order linear-time temporal logic. Our goal is to express transition systems in first-order logic and verify temporal correctness properties, including safety and liveness.

Benchmarking infrastructure improvements #153

Open · tchajed opened 1 year ago

tchajed commented 1 year ago

As suggested by @edenfrenkel:

Saving outputs: Currently the stdout of each run is discarded, but it might be interesting to look at and could be used to extract relevant statistics. Even for verify, the output associated with a failing run might be of interest. For this we could have
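
A minimal sketch of what saving per-run output could look like, using only `std::process` and `std::fs`; the `run_and_save` helper, its arguments, and the file naming are illustrative, not flyvy's actual benchmarking API:

```rust
use std::fs;
use std::path::Path;
use std::process::Command;

/// Run one benchmark command and save its stdout/stderr next to the results,
/// instead of discarding them. `name` and `output_dir` are placeholders.
fn run_and_save(name: &str, cmd: &mut Command, output_dir: &Path) -> std::io::Result<()> {
    let out = cmd.output()?; // captures both stdout and stderr
    fs::create_dir_all(output_dir)?;
    fs::write(output_dir.join(format!("{name}.stdout")), &out.stdout)?;
    fs::write(output_dir.join(format!("{name}.stderr")), &out.stderr)?;
    // Recording the exit status as well makes it easy to spot failing verify runs.
    fs::write(
        output_dir.join(format!("{name}.status")),
        format!("{}\n", out.status),
    )?;
    Ok(())
}
```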

More flexibility in configuring benchmarks: Currently this uses all examples in the examples folder and runs the benchmark with the same arguments for all of them. We might instead want to run a certain benchmark on only a subset of these files, with file-specific arguments, or even with multiple argument configurations per file. This could be done in several ways: a config file that lists the benchmark configurations to run on each file, or a specification inside the file itself of which benchmark configurations to run on it, similar to the way snapshot tests work. (Not all benchmarks must use these, just those where it's necessary.)
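
As a sketch of the config-file option, a hypothetical schema could map each example file to a list of argument configurations. This assumes serde and the toml crate; none of the struct or field names below are flyvy's:

```rust
use std::collections::BTreeMap;

use serde::Deserialize;

/// One benchmark run of a single example file
/// (schema is hypothetical, not flyvy's current format).
#[derive(Debug, Deserialize)]
struct BenchmarkConfig {
    /// Extra command-line arguments for this run.
    args: Vec<String>,
}

/// A whole suite: each example file maps to the configurations to run on it;
/// a file may list several.
#[derive(Debug, Deserialize)]
struct BenchmarkSuite {
    benchmarks: BTreeMap<String, Vec<BenchmarkConfig>>,
}

/// Parse a suite description from a TOML string.
fn parse_suite(text: &str) -> Result<BenchmarkSuite, toml::de::Error> {
    toml::from_str(text)
}
```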

tchajed commented 1 year ago

At least capturing and saving the stdout and stderr is a good idea to implement first, and it would support other extensions (like recording custom statistics). I'm not so sure about storing the output or adding logging, since this infrastructure was meant only to run and record a bunch of benchmarks; if you want to debug a particular run, that should be doable by running it directly. What might be useful is a way to just print the command-line invocation for every benchmark in the suite.
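
A sketch of printing the invocation for every benchmark in the suite; the `Benchmark` struct and its `command` method are stand-ins for whatever the benchmarking library actually exposes:

```rust
use std::process::Command;

/// Stand-in for a configured benchmark: a program plus its arguments.
struct Benchmark {
    program: String,
    args: Vec<String>,
}

impl Benchmark {
    fn command(&self) -> Command {
        let mut cmd = Command::new(&self.program);
        cmd.args(&self.args);
        cmd
    }
}

/// Print each benchmark's command line without running it, so a single run
/// can be reproduced and debugged directly.
fn print_invocations(suite: &[Benchmark]) {
    for bench in suite {
        // Command's Debug output quotes the program and all of its arguments.
        println!("{:?}", bench.command());
    }
}
```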

I definitely intend for benchmarks to be configurable. The library already supports this, and as a first cut you could manually specify a set of benchmarks, especially for the inference code, which needs sort configurations specific to each example.
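
For example, a manually specified first cut could be a hand-written table pairing each example with its example-specific arguments; the file names and flags below are placeholders rather than flyvy's actual examples or CLI options:

```rust
/// A hand-written benchmark list, as a first cut before any config-file
/// support exists. Everything here is illustrative: the example files and
/// the hypothetical sort-configuration flag are not flyvy's real CLI.
fn manual_benchmarks() -> Vec<(&'static str, Vec<&'static str>)> {
    vec![
        // (example file, extra arguments specific to that example)
        ("examples/example_a.fly", vec!["--hypothetical-sort-size", "node=2"]),
        ("examples/example_b.fly", vec!["--hypothetical-sort-size", "node=3"]),
    ]
}
```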