Increase verification of 3C output in benchmark tests

mattmccutchen-cci commented 3 years ago

Currently, the benchmark tests only check that 3C completes successfully (exit code 0). We should add more verification of the output. Ideas:

Without -alltypes, the output should compile.
With -alltypes, we could compare the output to a set of "reference" output files we maintain somewhere (similar to the manual evaluation procedure in the 3C document that we haven't used in a while). We have to think about how to make this useful: just having the tests fail every time the output changes may not be useful enough to justify the hassle.

Another idea from Mike: Compare the count of each kind of checked pointer (similar to Tables 2-4 of the 3C paper submission (I won't paste the capability-granting link here)) and the number of Checked C type checking errors between the actual output and the reference output. If the count of checked pointers increases and the number of compile errors doesn't, that's probably good. We'd still have to decide how to translate that to a pass/fail status or how else to ensure we remember to monitor the quality of conversion.

john-h-kastner commented 3 years ago

Since many of the tests depend on the convert-project.py, we'll probably want to handle at least part of #329 at the same time.

mattmccutchen-cci commented 3 years ago

Update:

Since correctcomputation/actions#7, the workflow tests that the non -alltypes output compiles.
Since correctcomputation/actions#24, the workflow tests that the -alltypes output compiles except for some error message templates that we've designated as "known bounds inference limitations". However, we should still do more to assess the quality of the conversion.

Unfortunately, both of these tests are currently far from passing. There is a long tail of 3C issues affecting the benchmarks, and fixing them hasn't been our highest priority.

correctcomputation / checkedc-clang

Increase verification of 3C output in benchmark tests #428