Improve the output of clerk test

denismerigoux commented 3 months ago

As of now, the typical output of clerk test <folder> is:

[27/27] <test> 'tests@test'

[PASS] tests:   5/5

Where 5 is the number of files containing one or more tests found inside <folder>. If there is a test failure, the output looks like this:

[21/27] <test> tests/benefices_non_commerciaux.catala_fr
--- reference
+++ current-output
@@ -159,7 +159,7 @@
 ┌─[RESULT]─
 │ sortie =
 │   Impot_revenu.BénéficesNonCommerciauxFoyerFiscal {
-│     -- résultats_liquidatin_bénéfices_non_commerciaux:
+│     -- résultats_liquidation_bénéfices_non_commerciaux:
 │       [
 │         Impot_revenu.BénéficesNonCommerciauxDéclarant {
 │           -- abattement_forfaitaire_micro_professionnel: 0,00 €
[27/27] <test> 'tests@test'
FAILED: tests@test 
out='tests@test' ; success=$( tr -cd 0 < '_build/tests@test' | wc -c ) ; total=$( wc -c < '_build/tests@test' ) ; pass=$( ) ; if test "$success" -eq "$total" ; then printf "\n[\033[32mPASS\033[m] \033[1m%s\033[m: \033[32m%3d\033[m/\033[32m%d\033[m\n" ${out%@test} $success $total ; else printf "\n[\033[31mFAIL\033[m] \033[1m%s\033[m: \033[31m%3d\033[m/\033[32m%d\033[m\n" ${out%@test} $success $total ; return 1 ; fi

[FAIL] tests:   4/5

However, because this command is the primary testing method we recommend for a typical Catala workflow, its output should look better and provide more accurate information. Here is a list of improvements that could be made:

- instead of displaying 5/5, it should display 37/37 tests across 5 files
- when there is a test failure, the output should be a clean listing of all tests that have failed, grouped by file, and not just the first test that failed.

I suspect the relevant code to tweak for these improvements is here:

https://github.com/CatalaLang/catala/blob/e7853d69cf1f258142ef6d23a0bdd083d7e2d14e/build_system/clerk_driver.ml#L564-L566

https://github.com/CatalaLang/catala/blob/e7853d69cf1f258142ef6d23a0bdd083d7e2d14e/build_system/clerk_driver.ml#L580-L600

AltGr commented 3 months ago

Better test output

As you can see in the last chunk of code you linked, the way the status is reported is quite ugly: since running the tests is handled by ninja, we just use a special rule at the end for reporting the results. At the moment, this rule is just a short shell snippet; it can be seen in _build/clerk.ninja after running clerk in debug mode:

rule test-results
  command = out=${out} ; success=$$( tr -cd 0 < ${in} | wc -c ) ; total=$$( wc -c < ${in} ) ; pass=$$( ) ; if test "$$success" -eq "$$total" ; then printf "\n[PASS]$ %s:$ %3d/%d\n" $${out%@test} $$success $$total ; else printf "\n[FAIL]$ %s:$ %3d/%d\n" $${out%@test} $$success $$total ; return 1 ; fi
  description = <test> ${out}
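
For readability, the one-line command of this rule can be unrolled roughly as follows. This is only a sketch: the function name `report_results` is made up for illustration, colours are left out, and it assumes (as the use of `wc -c` for the total suggests) that the aggregated file holds exactly one character per tested file, with no newlines.

```shell
# Readable equivalent of the test-results command (illustrative sketch).
# $1: aggregated result file, one character per tested file:
#     '0' for success, anything else for failure (assumed: no newlines).
# $2: label to print, e.g. "tests".
report_results() {
  results_file=$1
  label=$2
  # Delete everything except '0' characters, then count what is left:
  # this is the number of files whose tests all passed.
  success=$(tr -cd 0 < "$results_file" | wc -c)
  # Every character stands for one file, so the byte count is the total.
  total=$(wc -c < "$results_file")
  if [ "$success" -eq "$total" ]; then
    printf '\n[PASS] %s: %3d/%d\n' "$label" "$success" "$total"
  else
    printf '\n[FAIL] %s: %3d/%d\n' "$label" "$success" "$total"
    return 1
  fi
}
```

Note that the counts are per file, not per test, which is why the summary reads 5/5 even when the files contain many more tests.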

⇒ A better way to handle this would be to implement an internal clerk report subcommand (we already have clerk runtest in this category) that does the reporting more cleanly in OCaml and gets called by this rule.

At the moment, though, we don't have the information about how many tests there are in each file. Testing proceeds in 4 steps:

1. run clerk runtest on a file to generate the output file (whatever the number of tests in it, it will just run catala that many times)
2. diff the original file and the output file to
   - determine success or failure
   - print the diff in case of failure

   This is done by the post-test ninja rule. The output code (0 or 1) is written to a filename@test file for tracing failures.
3. gather test results for final reporting. Again, this is done by a very simple ninja rule that works on directories and just concatenates the @test files recursively
4. count successes/failures in the resulting file in the top directory and report (that's what the ugly rule above does)
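
Steps 2 and 3 above can be approximated by the shell sketch below. It is an illustration only: `check_file` and `gather` are hypothetical names, the real work is done by ninja rules, and the "output" file stands in for whatever clerk runtest generates in step 1.

```shell
# Step 2 (sketch): diff the reference against the generated output,
# print the diff on failure, and record the status in a "<name>@test" file.
check_file() {
  reference=$1
  output=$2
  marker=$3   # e.g. "_build/tests/foo.catala_fr@test" (hypothetical path)
  # diff exits with 0 when the files are identical and prints nothing;
  # otherwise it prints the unified diff and exits with a non-zero status.
  if diff -u "$reference" "$output"; then
    printf 0 > "$marker"
  else
    printf 1 > "$marker"
  fi
}

# Step 3 (sketch): concatenate every @test marker found under a directory.
# The result goes to stdout; redirect it to a file outside "$1" so that
# the aggregate is not picked up by find itself.
gather() {
  find "$1" -type f -name '*@test' -exec cat {} +
}
```

Because each marker holds a single character, the aggregate only says how many files passed or failed; which tests inside a file failed is lost at step 2.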

The "generating output + then diffing" scheme has the merit of being simple and decoupling things well; but if we want finer reporting on individual tests within the same file, we'll have to reimplement diffing directly in clerk runtest and merge these steps together, so it's not a trivial change. Adding more information to the intermediate @test files wouldn't be difficult, though, once we can use OCaml to process them.

A quick placeholder could be to count the hunks in the patch, but that will always be very approximate.
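
A minimal sketch of that hunk-counting placeholder, assuming the patch is a unified diff as in the failure output shown earlier (`count_hunks` is a hypothetical helper name):

```shell
# Count the hunks of the unified diff between two files.
# In a unified diff, every hunk starts with a header line beginning
# with "@@"; content lines are always prefixed with ' ', '+' or '-'.
count_hunks() {
  # grep -c prints the number of matching lines. It exits with 1 when
  # that number is 0, hence the "|| true" to keep the function's status 0.
  diff -u "$1" "$2" | grep -c '^@@' || true
}
```

The count is only a rough proxy: two failing tests close together can end up in a single hunk, and one failing test can produce several hunks.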

Reporting diff

when there is a test failure, the output should be a clean listing of all tests that have failed, grouped by file, and not just the first test that failed.

This, on the other hand, is expected to already be the case. Could you describe the bug in more detail if you find that it is not? (Well, it would be the diff of each file that contains failed tests, but it should be fairly close, and maybe more concise.)

AltGr commented 3 months ago

Conclusions of a short discussion with @denismerigoux:

- implement clerk report 👍🏿
- add diffing capabilities to clerk runtest, and:
  1. report a one-line message right away on individual test failure
  2. store a more detailed report (for each file: list of tests, their command line, and the corresponding local diff on failure)
- per-directory gathering of test results should probably just list the report files at this point; clerk report will read them individually
- an advanced clerk report will leverage these detailed reports to list tested files in a predictable order, and provide several verbosity levels (from a total count of failures/tests/files to a detailed list of tests per file and their individual status)
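
To make the target output concrete, here is a rough shell sketch of the kind of summary clerk report could produce from per-file reports. Everything about it is an assumption made for illustration: the report format (one line per test, either "PASS <test>" or "FAIL <test>"), the function name `summarise_reports`, and the output layout. The actual subcommand is meant to be written in OCaml and its report format is not decided here.

```shell
# Summarise hypothetical per-file report files passed as arguments.
# Assumed line format (not clerk's actual one): "PASS <test>" or "FAIL <test>".
summarise_reports() {
  awk '
    FNR == 1     { files++ }            # first line of each non-empty report
    $1 == "PASS" { passed++; total++ }
    $1 == "FAIL" {
      total++
      # Remember files in the order their first failure is seen.
      if (!(FILENAME in failed)) order[++nfailed] = FILENAME
      # The test name is everything after the 5-character "FAIL " prefix.
      failed[FILENAME] = failed[FILENAME] "  " substr($0, 6) "\n"
    }
    END {
      # Failed tests, grouped by report file.
      for (i = 1; i <= nfailed; i++)
        printf "%s:\n%s", order[i], failed[order[i]]
      status = (passed == total) ? "PASS" : "FAIL"
      printf "[%s] %d/%d tests across %d files\n", status, passed, total, files
      exit (passed != total)
    }
  ' "$@"
}
```

Called on the report files in a fixed order (for instance sorted by name), this lists failures grouped by file and ends with a line of the form "N/M tests across K files", with a non-zero exit status when any test failed.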


Improve the output of clerk test #623
