Improving code coverage for interpreted code

fingolfin commented 4 years ago

Short version: Coverage of interpreted and coded/executed code is not consistent with each other, and we should probably come up with some good rules for what we want to consider as "code" and what not before making any attempts to rectify this.

Long version: I noticed that PR #3839 reduced coverage, and started to investigate. According to Codecov, it removes about 6000 lines of code (in the sense that lines which were recognized as code before are no longer), most of which were considered as "executed", hence overall coverage actually drops.

I don't claim to have studied this in detail yet, but a major source of this seems to be that the end in an interpreted function() ... end expression now is generally not considered as code, while before, it usually (but not always!) was registered as such. One can see this by e.g. comparing the following two side by side:

If you look at line 283 (almost at the end of the file), you see an end that is not marked as code in both. And I think this gives a hint at what's going on, namely I believe there is a general deficiency in the way we track interpreted code lines right now: based on some experiments I made, my impression is that often a line is marked as code only "by accident", during the execution of the next "line of code" / next statement. I believe that is/was the case for the end here as well, which is why it wasn't covered when end was the last code in the file. But at this point, this is a conjecture, as I did not yet actually debug this deeper

Let me point out that it is not even clear to me whether this change is a bug or a feature; i.e., is the changed behavior a regression, or an improvement (or perhaps neither)? After all, the end doesn't really do anything, and so arguably is not "executed".

That said, it certainly is inconsistent with how coverage works in coded (not interpreted) code, as can be seen here, as end is marked as code there, both before and after:

Yet sometimes, end is marked as code even with this PR, see e.g. init.g:167, so that is a stroke against the behaviour in this PR whatever it is, it should at least be consistent). Of course, as my example above shows, things are not consistent wrt to end already in master.

And there are also things were coverage arguably improved with this PR, cf. these (and in addition, also compare the last line of the two files)

In the same file, we see a method whose body appears to be "covered" in master, but I think (but have not verified) that this actually is a fluke, caused by the statement being closed on the same line; this is visible as non-executed code with this PR (and I think this would be an argument for not covering lines as "code" if they only contain isolated keywords like if, end, );

There are more inconsistencies when looking at else, fi, function and probably more, e.g.:

function(...) is not marked as code, both with and without this PR, both in interpreted and coded sections
else is nor marked as code when coded, but is marked as code when interpreted; while fi is always marked as code: compare init.g:190 vs init.g:112

All in all, I think with the current approach for tracking interpreted code, we'll never quite resolve this; we need to use a more sophisticated approach for that, just hooking into Match is not enough. For example, if I write:

x
:=
# Comment
  1
  +
  2
  ;

Then which lines should be marked as code? I would argue all except for that with the comment, and the one with the semicolon: because the other each trigger some kind of computation, being the lookup of an identifier x (is it lvar/hvar/gvar/dvar?), executing an assignment, immediate constant 1, performing addition.

Of course this is highly artificial code, nobody would write it like that (I hope). My goal here is not to get coverage for such artificial code right, but rather to determine if we can come up with some rules what should be considered code and what not; only then can we try to implement those rules (the needs and issues of that might then force us to change those rules, but my point here is that until we agree what we want to be counted as code and what not, it's impossible to fully resolve this situation).

Here is another example:

if  # code
  cond  # code
then 
  return  # code
    42  # code
    ;
fi
;

I marked all lines I'd consider as "code". IMHO then, ; and fi are just "delimiters". This is supported by languages like Python or Lisp needing none of them. Although arguably Python substitutes then by : plus a newline; and Lisp uses parenthesis to group things; but my point stands that these devices simply exist to make clear what belongs where, but don't directly translate to instructions. Of course that's not a really rigorous argument, I am simply trying to develop some heuristics.... One reason for not wanting to treat fi, end, od, ) as code is that it avoids issues like in grpfp.gi:90 which is marked as covered but shouldn't be, and I suspect the ) ); on the same line is the culprit, though I am not sure.

Anyway, this line of thinking also justifies why local ...; is not marked as code. I am somewhat torn on function(...), but I am now leaning towards saying it should be marked as code, just like if and repeat and while.

Looking at coverage for our C code, we see that { and } are not marked as code there either, nor are simply variable declarations like int i. Note that the start of a function like static void InstallZeroMutObject ( Int verb ) (from ariths.c:137) is marked as code, which is consistent with my idea of treating function(...) as code, too.

Conveniently, these rules mean that PR #3839 is "correct" in the way it changes coverage :-) but that is not the motivation for my reasoning. I am absolutely willing to rework PR #3839 as necessary.

ChrisJefferson commented 4 years ago

I am interested in what people would like the output to be, and how unhappy they are with the current output. There are various things, which are various levels of difficult to implement, but it would be nice to start looking at what is actually wanted.

In the past my main concern was having lines which were impossible to turn green, because due to how GAP worked some lines ended up marked as "read" but never "executed". I believe these are all now fixed, but there is still plenty of weirdness.

fingolfin commented 4 years ago

@ChrisJefferson I agree that coverage data is reasonably good now (a big "thank you" to you for that!). But as outlined it could be better, and indeed the first step should be to "start looking at what is actually wanted", as you put it. Which is why I outlined several suggestions for that above. Would be great to hear your thoughts on that. Esp. the two short code snippets where I state which of them I'd like to see marked as code and which not. We can then argue about that ;-)

gap-system / gap

Improving code coverage for interpreted code #3841