Seelengrab commented 1 year ago

Code Coverage

Code coverage is a very interesting field, with lots of different variations. The following is loosely inspired by some literature I stumbled across years ago - I'll have to see if I can find it again. The ideas are definitely not my own, though I believe the julia specific application is.

Note: This issue was originally a comment in this issue, but seeing as this is a more general wishlist/thing that would be nice to have, I thought it would be a good idea to track this seperately. This issue is not exactly the same as the linked comment though, so please do read through this thoroughly! If you think this is more appropriate as a discussion instead of an issue, please do feel free to convert this issue into one.

Motivation/Background

The larger motivation for this is in part that code coverage in Julia has historically had quite a lot of issues, ranging from tracking wrong source lines, over reporting wrong coverage even though a line must have been passed through to changing effects of called functions and inhibiting optimizations. This seems unnecessary, and especially problematic when looking at other languages and how well these kinds of software engineering tools help inform & establish good development practices.

Generally, each of the following list has either full, partial, or no coverage. Full coverage means "every variation of this is covered by some test code", partial coverage means "some variation of this is covered by some test code" and no coverage means "no variation of this is covered by some test code".

Line coverage
- Is this line of source code covered by some test/actual execution?
Expression coverage
- Is this expression (and its subexpressions) covered by some test/actual execution?
Branch coverage
- Is this branch covered by some test/actual execution?
Path coverage
- Is a given path through the program, from call to return, covered by some test/execution?
- This is notably different from branch coverage - there are situation where you can test each individual branch in isolation and reach full branch coverage, but don't have full path coverage, which can hide problems. For example, think of a function with two if/else blocks following one another, without being nested. You can reach full branch coverage with just 2 inputs, but path coverage requires at least 4 in the general case, hitting every combination of if and else block.
- This is usually sufficient and can be reached by segmenting the input/output space into equivalence classes for the purpose of testing.
Input/Output space coverage
- Is a given subregion of the full input argument space covered by some test/execution?
- Reaching full input/output space coverage shouldn't be the goal of any serious software project; that's just wasting CPU cycles (you can do it on some small functions taking a small number of arguments, like a single UInt32 for example)
- This is mostly achievable for small types and types with fixed number of known values, like enums.
The list is ordered from "least trustworthy" to "most trustworthy", in terms of implications about correctness of the tested piece of code. If you have full coverage (I think sometimes also called "total" in the literature, but don't quote me on that) on one level of the list, it means that you must also have full coverage on the levels above. Partial coverage only implies partial coverage up the list (though full coverage is allowed - for example, it's extremely common to only have partial Input/Output space coverage, yet you reach full line coverage trivially). No coverage on any level means that you can't have coverage on any other level here either.

Julia currently only reports (fuzzy) line coverage, which is extremely easy to reach with a bit of effort (for example, the package FieldFlags.jl has 100% coverage, yet I only consider it moderately well tested because Julia doesn't report functions that aren't called at all as uncovered). So from my POV, "making coverage better" means being able to emit more granular information about what isn't covered, as well as actually reporting uncalled functions as uncovered, thereby reducing reported coverage percentages (because we're moving down the list of coverage guarantees). To get some of these levels, our source code needs to at least track columns as well as lines, otherwise we only get a subset of expressions, branches or paths that could be covered (think ternary expressions, for example - line coverage decidedly does not imply expression or branch coverage).

The way coverage (on master) works (or at last, as far as I can tell) is by tainting the :sideeffectfree effect of any called code, because when tracking coverage, optimizing code by deleting dead branches has the "sideeffect" of changing the observed coverage number. I don't think that should happen; any code that's not actively looking at coverage data must return the same values it did before (assuming the code is otherwise free of sideeffects and RNG and so on). Further, coverage is a diagnostic global property of a program, not a property inherent to a function; tracking it through tainting of :sideeffectfree (or even a different effect) clashes with the meaning of an effect. Here's an example:

h(x) = iseven(x) ? "string" : 1

g() = h(2)

A priori, asking whether any part of this code is "covered" doesn't really make sense - there is no call to either g or h at the top level, and nothing is executed. So nothing can be covered because nothing runs that could produce coverage data. There is an argument to be made that the compiler is free to evaluate h(2) at compile time and thus the question is, should it produce coverage data for that call? My answer to that is "yes, but it doesn't necessarily need to report that as true coverage of either g or h, because no call causing the h(2) call actually occured in the (nonexistent) testsuite". The thinking is this: since there is no top level call to g(), whether or not h(2) is covered is unknown - the compiler can already compute the coverage data during constant folding (it found a valid path to do that in the first place after all), but it must not report it to the user as "this branch is covered", because that would leak the implementation detail/optimization of constant folding.

One solution is to of course disable constant folding when emitting coverage, but that then implies the issue about whether or not constant folding is part of observable API at some point, and how to test for that if it's actually required for some other reason (like compiling away an abstraction to build a nice API, which julia is extremely good at - it would be a shame not to be able to test for that in the future!). On top of this, disabling compiler optimizations in the presence of coverage tracking means that CI runs (which by default do enable coverage) must run for much longer (or twice, once with coverage and once without), simply because the code may have been engineered with performance & optimizations in mind.

Of course, code that is actively looking at coverage data should have its :consistent effect tainted; its behavior obviously must change as soon as the coverage data of the inspected function changes.

Potential solutions

Another solution is to keep track of coverage data (in however granular form we want to [1]), but only "merge" it with the parent call when actually requested/a top level call is done that inquires about the coverage of its subexpressions. An additional benefit of this approach is that we can ask the opposite question - where do I need to write more tests to increase coverage, by making top level calls that would cover more subexpressions that are not yet covered in some capacity by another call?

I think this can actually already work with abstract interpretation by keeping track of covered line data through calls & expressions, bubbling coverage data upwards. A function is fully covered if all its lines/subexpressions are covered - if some are covered it's partially covered and if none it's not covered (you get the idea). Admittedly I don't have an implementation of that idea. The difficulty of course is that there's quite a lot of additional metadata one would like to get access to when asking about coverage - it's not impossible to think that Julia could directly produce a "coverage percentage" (how bad of a metric that is) itself, using that method.

Nonetheless, these potential solutions are only conceptual at the moment, and may not be appropriate/a good way to implement a good coverage mechanism.

The Wishlist

So, in summary, these are the kinds of features I'd love to have with better coverage:

Expr based coverage reporting
Coverage reporting not influencing optimizations like dead code elimination, inlining, constant folding, semi-concrete evaluation etc.
Not tainting effects that are unrelated to code coverage
Exposing an API for querying coverage data dynamically
A clearer distinction between Not covered/partially covered/uncovered, as we currently only have a binary designator for that
Coverage reporting of functions that are not called in the codebase
Code coverage of code generated by macros
- This is likely very tricky, because a macro could produce entirely different code at every invocation. There would have to be some kind of double mapping/tracking of where an expression comes from, beyond just LineNumberNodes.

And these are the kinds of things these features would enable:

Coverage guided fuzzing, e.g. with PropCheck.jl or Supposition.jl
- PropCheck.jl currently just generates 100 test cases (Supposition.jl does 10_000) and then stops checking for new ones. Instead, there could be a mode that generates test cases until the function being fuzzed is sufficiently covered, providing much better fuzzing capabilities than a "blind shot" currently does.
Coverage reporting for guiding which functions need more tests
- Similarly to what PropCheck.jl/Supposition.jl could do, but for reporting which branches or even value regions are currently uncovered by tests. See e.g. https://ieeexplore.ieee.org/stampPDF/getPDF.jsp?tp=&arnumber=9152719 for a version of coverage guided fuzzers
Pretty visualizations!
Better error reporting
With dynamic coverage querying, it ought to be possible to have the following flow (or something like it, depending on how granular the data we get is):

julia> f(x::Int) = iseven(x) ? "foo" : "bar"

julia> coverage(f, (Int,), :line)
:uncovered # and similar for all other variants of coverage

julia> f(2)
"foo"

julia> coverage(f, (Int,), :line)
:full

julia> coverage(f, (Int,), :expr)
:partial

julia> f(1)
"bar"

julia> coverage(f, (Int,), :expr)
:full

julia> coverage(f, (Int,), :value)
:partial # because we can't claim full coverage if we only test two values out of all `Int`

Others

There may be others that arise in a discussion in the comments; I will add them to the list above.

Potential problems

`rand`

Part of this issue came out of a long discussion I've had with @oscardssmith on Slack, in particular in regards to examples where tracking coverage is tricky due to cases involving some RNG. Here are some points raised, and how I think they fit into this wishlis:

function f()
    x = y = 1
    if rand() < .5
        x = 5
    elseif x == 1
        y = 2
    end
    return x^y
end

In any given invocation of f, either branch could be taken, so any given invocation can only ever report partial coverage. However, that has no bearing on whether e.g. x^y is covered; it is always fully (:line and :expr) covered as soon as any call is made. Further, this function by itself already cannot be constant folded, because rand() is not :consistent; tracking coverage on top of that shouldn't change that. Nevertheless, I don't think this actually stands in the way of seperating coverage tracking from :effect_free tracking, because tracking coverage of f has no influence whatsoever on the behavior of f.

Effects of inspecting coverage

Inspecting coverage data about a function must certainly taint :consistent of the function/code/expression doing the inspecting, at least because the result of coverage can & should change dynamically (as in the REPL example above). However, this can be tricky for something like this:

# s being from `:line` etc.
function foo(s::Symbol)
    coverage(foo, (Symbol,), s)
end

What should the effects of this function be? Evidently, as soon as you call foo, you should always get :full coverage back. So it should be :consistent. However, before you make that call, the function has no coverage yet, so it cannot return :full and thus isn't :consistent. Arguably, such edge cases of self-inspection ought to just be disallowed entirely. It may be difficult to achieve that in practice, but since this is a diagnostic query, I think it's fair game to have it be slow.

Others

There may be others that arise in a discussion in the comments; I will add them to the list above.

[1]: e.g. for constant foldable expressions, Julia must already have perfect knowledge of line, expression, branch, path and input/output space coverage, because it was able to execute the code at compile time! This doesn't necessarily imply that the result of that is full coverage, but does imply partial coverage for all of these.

nsajko commented 1 year ago

Related: #50044?

Seelengrab commented 1 year ago

Possibly, yes! The question is how this can be integrated with the dynamic nature of coverage querying and how this then interacts with effects.

Krastanov commented 1 year ago

Concerning macros, it would be really nice to have some clarity on how LineNumberNodes are being used to track coverage. As a macro writer, I am happy to be told that it is my responsibility to annotate my generated code so that coverage is reported, but I have no idea how to do that currently.

I have some relatively trivial examples. See the coverage report below for instance. It is obvious that removing the LineNumberNodes makes coverage impossible to track (f_make_3) and that moving those nodes around makes the coverage report the wrong numbers (f_make_4)

        1 macro make_a_new_expr_2(expr)
        1     new_expr = Expr(:function, Expr(:call, :f_make_2), Expr(:block))
        1     for i in expr.args[end].args
        7         push!(new_expr.args[end].args, i)
        7     end
        1     new_expr
        - end
        - 
        1 macro make_a_new_expr_3(expr)
        1     new_expr = Expr(:function, Expr(:call, :f_make_3), Expr(:block))
        1     for i in expr.args[end].args
        7         if isa(i, LineNumberNode)
        4             continue
        -         end
        3         push!(new_expr.args[end].args, i)
        7     end
        1     new_expr
        - end
        - 
        1 macro make_a_new_expr_4(expr)
        1     new_expr = Expr(:function, Expr(:call, :f_make_4), Expr(:block))
        1     for i in expr.args[end].args
       10         j = quote $i end
        7         push!(new_expr.args[end].args, j)
        7     end
        1     new_expr
        - end
        - 
        1 @make_a_new_expr_2 function f_make_2()
        1     println("0 new expr 2")
        1     println("1 new expr 2")
        1     println("2 new expr 2")
        - end
        - 
        - @make_a_new_expr_3 function f_make_3()
        -     println("0 new expr 3")
        -     println("1 new expr 3")
        -     println("2 new expr 3")
        - end
        - 
        0 @make_a_new_expr_4 function f_make_4()
        0     println("0 new expr 4")
        0     println("1 new expr 4")
        0     println("2 new expr 4")
        - end
        - 
        - f_make_2()
        - 
        - f_make_3()
        - 
        - f_make_4()

However, there are more complicated macros in which I have verified that LineNumberNodes are preserved, but no coverage is even tracked (the cov file has dashes, not zeros) and I have no idea how to debug that. E.g. the ResumableFunctions.@resumable example below (notice that it reports -, not 0 - and I did verify that the generated code has LineNumberNodes present, so I would have assumed at worst it should report 0 (presumably "not executed"), not - (presumably "not tracked at all")):

        - using ResumableFunctions
        - 
        1 function control()
        1     a = rand()
        1     if a<0.5
        1         return 0
        -     end
        0     return 1
        - end
        - 
        - @resumable function test()
        -     data = rand(100)
        -     for d in data
        -         p = sin(d)
        -         @yield p
        -         data .*= p
        -     end
        -     return data
        - end
        - 
        - control()
        - collect(test())

JuliaLang / julia

[FR] A wishlist for code coverage features #50834

Code Coverage

Motivation/Background

Potential solutions

The Wishlist

Others

Potential problems

`rand`

Effects of inspecting coverage

Others