JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.71k stars 5.49k forks source link

Effects on master are influenced by generation of coverage data #49978

Open Seelengrab opened 1 year ago

Seelengrab commented 1 year ago

Found while debugging the failures of FieldFlags.jl on nightly - the tests run fine on 1.9, haven't yet tried to bisect. Since it's not (yet) through its registration period, you'll have to ]add the github repo directly for reproducing.

Either way, the failures are reproducible with

julia --color=yes --project=. -e 'import Pkg; Pkg.test(;coverage=true)'

but not with

julia --color=yes --project=. -e 'import Pkg; Pkg.test(;coverage=false)'

which make the tests run through. I'll see if I can minimize the failure a bit more.

EDIT: For posteriority, a smaller MWE without dependencies:

[sukera@tower ~]$ julia --code-coverage -q
julia> h(x) = iseven(x) ? "string" : 1
h (generic function with 1 method)

julia> g() = h(2)
g (generic function with 1 method)

julia> Base.infer_effects(g,())
(+c,!e,+n,+t,+s,+m,+i)
[sukera@tower ~]$ julia -q
julia> h(x) = iseven(x) ? "string" : 1
h (generic function with 1 method)

julia> g() = h(2)
g (generic function with 1 method)

julia> Base.infer_effects(g,())
(+c,+e,+n,+t,+s,+m,+i)
oscardssmith commented 1 year ago

this is annoying but intentional. the problem is that asking for code coverage means that code is not effect free (because with code coverage on, deleting a function who's output is not used has the visible effect of changing the coverage number.

Seelengrab commented 1 year ago

:/

Does that mean more or less everyone relying on some effects to be there needs to run their code twice, once with coverage and once without? Or should I not test effects to this extent (would be a shame - I need those guarantees for AVR)?

It's not a long-running testsuite, but here's the smallest reproducer I have for now:

using FieldFlags
using Test

@bitflags struct JETStruct
    a
    _
    b
end

foldableAccess(j::JETStruct) = j.a

effects = Base.infer_effects(foldableAccess, (JETStruct,))
display(effects)
@test Core.Compiler.is_foldable(effects)

with outputs being

[sukera@tower FieldFlags.jl]$ julia --code-coverage -q --project=. repr.jl 
(+c,!e,+n,+t,+s,+m,+i)
Test Failed at [...]
  Expression: Core.Compiler.is_foldable(effects)

ERROR: LoadError: There was an error during testing
[...]
[sukera@tower FieldFlags.jl]$ julia -q --project=. repr.jl 
(+c,+e,+n,+t,+s,+m,+i)
Seelengrab commented 1 year ago

The PR introducing this change is https://github.com/JuliaLang/julia/pull/48598, if I'm not mistaken, which only taints the effect because otherwise the optimizer would remove the dead IR. Is it possible to implement code coverage differently, so that effect testing is not influenced by coverage? Or is there some other way I can test that these functions are foldable?

Seelengrab commented 1 year ago

If anything, I'd argue that tainting :effect_free here is incorrect - the method doesn't have externally visible sideeffects in the program (unless the program is inspecting its own coverage output, that is). Feels a bit similar to the exception that's made for elapsed time - technically observable, but taking it into account means we can't optimize anything.

oscardssmith commented 1 year ago

at the heart of this is a question of what coverage should measure. if coverage is meant to measure code as written, the compiler must not be allowed to change the code run in certain ways. we could probably add a coverage option that allows these options, but it is somewhat unclear what would then be measured.

Seelengrab commented 1 year ago

Here's a smaller MWE, which doesn't need FieldFlags.jl:

[sukera@tower ~]$ julia --code-coverage -q
julia> h(x) = iseven(x) ? "string" : 1
h (generic function with 1 method)

julia> g() = h(2)
g (generic function with 1 method)

julia> Base.infer_effects(g,())
(+c,!e,+n,+t,+s,+m,+i)
[sukera@tower ~]$ julia -q
julia> h(x) = iseven(x) ? "string" : 1
h (generic function with 1 method)

julia> g() = h(2)
g (generic function with 1 method)

julia> Base.infer_effects(g,())
(+c,+e,+n,+t,+s,+m,+i)

This doesn't make sense and is incosistent, as @code_warntype happily reports that this returns a constant, even in the covered mode:

[sukera@tower ~]$ julia --code-coverage -q
julia> h(x) = iseven(x) ? "string" : 1
h (generic function with 1 method)

julia> g() = h(2)
g (generic function with 1 method)

julia> @code_warntype g()
MethodInstance for g()
  from g() @ Main REPL[2]:1
Arguments
  #self#::Core.Const(g)
Body::String
1 ─ %1 = Main.h(2)::Core.Const("string")
└──      return %1

If the tainting were real/proper, this ought to infer Union{Int, String} (which doesn't make sense to do, because the compiler clearly has enough information to propagate the constants).

KristofferC commented 1 year ago

If one of the "effects" of the function is to emit coverage data, then evaluating it at compile time would change that behavior, no?

This doesn't make sense and is incosistent, as @code_warntype happily reports that this returns a constant, even in the covered mode:

This is with coverage:

julia> @code_llvm  g()
;  @ REPL[2]:1 within `g`
define nonnull {}* @julia_g_925() #0 {
top:
  %0 = alloca [8 x i8], align 8
  %lcnt = load volatile i64, i64* inttoptr (i64 105553177974792 to i64*), align 8
  %1 = add i64 %lcnt, 1
  store volatile i64 %1, i64* inttoptr (i64 105553177974792 to i64*), align 8
  %2 = call { {}*, i8 } @j_h_927([8 x i8]* noalias nocapture noundef nonnull %0, i64 signext 2) #0
  %3 = extractvalue { {}*, i8 } %2, 0
  ret {}* %3
}

and without

julia> @code_llvm g()
;  @ REPL[2]:1 within `g`
define nonnull {}* @julia_g_477() #0 {
top:
  %0 = alloca [8 x i8], align 8
  %1 = call { {}*, i8 } @j_h_479([8 x i8]* noalias nocapture noundef nonnull %0, i64 signext 2) #0
  %2 = extractvalue { {}*, i8 } %1, 0
  ret {}* %2
}
Seelengrab commented 1 year ago

"emitting coverage data" is a bad effect in terms of side-effect freeness though, in particular when it puns on "is constant foldable". The reason being that, for the code that we're inspecting with coverage data, the coverage data is not a sideeffect - it's only a behavior changing sideeffect for the testsuite.

Seelengrab commented 1 year ago

In particular, the example I gave in the OP is a longer version of this MWE:

[sukera@tower ~]$ julia --code-coverage -q
julia> struct Foo end

julia> Base.getproperty(f::Foo, s::Symbol) = s === :foo ? "foo!" : getfield(f, s)

julia> f(x::Foo) = x.foo
f (generic function with 1 method)

julia> Base.infer_effects(f,(Foo,))
(+c,!e,+n,+t,+s,+m,+i)

julia> code_warntype(f,(Foo,))
MethodInstance for f(::Foo)
  from f(x::Foo) @ Main REPL[3]:1
Arguments
  #self#::Core.Const(f)
  x::Core.Const(Foo())
Body::String
1 ─ %1 = Base.getproperty(x, :foo)::Core.Const("foo!")
└──      return %1

While the code in the OP doesn't return a Core.Const, this example does - and illustrates the issue nicely. Every part necessary for getproperty to infer a Core.Const is there, and it does do so - yet the function can't constant fold due to the effects introduced by --code-coverage, even though the only thing that could change the effects, the passed in Foo, is not relevant in the branch where the constant is ultimately inferred.

jakobnissen commented 1 year ago

In general, it's not a good idea to have the results of your code rely on certain optimisations that may be disabled when testing. Think of forced boundschecks, and debug mode(s) in some languages. Being able to add additional safety features during testing is a good thing

KristofferC commented 1 year ago

yet the function can't constant fold due to the effects introduced by --code-coverage

With apologies for repeating myself, but if the code is supposed to track every time it is called, how could it be valid to evaluate its result at compile time? Then that would change the output in the coverage files?

vtjnash commented 1 year ago

It is actually an interesting question though, once we expose infer_effects to users in an API, do we try to make it so that the code path taken is not impacted, and so you can still test code coverage without it bypassing the code being tested?

gbaraldi commented 1 year ago

I guess we could make it an invisible effect like a GC allocation? Though that doesn't solve the fact that it shouldn't concrete eval/fold, because it means you might get wrong coverage results

oscardssmith commented 1 year ago

I think a lot of the problems here could be resolved by adding a separate effect for coverage rather than tainting the existing effects. The only reason the change of effects is causing problems is because people want to test that their code has good effects (which for example will be needed for compiling to AVR). If we separated the coverage effect, it would be possible for people to test that their code will work as expected when deployed (without coverage).

Seelengrab commented 1 year ago

Though that doesn't solve the fact that it shouldn't concrete eval/fold, because it means you might get wrong coverage results

Here's the rub - the fact that the compiler is able to constant fold some call means it had enough information to find some path through the call, otherwise it could not have folded the call into a constant. So the compiler must have perfect knowledge about coverage already, since it knows which path it took to arrive at the constant - were that not the case, it would not have been able to fold the call into a constant (and even without being able to fold a call into a constant, it must be able to give some path/branch coverage after constant propagation, due to eliminating some error paths like in the example of getproperty).

It is actually an interesting question though, once we expose infer_effects to users in an API, do we try to make it so that the code path taken is not impacted, and so you can still test code coverage without it bypassing the code being tested?

Code coverage is a very interesting field, with lots of different variations. The following is loosely inspired by some literature I stumbled across years ago - I'll have to see if I can find it again. The ideas are definitely not my own, though I believe the julia specific application is.

Generally, each of the following list has either full, partial, or no coverage. Full coverage means "every variation of this is covered by some test code", partial coverage means "some variation of this is covered by some test code" and no coverage means "no variation of this is covered by some test code".

Julia currently only reports (fuzzy) line coverage, which is extremely easy to reach with a bit of effort (for example, the package I mentioned above has 100% coverage, yet I only consider it moderately well tested because julia doesn't report functions that aren't called at all as uncovered). So from my POV, "making coverage better" means being able to emit more granular information about what isn't covered, thereby reducing reported coverage percentages (because we're moving down the list of coverage guarantees). To get some of these levels, our source code needs to at least track columns as well as lines, otherwise we only get a subset of expressions, branches or paths that could be covered (think ternary expressions, for example).

So what does this have to do with coverage being an observable sideeffect or not? Well, coverage is a diagnostic global property of a program, not a property inherent to a function. Let me take the example from above again:

h(x) = iseven(x) ? "string" : 1

g() = h(2)

A priori, asking whether any part of this code is "covered" doesn't really make sense - there is no call to either g or h at the top level, and nothing is executed. So nothing can be covered because nothing runs that could produce coverage data. Now, the argument brought above is that the compiler is free to evaluate h(2) at compile time and thus the question is, should it produce coverage data for that call? My answer to that is "yes, but it doesn't necessarily need to report that as true coverage of either g or h, because no call causing the h(2) call actually occured in the (nonexistent) testsuite". The thinking is this: since there is no top level call to g(), whether or not h(2) is covered is unknown - the compiler can already compute the coverage data during constant folding (it found a valid path to do that in the first place after all), but it must not report it to the user as "this branch is covered", because that would leak the implementation detail/optimization of constant folding.

One solution is to of course disable constant folding when emitting coverage, but that then implies the issue about whether or not constant folding is part of observable API at some point, and how to test for that if it's actually required for some other reason (like compiling away an abstraction to build a nice API, which julia is extremely good at - it would be a shame not to be able to test for that!). This is likely sufficient to fix the immediate issue reported here, though doesn't help with getting better/more granular coverage reports overall.

Another solution is to keep track of coverage data (in however granular form we want to [1]), but only "merge" it with the parent call when actually requested/a top level call is done that inquires about the coverage of its subexpressions. An additional benefit of this approach is that we can ask the opposite question - where do I need to write more tests to increase coverage, by making top level calls that would cover more subexpressions that are not yet covered in some capacity by another call?

I think this can actually already work with abstract interpretation by keeping track of covered line data through calls & expressions, bubbling coverage data upwards. A function is fully covered if all its lines/subexpressions are covered - if some are covered it's partially covered and if none or it's not covered (you get the idea). Admittedly I don't have an implementation of that idea. The difficulty of course is that there's quite a lot of additional metadata one would like to get access to when asking about coverage - it's not impossible to think that julia could directly produce a "coverage percentage" (how bad of a metric that is) itself, using that method. A similar idea could be applied for tracking a lowerbound for how much memory a given call needs, though this is obviously offtopic and out of scope for this discussion.

Ah well, this should be its own seperate discussion, shouldn't it? I have some more points on coverage (in particular about a nasty if rand() < 0.5 example @oscardssmith brought up on slack), but those are definitely not appropriate for this issue.


[1]: e.g. for constant foldable expressions julia must already have perfect knowledge of line, expression, branch, path and input/output space coverage, because it was able to execute the code at compile time! This doesn't necessarily imply that the result of that is full coverage, but does imply partial coverage for all of these.

Seelengrab commented 1 year ago

people want to test that their code has good effects (which for example will be needed for compiling to AVR)

Of note, this is not technically a requirement for compiling to AVR in general, but it does make things a whole lot easier. I really can't have dynamic dispatch for now, so being able to know that from a run of JET.jl instead of through an error in GPUCompiler (which is missing the crucial information of where that dispatch occured) is a godsend. Being able to test for that seperately is even better, because FieldFlags.jl isn't targeting AVR in particular at all! It's just a package that happens to be good in that environment as well :)

Seelengrab commented 1 year ago

Here's another cool example/argument for why having coverage influence optimization is a bad thing: https://shapr.github.io/posts/2023-07-30-goldilocks-property-tests.html

I'd love to port this approach to PropCheck.jl, but without being able to reflect on coverage data in a safe way (without influencing the code I'm reflecting on), I can't really do this.

aplavin commented 4 months ago

Got affected by this as well, by seeing test run results different between local run (]test, without coverage by default) and on github actions (julia-runtests, with coverage by default). Would've impossible to debug if not for @IanButterworth and @Seelengrab suggesting (on slack) that the difference is due to coverage.

Basically, this test https://github.com/JuliaAPlavin/FlexiJoins.jl/blob/00000000ca2fe5eace0b881caa8d2258ca6154ee/test/runtests.jl#L407 fails for some examples when coverage is enabled, while the whole suite passes without coverage. Also, everything passes on Julia 1.9 no matter coverage on/off, it's only 1.10+ that broke it.

Maybe some coverage-related warnings should be added to docs somewhere (julia/pkg/github actions/...)? And also run PkgEval without coverage I guess.

nsajko commented 4 months ago

IMO the problem isn't specific to coverage, because various compilation options may influence effects, --compile=min, for example.

So I guess the real problem here is that test suites assume that the build is a production build (guaranteeing good effects, e.g.), even though it doesn't have to be. Not sure what's the best way to solve this in the ecosystem, but in the FixedSizeArrays.jl PR linked above I approached this by introducing an environment variable that's only used for the test suite.

IanButterworth commented 4 months ago

If a user is adjusting --compile=min etc. then they're likely aware of the variability in effects. I think julia-runtest defaulting to coverage being on is much more likely to cause confusion here.

I think it might be reasonable for tooling like @allocated etc. to @warn maxlog=1 if they are used in the presence of code coverage being turned on for that particular file.