Closed devmotion closed 2 years ago
Is it possible to test this?
Merging #452 (4c45fe8) into master (35de8d2) will decrease coverage by 0.01%. The diff coverage is 100.00%.
@@ Coverage Diff @@
## master #452 +/- ##
==========================================
- Coverage 93.18% 93.16% -0.02%
==========================================
Files 52 52
Lines 1261 1259 -2
==========================================
- Hits 1175 1173 -2
Misses 86 86
Impacted Files | Coverage Δ
---|---
src/TestUtils.jl | 94.11% <100.00%> (ø)
src/basekernels/matern.jl | 100.00% <100.00%> (ø)
Continue to review full report at Codecov.
Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 35de8d2...4c45fe8. Read the comment docs.
The rule was wrong, so it seems it's not covered by the tests? Why was this workaround introduced?
It seems in #425 the commented-out `test_ADs(x->MaternKernel(nu=first(x)),[ν])` was replaced with `test_ADs(() -> MaternKernel(; nu=ν))`. Was this intended? And could this explain why the workaround is not tested?
> It seems in #425 the commented-out `test_ADs(x->MaternKernel(nu=first(x)),[ν])` was replaced with `test_ADs(() -> MaternKernel(; nu=ν))`. Was this intended? And could this explain why the workaround is not tested?
Yes, that was intended, as discussed on that PR (https://github.com/JuliaGaussianProcesses/KernelFunctions.jl/pull/425#issuecomment-1012036096), where @theogf wrote:

> I am proposing that we just ban differentiation through \nu and stop testing for it.
And while the change

```diff
-test_ADs(x->MaternKernel(nu=first(x)),[ν])
+test_ADs(() -> MaternKernel(; nu=ν))
```

worked for Forward- and ReverseDiff, the additional workaround in #425 was required for Zygote to also pass the AD test (w.r.t. the input features, not the kernel hyperparameters!). So the workaround is tested by virtue of the Zygote AD test on MaternKernel now also passing.
The main problem is that tests passed successfully for https://github.com/JuliaGaussianProcesses/KernelFunctions.jl/pull/452/commits/08437ccc44fa134b86eb447f59d0bdfe651d6acd even though the pullback is clearly incorrect. So I was wondering if the pullback was evaluated at all.
> The main problem is that tests passed successfully for 08437cc even though the pullback is clearly incorrect.
What was incorrect about the pullback?
> So I was wondering if the pullback was evaluated at all.

It was; I only added the workaround because without it Zygote errored, see https://github.com/JuliaGaussianProcesses/KernelFunctions.jl/pull/452#discussion_r849550385
> What was incorrect about the pullback?

It returned the wrong number of outputs (it was missing the `NoTangent()` derivative for the function itself).
Without this workaround, the Zygote AD test fails. You can see this in 30385ac and the corresponding build.
The problem is just that https://github.com/FluxML/Zygote.jl/blob/95e10218f0187a839ac0539fefcbf39c727bf068/src/lib/array.jl#L30 does not support `NotImplemented`. This should have been fixed in Zygote; introducing workarounds in KernelFunctions is the wrong approach.
> This should have been fixed in Zygote; introducing workarounds in KernelFunctions is the wrong approach.
Yeah, I completely agree, but unfortunately it hasn't been fixed in Zygote, and I think "workaround + AD is working" is still preferable over "no workaround + AD is broken".
Instead of making a PR to KernelFunctions introducing a workaround you could have made one for Zygote that fixes the issue :thinking:
> Instead of making a PR to KernelFunctions introducing a workaround you could have made one for Zygote that fixes the issue :thinking:
I managed to figure out how to fix the issue with a workaround in KernelFunctions. That doesn't mean I would immediately know how to identify and fix the underlying issue upstream. If I did know how to fix it there, I would have, as I have in various other cases. You seem to have an amazing overview as well as a deep understanding of a lot of different packages. The ability and time for that is unfortunately not something that everyone has available in the same amount. I'm assuming you're not intending to, but you give the impression of being rather disparaging of other people who are not thinking the same way you are.
IMO usually it is very helpful to open at least Github issues in cases where it is not immediately clear how to fix the issue. Workarounds can be acceptable if they are fixed upstream eventually, but I assume it is unlikely that they are fixed if nobody makes the Zygote developers aware of it.
Generally, I would have given you this advice in your PR but as desired I don't review your PRs anymore.
@devmotion I've pared it down to a minimal reproducible example: https://github.com/FluxML/Zygote.jl/issues/1204, but I can't comment on anything else. If you've got further thoughts on what the issue is / what's needed to fix it, please comment there.
Test errors seem unrelated? I didn't change anything related to them, I think.
> Test errors seem unrelated? I didn't change anything related to them, I think.
Agreed. Maybe we need to relax a tolerance somewhere?
> Test errors seem unrelated? I didn't change anything related to them, I think.
>
> Agreed. Maybe we need to relax a tolerance somewhere?
Yes, I think it's due to the changes in #447; the tests are stochastic (if you just re-run the failed tests, they might pass the next time; I've seen that in #454 as well, which has zero code changes).

In #447 I reduced the size of the kernel matrix evaluations. The `isapprox()` call (set with `atol=1e-9, rtol=1e-9`) is on the arrays as a whole, which does not look at individual elements but instead checks whether the norm of the difference is within the tolerances, relative to the max of the norms of the individual matrices (see discussion). For the random inputs we're using in the tests, the norm of the kernel matrix is linear in the dimensions. So by reducing the size from ~1000 to ~10 we're reducing the norm by a factor of ~100, so yes, that suggests to me we should increase the `rtol` correspondingly.
Alternatively, we could change it to an element-wise `isapprox` comparison? Then it's independent of the matrix norm.
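To make the distinction concrete, here is a small sketch. The thread is about Julia's `isapprox`; `isapprox_arrays` below is a Python approximation of my reading of its norm-based array semantics (a single check on the norm of the difference, scaled by the larger operand norm), not the actual Base implementation. `numpy.allclose` stands in for the element-wise alternative:

```python
import numpy as np

def isapprox_arrays(A, B, atol=0.0, rtol=0.0):
    # Norm-based comparison: one check on the norm of the difference,
    # scaled by the larger of the two operand norms.
    scale = max(np.linalg.norm(A), np.linalg.norm(B))
    return np.linalg.norm(A - B) <= max(atol, rtol * scale)

A = np.ones((100, 100))  # Frobenius norm = 100
B = A.copy()
B[0, 0] += 1e-8          # a single element is off by 1e-8

# Element-wise comparison rejects the outlier element...
print(np.allclose(A, B, atol=1e-9, rtol=1e-9))      # False
# ...while the norm-based check accepts it: the error is compared
# against rtol * norm(A) = 1e-9 * 100 = 1e-7.
print(isapprox_arrays(A, B, atol=1e-9, rtol=1e-9))  # True
```

This also shows why the norm-based tolerance effectively scales with the matrix size, while the element-wise one does not.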
Shouldn't the scale be irrelevant if we use `rtol`? I think it's problematic that we don't use `rtol` in most of the tests but only in https://github.com/JuliaGaussianProcesses/KernelFunctions.jl/blob/8e805ef25d745ab391e4ad424e690507e02cd042/src/test_utils.jl#L75-L76. Probably because otherwise tests failed, which IMO would not indicate a problem with `rtol` per se but that the default was already too low even before the recent changes. I completely agree with https://discourse.julialang.org/t/unit-test-equality-for-arrays/8138/12?u=devmotion that in general you want `rtol` since it is completely unclear what a meaningful `atol` is in general. Specifying a small `atol` as well mainly helps with comparisons that involve exact 0s, which would fail otherwise.
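The exact-zero point can be illustrated with a minimal sketch. The scalar `isapprox` below is a hypothetical Python stand-in for the comparison rule `|x - y| <= max(atol, rtol * max(|x|, |y|))`; the `1e-8` default `rtol` is an illustrative choice, not Julia's actual default:

```python
def isapprox(x, y, atol=0.0, rtol=1e-8):
    # Relative tolerance is scaled by the operand magnitudes, so it
    # can never accept a comparison against an exact zero on its own.
    return abs(x - y) <= max(atol, rtol * max(abs(x), abs(y)))

print(isapprox(0.0, 1e-12))             # False: threshold is rtol * 1e-12
print(isapprox(0.0, 1e-12, atol=1e-9))  # True: atol absorbs near-zero noise
```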
> I think it's problematic that we don't use `rtol` in most of the tests
Yes, I agree. Probably an oversight in the original test code of not realizing that by only setting `atol` you end up setting `rtol=0`...
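A sketch of that pitfall, in Python: the default-`rtol` rule below is my reading of Julia's `Base.isapprox` docstring, where `rtol` defaults to `√eps` only when `atol` is zero, so specifying `atol` alone silently disables the relative check:

```python
import math

def isapprox(x, y, atol=0.0, rtol=None):
    # Mimics (under my reading) Julia's scalar isapprox defaults:
    # rtol falls back to sqrt(eps(Float64)) only when atol == 0.
    if rtol is None:
        rtol = 0.0 if atol > 0 else math.sqrt(2.0**-52)
    return abs(x - y) <= max(atol, rtol * max(abs(x), abs(y)))

print(isapprox(1e6, 1e6 + 1e-3))             # True: default rtol ~ 1.5e-8
print(isapprox(1e6, 1e6 + 1e-3, atol=1e-9))  # False: atol > 0 zeroes rtol
```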
Apparently https://github.com/JuliaGaussianProcesses/KernelFunctions.jl/pull/455 was not sufficient.
Bump.
Bump.
Could we keep the global atol and rtol variables?
I don't mind. It just seemed unnecessary to have global variables that are used only in a single place, and clearer (ie easier to spot) if the values are just used as default keyword arguments in the function definition.
@theogf Did you see my last comment? Would you like it to be reverted?
I think keeping the global const is a bit more elegant and easily findable since it's in the same file. But again, I do not mind; I'll just accept the PR and do what you want 😅
Fixes the bug introduced in #425 (see https://github.com/JuliaGaussianProcesses/KernelFunctions.jl/issues/450).
Removes the workaround.