Closed: Mikolaj closed this issue 2 years ago.
Ok, so the idea is to retain the CLI-configurable `IORef`-based epsilon but only use it with `@?~` and not with `assertClose`/`assertEqualUpToEps`. Correct?

Fine to keep the existing `class AssertClose` and its instances?
> Ok, so the idea is to retain the CLI-configurable `IORef`-based epsilon but only use it with `@?~` and not with `assertClose`/`assertEqualUpToEps`. Correct?

Correct. We might add an optional parameter to `assertEqualUpToEps`, but I'd vote for simplicity and requiring the parameter always.
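To make the division of labour concrete, here is a minimal sketch of how the two entry points could relate. The name `eqEpsilonRef` and the exact signatures are illustrative, not the library's actual code:

```haskell
import Data.IORef (IORef, newIORef, readIORef)
import System.IO.Unsafe (unsafePerformIO)
import Test.Tasty.HUnit (Assertion, assertBool)

-- Global epsilon, to be set from the --eq-epsilon CLI option in main.
eqEpsilonRef :: IORef Double
eqEpsilonRef = unsafePerformIO (newIORef 1e-6)
{-# NOINLINE eqEpsilonRef #-}

-- Explicit variant: the tolerance is always given at the call site.
assertEqualUpToEps :: Double -> Double -> Double -> Assertion
assertEqualUpToEps eps expected actual =
  assertBool ("expected " ++ show expected ++ " +/- " ++ show eps
              ++ ", got " ++ show actual)
             (abs (expected - actual) <= eps)

-- Operator variant: reads the CLI-configurable epsilon instead.
(@?~) :: Double -> Double -> Assertion
actual @?~ expected = do
  eps <- readIORef eqEpsilonRef
  assertEqualUpToEps eps expected actual
```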
> Fine to keep the existing `class AssertClose` and its instances?

I guess. But we'd need many more instances ^^^, basically whatever any hapless newbie may want to type, so perhaps all the existing instances can/should be reorganized, I don't know.
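For reference, the kind of instance proliferation meant here could look like the sketch below; the real `AssertClose` class may well be shaped differently:

```haskell
import Test.Tasty.HUnit (Assertion, assertBool)

-- Sketch of an AssertClose-style class: the scalar instance does the
-- actual comparison, compound instances just recurse.
class AssertClose a where
  assertCloseEps :: Double -> a -> a -> Assertion

instance AssertClose Double where
  assertCloseEps eps expected actual =
    assertBool ("expected " ++ show expected ++ " +/- " ++ show eps
                ++ ", got " ++ show actual)
               (abs (expected - actual) <= eps)

-- A real implementation would also check that the lengths agree.
instance AssertClose a => AssertClose [a] where
  assertCloseEps eps expected actual =
    mapM_ (uncurry (assertCloseEps eps)) (zip expected actual)

-- ... plus one such instance per tuple size a newbie might type:
instance (AssertClose a, AssertClose b) => AssertClose (a, b) where
  assertCloseEps eps (e1, e2) (a1, a2) = do
    assertCloseEps eps e1 a1
    assertCloseEps eps e2 a2
```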
Then I have a (stupid) question: why do we want to be able to do those approximate comparisons using hardcoded epsilon values, like here:

```haskell
assertEqualUpToEps (1e-10 :: Double)
  (grad foo (1.1, 2.2, 3.3))
  (2.4396285219055063, -1.953374825727421, 0.9654825811012627)
```

rather than relying on the CLI setting? Why do we want to have a multitude of different, hardcoded epsilon values scattered throughout the code, ignoring the CLI setting? What use case does this serve?
That's a perfectly legit question. @awf explains that the admissible epsilon depends on the code in question. Just as in the example in #64, you may have two different epsilons for two different code snippets. I imagine exponentiation may blow up any numeric error much more than addition does. Also, perhaps some code may be less numerically stable than another, but not catastrophically unstable.
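A toy illustration of that point (not taken from the codebase): the same 1e-10 input perturbation that addition passes through unchanged is multiplied by roughly exp 20 under exponentiation:

```haskell
main :: IO ()
main = do
  let x  = 20 :: Double
      dx = 1e-10
  -- Addition propagates the perturbation as-is: prints ~1.0e-10.
  print (abs ((x + dx + 1) - (x + 1)))
  -- exp amplifies it by a factor of exp x ~ 4.9e8: prints ~4.85e-2.
  print (abs (exp (x + dx) - exp x))
```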
I suppose it's also possible you have a requirement from a client that your results should not deviate from a table of known answers to sample problems by more than X. That could be handled with a command-line option, though not if there are many clients with different epsilons and you keep them all in one CI.
My concern (in addition to code pollution) is that this way we hardwire the rigidity of the test suite into the code, whereas using the existing CLI option this could easily be adjusted, without code changes and without recompiling.
Perhaps we need something else. An ability to express the epsilon for each check in relation to the CLI epsilon, for example.
Or another idea could be to collect all those resulting differences from all checks, printing a summary at the end, highlighting the biggest one, and failing if it is bigger than the CLI epsilon.
What do you think?
Not sure if I understand the last use case. First of all, our `--eq-epsilon` is implemented in `main` in each of the 3 test suites (`MinimalTest`, `ExtremelyLongTest` and `ShortTestForCI`). Any client would have to re-implement that, or rely on the default value (`1e-6`). Second, if you have multiple different clients with different epsilon requirements, why can't you just run them separately?
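For what it's worth, the per-suite wiring is small. A sketch of how such an option can be declared with Tasty's `IsOption` machinery (the names here are illustrative, not necessarily identical to the actual code in the test suites):

```haskell
import Data.Proxy (Proxy (..))
import Test.Tasty
import Test.Tasty.HUnit (testCase)
import Test.Tasty.Options

-- A CLI-settable epsilon, along the lines of --eq-epsilon.
newtype EqEpsilon = EqEpsilon Double

instance IsOption EqEpsilon where
  defaultValue = EqEpsilon 1e-6
  parseValue = fmap EqEpsilon . safeRead
  optionName = return "eq-epsilon"
  optionHelp = return "Tolerance for approximate equality assertions"

main :: IO ()
main = defaultMainWithIngredients
         (includingOptions [Option (Proxy :: Proxy EqEpsilon)]
          : defaultIngredients)
         tests
  where
    -- askOption makes the parsed value available inside the test tree,
    -- e.g. to write it into a global IORef before the tests run.
    tests = askOption $ \(EqEpsilon eps) ->
      testCase "epsilon is plumbed through" (print eps)
```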
> My concern (in addition to code pollution) is that this way we hardwire the rigidity of the test suite into the code, whereas using the existing CLI option this could easily be adjusted, without code changes and without recompiling.
Indeed, if we test an engine change, we may want to temporarily increase or decrease the epsilon across all tests.
> Perhaps we need something else. An ability to express the epsilon for each check in relation to the CLI epsilon, for example.
I think it's too fiddly for a user to specify "default eps + 0.01". Perhaps instead a new option `--eps-override` that overrides any individually specified epsilon with the one from the command line (or the default one, if none is given)?
> Or another idea could be to collect all those resulting differences from all checks, printing a summary at the end, highlighting the biggest one, and failing if it is bigger than the CLI epsilon.
Such stats would definitely be useful. I think that's yet another bool option: "don't fail tests, but collect, list and summarize the differences vs the expected results". This looks like a completely new feature to me. I think it at least makes sense to open a ticket about it.
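A crude way to prototype that without special Tasty support, in the style of the global-`IORef` sketch earlier (all names made up for illustration): record the observed differences instead of failing, then summarize once at the end.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import Data.List (sortOn)
import Data.Ord (Down (..))
import System.IO.Unsafe (unsafePerformIO)

-- Accumulates (test label, observed difference) instead of failing.
observedDiffs :: IORef [(String, Double)]
observedDiffs = unsafePerformIO (newIORef [])
{-# NOINLINE observedDiffs #-}

recordDiff :: String -> Double -> Double -> IO ()
recordDiff label expected actual =
  modifyIORef' observedDiffs ((label, abs (expected - actual)) :)

-- Print all differences, biggest first; report success only if the
-- worst one stays within the CLI epsilon.
summarize :: Double -> IO Bool
summarize cliEps = do
  diffs <- readIORef observedDiffs
  let worstFirst = sortOn (Down . snd) diffs
  mapM_ print worstFirst
  return (all ((<= cliEps) . snd) worstFirst)
```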
> Not sure if I understand the last use case. First of all, our `--eq-epsilon` is implemented in `main` in each of the 3 test suites (`MinimalTest`, `ExtremelyLongTest` and `ShortTestForCI`). Any client would have to re-implement that, or rely on the default value (`1e-6`). Second, if you have multiple different clients with different epsilon requirements, why can't you just run them separately?
That's a good point. It should be convenient enough to create a new module/new cabal component such as `ShortTestForCI.hs`/`shortTestForCI` for each client I concurrently do ML data analysis for. What's the best way to set a different default epsilon in the `main` function of such a test file?
From yet another POV, we are so far experimenting with our APIs. So if the result of the experiment is "`assertEqualUpToEps` is terribly expensive to maintain", that's valuable, and if it's "`assertEqualUpToEps` is cheap, but after half a year of being advertised in the README nobody ever used it", that's also a good outcome. So I wouldn't worry too much about getting this just right.
> I think it's too fiddly for a user to specify "default eps + 0.01".
I was rather thinking of things like $2.5\epsilon$ or $\epsilon^2$. But before we continue, can we verify whether this is really of practical concern? I mean, do we have a real-world example of a numerical computation that would be out of bounds with the default $\epsilon$, but within the bounds of, say, $10\epsilon$, and hence OK?
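If we did go the relative route, a tiny combinator on top of the hypothetical `eqEpsilonRef`/`assertEqualUpToEps` sketch from earlier would suffice:

```haskell
-- Derive the tolerance for a single check from the CLI-configurable epsilon.
withScaledEps :: (Double -> Double) -> (Double -> Assertion) -> Assertion
withScaledEps scale check = do
  eps <- readIORef eqEpsilonRef
  check (scale eps)

-- 2.5 * eps:   withScaledEps (* 2.5) $ \e -> assertEqualUpToEps e x y
-- eps squared: withScaledEps (** 2)  $ \e -> assertEqualUpToEps e x y
```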
> Perhaps instead a new option `--eps-override` that overrides any individually specified epsilon with the one from the command line (or the default one, if none is given)?
I can already see in my mind's eye those huge codebases, all polluted with fancy epsilons like `0.0000002` or `0.00000003`, where nobody understands any longer where they came from, or whether they still make sense (something could have changed 4 levels down from here), or why some of them are failing, so instead we `--eps-override` everything with `1e-4` and pretend it's all good.

I don't think hardcoding those epsilons in the code is a good idea.
> Such stats would definitely be useful. I think that's yet another bool option: "don't fail tests, but collect, list and summarize the differences vs the expected results". This looks like a completely new feature to me. I think it at least makes sense to open a ticket about it.
Yes, this would be a complete redesign. Not sure if it's supported by `Tasty`.
> That's a good point. It should be convenient enough to create a new module/new cabal component such as `ShortTestForCI.hs`/`shortTestForCI` for each client I concurrently do ML data analysis for. What's the best way to set a different default epsilon in the `main` function of such a test file?
The default is set in `TestCommonEqEpsilon.hs`, then overwritten by the CLI option.
> Do we have a real-world example of a numerical computation that would be out of bounds with the default $\epsilon$, but within the bounds of, say, $10\epsilon$, and hence OK?
Yes. If you set the epsilon in `close1` to `1e-11`, the test

```
cabal test shortTestForCI --enable-optimization --test-options='-p "Compare two forward derivatives and gradient for Mnist1"'
```

fails; if you set it to `1e-10`, it passes (at least judging by a single run on my machine). This is real MNIST digit recognition, so real world. To be sure the failing equality can be restricted to that between the gradient and the derivative (both real-world methods) and not the perturbation, I've removed the perturbation part from `qcPropDom` and I'm getting the same results in a single run.
> I can already see in my mind's eye those huge codebases, all polluted with fancy epsilons like `0.0000002` or `0.00000003`, where nobody understands any longer where they came from, or whether they still make sense (something could have changed 4 levels down from here), or why some of them are failing, so instead we `--eps-override` everything with `1e-4` and pretend it's all good.

I think ML is often done via throw-away scripts. So forgetting why `0.00000003` was chosen is not an issue.
> I don't think hardcoding those epsilons in the code is a good idea.

For our library code, I agree. But we don't hardcode there. For our library tests, I don't need to remember why `0.00000003` was chosen to get the benefit of a test telling me something is now less precise than it was for the last 4 months. I may then blindly multiply the epsilon by 10 and be done with it, but I got my warning. And for ML scripts written in a few days to get an answer and then deleted? I don't see any harm in hardcoding.
> The default is set in `TestCommonEqEpsilon.hs`, then overwritten by the CLI option.
Sure. But how do I easily change that per cabal component (or, even better, per file)?

In case it's not clear: I want to keep the code written for each of my 10 clients in a different cabal component of my program that depends on our library. Each client has a different epsilon requirement. I tweak my engine (or perhaps also the local checkout of the library) globally, so each client is affected, so I want CI to make sure no epsilon is violated.

Edit: and to explain even more: clients don't want my code, they want ML data analysis, perhaps many times over a few months, as the data changes. Then I trash the code written for them, but the engine or library improvements remain.
@sfindeisen: apologies for editing your comment while I tried to answer it (I chose "Edit" instead of "Quote reply" from the menu). I'm afraid some of the text got lost. Could you recover it?
This may be harder than I thought, because `AssertClose` needs to work for shaped tensors, see `assertEqualUpToEpsS`. Unless `OS.toList` solves the issue completely and GHC doesn't complain about missing `OS.Shape` constraints. Wait, `OS.Array` is an instance of `Traversable`, so perhaps the old `AssertClose` code already covers this whole case?

Edit: nope, the `Traversable` instance is only for the boxed version. However, we have already defined a `MonoFunctor` instance, so `MonoTraversable` may be doable.
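If `OS.toList` does work out, the shaped case might reduce to the list instance of the class; a hedged sketch, assuming orthotope's `Data.Array.ShapedS` and that `OS.Shape sh` (possibly plus a `Storable` constraint) is what GHC asks for:

```haskell
{-# LANGUAGE FlexibleInstances #-}
import qualified Data.Array.ShapedS as OS

-- Flatten and reuse the list instance; the shapes necessarily agree,
-- being equal at the type level.
instance OS.Shape sh => AssertClose (OS.Array sh Double) where
  assertCloseEps eps expected actual =
    assertCloseEps eps (OS.toList expected) (OS.toList actual)
```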
Argggh. Now I did it: editing the comment rather than replying. I wonder if GitHub have changed some UI, as I don't recall ever having done that before. Let me see if I can restore it.
> Then I have a (stupid) question: why do we want to be able to do those approximate comparisons using hardcoded epsilon values, like here:
>
> ```haskell
> assertEqualUpToEps (1e-10 :: Double)
>   (grad foo (1.1, 2.2, 3.3))
>   (2.4396285219055063, -1.953374825727421, 0.9654825811012627)
> ```
>
> rather than relying on the CLI setting? Why do we want to have a multitude of different, hardcoded epsilon values scattered throughout the code, ignoring the CLI setting? What use case does this serve?
Because the value of the tolerance is a function of the code of `foo` and the precision of its arguments (known to be IEEE double here). It's not a question of what the command-line user believes, it's a question of how the computation is expected to propagate imprecision. Computing the "correct" tolerance requires (in decreasing order of accuracy):

1. Analyzing the code of `(grad foo)`, using propagation rules such as those in WECSSKAFPA ("What Every Computer Scientist Should Know About Floating-Point Arithmetic"). This is of course hard, particularly as the code is machine-generated, and we may wish to allow ourselves the freedom to change the generated sequence of operations.
2. A rule of thumb such as `sqrt(eps)`.

I'm sure you accept that the "golden" target values in the third line need to be hardcoded; I argue that the tolerance is no different.
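For the rule-of-thumb option, machine epsilon for `Double` can be computed directly from the significand width, giving the ballpark tolerance:

```haskell
-- Machine epsilon for Double: 2^-52, about 2.22e-16.
machEps :: Double
machEps = 2 ** fromIntegral (1 - floatDigits (1 :: Double))

-- The sqrt(eps) rule of thumb then gives a tolerance of about 1.49e-8.
tolRuleOfThumb :: Double
tolRuleOfThumb = sqrt machEps
```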
A related discussion is at https://www.fitzgibbon.ie/floating-point-equality, although not in the context of unit testing - I should update that.
BTW, I tend to reserve the use of `epsilon` to mean the actual machine precision, e.g. 2e-16 for double, and then `tolerance` for the specific tolerance value for a given test.
> I can already see in my mind's eye those huge codebases, all polluted with fancy epsilons like `0.0000002` or `0.00000003`, where nobody understands any longer where they came from, or whether they still make sense (something could have changed 4 levels down from here)

This is absolutely true, and sadly must be true. It's not entirely unlike `0x5f3759df`. One could get rid of the constant, but the game would run at 28 FPS, not 30, and that's millions of dollars in lost sales, albeit a saving of thousands of dollars in maintenance costs (all figures 1990s :) ).
But again, the "golden" target values are also magic constants. Of course, it's easier to understand them - they must have come from some manual computation (e.g. a symbolic algebra package), and we should probably add a link to the script that generated them. We should similarly add a link to the back of the envelope used to compute tol
.
> Argggh. Now I did it: editing the comment rather than replying. I wonder if GitHub have changed some UI, as I don't recall ever having done that before. Let me see if I can restore it.
I've now restored the comment (the "Then I have" one). I think all comments are corrected now. :)
Thank you very much, Staszek. Let me know if you'd like to attempt #68.
Apologies, I meant #66, which has a lower chance of being impossible. Only then the impossible one, please.
The client has completely changed the spec for testing up to eps (#64; or rather the management didn't ask him in time and now finds excuses). Now `assertClose` should be called `assertEqualUpToEps`, should take the eps as the first argument and should work on many types, just as `@?~` did previously; `@?~` should now differ only in that it uses the default eps from the `IORef`. The accepted types should include tuples (up to 10 elements, say) of any types accepted previously.

@sfindeisen: you have dibs on this as the author of the subsystem, but please feel free to skip the privilege and instead take something harder, e.g., the generic library thing from #64 [edit: started in #66].