Closed: Mikolaj closed this issue 2 years ago.
Ok, so the idea is to retain the CLI-configurable `IORef`-based epsilon but only use it with `@?~` and not with `assertClose`/`assertEqualUpToEps`. Correct?

Fine to keep the existing `class AssertClose` and its instances?
> Ok, so the idea is to retain the CLI-configurable `IORef`-based epsilon but only use it with `@?~` and not with `assertClose`/`assertEqualUpToEps`. Correct?

Correct. We might add an optional parameter to `assertEqualUpToEps`, but I'd vote for simplicity and requiring the parameter always.
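To make the division of labour concrete, here is a minimal sketch of how the two entry points could relate. The name `eqEpsilonRef` and the exact signatures are illustrative, not the library's actual code:

```haskell
import Data.IORef (IORef, newIORef, readIORef)
import System.IO.Unsafe (unsafePerformIO)
import Test.Tasty.HUnit (Assertion, assertBool)

-- Global epsilon, to be set from the --eq-epsilon CLI option in main.
eqEpsilonRef :: IORef Double
eqEpsilonRef = unsafePerformIO (newIORef 1e-6)
{-# NOINLINE eqEpsilonRef #-}

-- Explicit variant: the tolerance is always given at the call site.
assertEqualUpToEps :: Double -> Double -> Double -> Assertion
assertEqualUpToEps eps expected actual =
  assertBool ("expected " ++ show expected ++ " +/- " ++ show eps
              ++ ", got " ++ show actual)
             (abs (expected - actual) <= eps)

-- Operator variant: reads the CLI-configurable epsilon instead.
(@?~) :: Double -> Double -> Assertion
actual @?~ expected = do
  eps <- readIORef eqEpsilonRef
  assertEqualUpToEps eps expected actual
```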
> Fine to keep the existing `class AssertClose` and its instances?

I guess. But we'd need many more instances ^^^, basically whatever any hapless newbie may want to type, so perhaps all the existing instances can/should be reorganized, I don't know.
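For reference, the kind of instance proliferation meant here could look like the sketch below; the real `AssertClose` class may well be shaped differently:

```haskell
import Test.Tasty.HUnit (Assertion, assertBool)

-- Sketch of an AssertClose-style class: the scalar instance does the
-- actual comparison, compound instances just recurse.
class AssertClose a where
  assertCloseEps :: Double -> a -> a -> Assertion

instance AssertClose Double where
  assertCloseEps eps expected actual =
    assertBool ("expected " ++ show expected ++ " +/- " ++ show eps
                ++ ", got " ++ show actual)
               (abs (expected - actual) <= eps)

-- A real implementation would also check that the lengths agree.
instance AssertClose a => AssertClose [a] where
  assertCloseEps eps expected actual =
    mapM_ (uncurry (assertCloseEps eps)) (zip expected actual)

-- ... plus one such instance per tuple size a newbie might type:
instance (AssertClose a, AssertClose b) => AssertClose (a, b) where
  assertCloseEps eps (e1, e2) (a1, a2) = do
    assertCloseEps eps e1 a1
    assertCloseEps eps e2 a2
```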
Then I have a (stupid) question: why do we want to be able to do those approximate comparisons using hardcoded epsilon values, like here:

```haskell
assertEqualUpToEps (1e-10 :: Double)
  (grad foo (1.1, 2.2, 3.3))
  (2.4396285219055063, -1.953374825727421, 0.9654825811012627)
```

rather than relying on the CLI setting? Why do we want to have a multitude of different, hardcoded epsilon values scattered throughout the code, ignoring the CLI setting? What use case does this serve?
That's a perfectly legit question. @awf explains that the admissible epsilon depends on the code in question. Just as in the example in #64, you may have two different epsilons for two different code snippets. I imagine exponentiation may blow up any numeric error much more than addition does. Also, perhaps some code may be less numerically stable than another, but not catastrophically unstable.
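A toy illustration of that point (not taken from the codebase): the same 1e-10 input perturbation that addition passes through unchanged is multiplied by roughly exp 20 under exponentiation:

```haskell
main :: IO ()
main = do
  let x  = 20 :: Double
      dx = 1e-10
  -- Addition propagates the perturbation as-is: prints ~1.0e-10.
  print (abs ((x + dx + 1) - (x + 1)))
  -- exp amplifies it by a factor of exp x ~ 4.9e8: prints ~4.85e-2.
  print (abs (exp (x + dx) - exp x))
```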
I suppose it's also possible you have a requirement from a client that your results should not deviate from a table of known answers to sample problems by more than X. That could be handled with a command-line option, though not if there are many clients with different epsilons and you keep them all in one CI.
My concern (in addition to code pollution) is that this way we hardwire the rigidity of the test suite into the code, whereas using the existing CLI option this could easily be adjusted, without code changes and without recompiling.
Perhaps we need something else. An ability to express the epsilon for each check in relation to the CLI epsilon, for example.
Or another idea could be to collect all those resulting differences from all checks, printing a summary at the end, highlighting the biggest one, and failing if it is bigger than the CLI epsilon.
What do you think?
Not sure if I understand the last use case. First of all, our `--eq-epsilon` is implemented in `main` in each of the 3 test suites (`MinimalTest`, `ExtremelyLongTest` and `ShortTestForCI`). Any client would have to re-implement that, or rely on the default value (`1e-6`). Second, if you have multiple different clients with different epsilon requirements, why can't you just run them separately?
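For what it's worth, the per-suite wiring is small. A sketch of how such an option can be declared with Tasty's `IsOption` machinery (the names here are illustrative, not necessarily identical to the actual code in the test suites):

```haskell
import Data.Proxy (Proxy (..))
import Test.Tasty
import Test.Tasty.HUnit (testCase)
import Test.Tasty.Options

-- A CLI-settable epsilon, along the lines of --eq-epsilon.
newtype EqEpsilon = EqEpsilon Double

instance IsOption EqEpsilon where
  defaultValue = EqEpsilon 1e-6
  parseValue = fmap EqEpsilon . safeRead
  optionName = return "eq-epsilon"
  optionHelp = return "Tolerance for approximate equality assertions"

main :: IO ()
main = defaultMainWithIngredients
         (includingOptions [Option (Proxy :: Proxy EqEpsilon)]
          : defaultIngredients)
         tests
  where
    -- askOption makes the parsed value available inside the test tree,
    -- e.g. to write it into a global IORef before the tests run.
    tests = askOption $ \(EqEpsilon eps) ->
      testCase "epsilon is plumbed through" (print eps)
```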
> My concern (in addition to code pollution) is that this way we hardwire the rigidity of the test suite into the code, whereas using the existing CLI option this could easily be adjusted, without code changes and without recompiling.
Indeed, if we test an engine change, we may want to temporarily increase or decrease the epsilon across all tests.
> Perhaps we need something else. An ability to express the epsilon for each check in relation to the CLI epsilon, for example.
I think it's too fiddly for a user to specify "default eps + 0.01". Perhaps instead a new option `--eps-override` that overrides any individually specified epsilon with the one from the command line (or the default one, if none is given)?
> Or another idea could be to collect all those resulting differences from all checks, printing a summary at the end, highlighting the biggest one, and failing if it is bigger than the CLI epsilon.
Such stats would definitely be useful. I think that's yet another bool option: "don't fail tests, but collect, list and summarize the differences vs the expected results". This looks like a completely new feature to me. I think it at least makes sense to open a ticket about it.
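A crude way to prototype that without special Tasty support, in the style of the global-`IORef` sketch earlier (all names made up for illustration): record the observed differences instead of failing, then summarize once at the end.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import Data.List (sortOn)
import Data.Ord (Down (..))
import System.IO.Unsafe (unsafePerformIO)

-- Accumulates (test label, observed difference) instead of failing.
observedDiffs :: IORef [(String, Double)]
observedDiffs = unsafePerformIO (newIORef [])
{-# NOINLINE observedDiffs #-}

recordDiff :: String -> Double -> Double -> IO ()
recordDiff label expected actual =
  modifyIORef' observedDiffs ((label, abs (expected - actual)) :)

-- Print all differences, biggest first; report success only if the
-- worst one stays within the CLI epsilon.
summarize :: Double -> IO Bool
summarize cliEps = do
  diffs <- readIORef observedDiffs
  let worstFirst = sortOn (Down . snd) diffs
  mapM_ print worstFirst
  return (all ((<= cliEps) . snd) worstFirst)
```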
> Not sure if I understand the last use case. First of all, our `--eq-epsilon` is implemented in `main` in each of the 3 test suites (`MinimalTest`, `ExtremelyLongTest` and `ShortTestForCI`). Any client would have to re-implement that, or rely on the default value (`1e-6`). Second, if you have multiple different clients with different epsilon requirements, why can't you just run them separately?
That's a good point. It should be convenient enough to create a new module/new cabal component such as `ShortTestForCI.hs`/`shortTestForCI` for each client I concurrently do ML data analysis for. What's the best way to set a different default epsilon in the `main` function of such a test file?
From yet another POV, we are so far experimenting with our APIs. So if the result of the experiment is "`assertEqualUpToEps` is terribly expensive to maintain", that's valuable, and if it's "`assertEqualUpToEps` is cheap, but after half a year of being advertised in the README nobody ever used it", that's also a good outcome. So I wouldn't worry too much about getting this just right.
> I think it's too fiddly for a user to specify "default eps + 0.01".
I was rather thinking of things like $2.5\epsilon$ or $\epsilon^2$. But before we continue, can we verify whether this is really of practical concern? I mean, do we have a real-world example of a numerical computation that would be out of bounds with the default $\epsilon$, but within the bounds of, say, $10\epsilon$, and hence OK?
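If we did go the relative route, a tiny combinator on top of the hypothetical `eqEpsilonRef`/`assertEqualUpToEps` sketch from earlier would suffice:

```haskell
-- Derive the tolerance for a single check from the CLI-configurable epsilon.
withScaledEps :: (Double -> Double) -> (Double -> Assertion) -> Assertion
withScaledEps scale check = do
  eps <- readIORef eqEpsilonRef
  check (scale eps)

-- 2.5 * eps:   withScaledEps (* 2.5) $ \e -> assertEqualUpToEps e x y
-- eps squared: withScaledEps (** 2)  $ \e -> assertEqualUpToEps e x y
```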
> Perhaps instead a new option `--eps-override` that overrides any individually specified epsilon with the one from the command line (or the default one, if none is given)?
I can already see in my mind's eye those huge codebases, all polluted with fancy epsilons like `0.0000002` or `0.00000003`, where nobody understands any longer where they came from, or whether they still make sense (something could have changed 4 levels down from here), or why some of them are failing, so instead we `--eps-override` everything with `1e-4` and pretend it's all good.

I don't think hardcoding those epsilons in the code is a good idea.
> Such stats would definitely be useful. I think that's yet another bool option: "don't fail tests, but collect, list and summarize the differences vs the expected results". This looks like a completely new feature to me. I think it at least makes sense to open a ticket about it.
Yes, this would be a complete redesign. Not sure if it's supported by `Tasty`.
> That's a good point. It should be convenient enough to create a new module/new cabal component such as `ShortTestForCI.hs`/`shortTestForCI` for each client I concurrently do ML data analysis for. What's the best way to set a different default epsilon in the `main` function of such a test file?
The default is set in `TestCommonEqEpsilon.hs`, then overwritten by the CLI option.
> Do we have a real-world example of a numerical computation that would be out of bounds with the default $\epsilon$, but within the bounds of, say, $10\epsilon$, and hence OK?
Yes. If you set the epsilon in `close1` to `1e-11`, the test

```
cabal test shortTestForCI --enable-optimization --test-options='-p "Compare two forward derivatives and gradient for Mnist1"'
```

fails; if you set it to `1e-10`, it passes (at least judging by a single run on my machine). This is real MNIST digit recognition, so real world. To be sure the failing equality can be restricted to that between the gradient and the derivative (both real-world methods) and not the perturbation, I've removed the perturbation part from `qcPropDom` and I'm getting the same results in a single run.
> I can already see in my mind's eye those huge codebases, all polluted with fancy epsilons like `0.0000002` or `0.00000003`, where nobody understands any longer where they came from, or whether they still make sense (something could have changed 4 levels down from here), or why some of them are failing, so instead we `--eps-override` everything with `1e-4` and pretend it's all good.

I think ML is often done via throw-away scripts. So forgetting why `0.00000003` was chosen is not an issue.
> I don't think hardcoding those epsilons in the code is a good idea.

For our library code, I agree. But we don't hardcode there. For our library tests, I don't need to remember why `0.00000003` was chosen to get the benefit of a test telling me something is now less precise than it was for the last 4 months. I may then blindly multiply the epsilon by 10 and be done with it, but I got my warning. And for ML scripts written in a few days to get an answer and then deleted? I don't see any harm in hardcoding.
> The default is set in `TestCommonEqEpsilon.hs`, then overwritten by the CLI option.
Sure. But how do I easily change that per cabal component (or, even better, per file)?

In case it's not clear: I want to keep the code written for each of my 10 clients in a different cabal component of my program that depends on our library. Each client has a different epsilon requirement. I tweak my engine (or perhaps also the local checkout of the library) globally, so each client is affected, so I want CI to make sure no epsilon is violated.

Edit: and to explain even more: clients don't want my code, they want ML data analysis, perhaps many times over a few months, as the data changes. Then I trash the code written for them, but the engine or library improvements remain.
@sfindeisen: apologies for editing your comment while I tried to answer it (I chose "Edit" instead of "Quote reply" from the menu). I'm afraid some of the text got lost. Could you recover it?
This may be harder than I thought, because `AssertClose` needs to work for shaped tensors, see `assertEqualUpToEpsS`. Unless `OS.toList` solves the issue completely and GHC doesn't complain about missing `OS.Shape` constraints. Wait, `OS.Array` is an instance of `Traversable`, so perhaps the old `AssertClose` code already covers this whole case?

Edit: nope, the `Traversable` instance is only for the boxed version. However, we have already defined a `MonoFunctor` instance, so `MonoTraversable` may be doable.
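If `OS.toList` does work out, the shaped case might reduce to the list instance of the class; a hedged sketch, assuming orthotope's `Data.Array.ShapedS` and that `OS.Shape sh` (possibly plus a `Storable` constraint) is what GHC asks for:

```haskell
{-# LANGUAGE FlexibleInstances #-}
import qualified Data.Array.ShapedS as OS

-- Flatten and reuse the list instance; the shapes necessarily agree,
-- being equal at the type level.
instance OS.Shape sh => AssertClose (OS.Array sh Double) where
  assertCloseEps eps expected actual =
    assertCloseEps eps (OS.toList expected) (OS.toList actual)
```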
Argggh. Now I did it: editing the comment rather than replying. I wonder if GitHub have changed some UI, as I don't recall ever having done that before. Let me see if I can restore it.
> Then I have a (stupid) question: why do we want to be able to do those approximate comparisons using hardcoded epsilon values, like here:
>
> ```haskell
> assertEqualUpToEps (1e-10 :: Double)
>   (grad foo (1.1, 2.2, 3.3))
>   (2.4396285219055063, -1.953374825727421, 0.9654825811012627)
> ```
>
> rather than relying on the CLI setting? Why do we want to have a multitude of different, hardcoded epsilon values scattered throughout the code, ignoring the CLI setting? What use case does this serve?
Because the value of the tolerance is a function of the code of `foo` and the precision of its arguments (known to be IEEE double here). It's not a question of what the command-line user believes, it's a question of how the computation is expected to propagate imprecision. Computing the "correct" tolerance requires (in decreasing order of accuracy):

1. Analyzing the code of `(grad foo)`, using propagation rules such as those in WECSSKAFPA ("What Every Computer Scientist Should Know About Floating-Point Arithmetic"). This is of course hard, particularly as the code is machine-generated, and we may wish to allow ourselves the freedom to change the generated sequence of operations.
2. A rule of thumb such as `sqrt(eps)`.

I'm sure you accept that the "golden" target values in the third line need to be hardcoded; I argue that the tolerance is no different.
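For the rule-of-thumb option, machine epsilon for `Double` can be computed directly from the significand width, giving the ballpark tolerance:

```haskell
-- Machine epsilon for Double: 2^-52, about 2.22e-16.
machEps :: Double
machEps = 2 ** fromIntegral (1 - floatDigits (1 :: Double))

-- The sqrt(eps) rule of thumb then gives a tolerance of about 1.49e-8.
tolRuleOfThumb :: Double
tolRuleOfThumb = sqrt machEps
```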
A related discussion is at https://www.fitzgibbon.ie/floating-point-equality, although not in the context of unit testing - I should update that.
BTW, I tend to reserve the use of `epsilon` to mean the actual machine precision, e.g. 2e-16 for double, and then `tolerance` for the specific tolerance value for a given test.
> I can already see in my mind's eye those huge codebases, all polluted with fancy epsilons like `0.0000002` or `0.00000003`, where nobody understands any longer where they came from, or whether they still make sense (something could have changed 4 levels down from here)

This is absolutely true, and sadly must be true. It's not entirely unlike `0x5f3759df`. One could get rid of the constant, but the game would run at 28 FPS, not 30, and that's millions of dollars in lost sales, albeit a saving of thousands of dollars in maintenance costs (all figures 1990s :) ).
But again, the "golden" target values are also magic constants. Of course, it's easier to understand them - they must have come from some manual computation (e.g. a symbolic algebra package), and we should probably add a link to the script that generated them. We should similarly add a link to the back of the envelope used to compute tol
.
> Argggh. Now I did it: editing the comment rather than replying. I wonder if GitHub have changed some UI, as I don't recall ever having done that before. Let me see if I can restore it.
I've now restored the comment (the "Then I have" one). I think all comments are corrected now. :)
Thank you very much, Staszek. Let me know if you'd like to attempt #68.
Apologies, I meant #66, which has a lower chance of being impossible. Only then the impossible one, please.
The client has completely changed the spec for testing up to eps (#64; or rather the management didn't ask him in time and now finds excuses). Now `assertClose` should be called `assertEqualUpToEps`, should take the eps as the first argument and should work on many types, just as `@?~` did previously; `@?~` should now differ only in that it uses the default eps from the `IORef`. The accepted types should include tuples (up to 10 elements, say) of any types accepted previously.

@sfindeisen: you have dibs on this as the author of the subsystem, but please feel free to skip the privilege and instead take something harder, e.g., the generic library thing from #64 [edit: started in #66].