edf-hpc / verrou

floating-point errors checker
https://edf-hpc.github.io/verrou/vr-manual.html
GNU General Public License v2.0
54 stars 13 forks source link

Documentation of uncounted operation #29

Closed hbrunie closed 3 years ago

hbrunie commented 4 years ago

Hi,

what does uncounted operation mean? Thanks, Cheers, Hugo

HadrienG2 commented 4 years ago

From memory, these are floating-point operations that verrou did not perturbate, either because it cannot perturbate them yet (e.g. comparisons) or because it was not configured to perturbate them (e.g. x87 floating-point operations aren't perturbated by default).

hbrunie commented 4 years ago

Thank you for these information.

Does sqrt fall into one of these categories? 27 ==9757== Uncounted operation: Sqrt64Fx2 (5751)

Also I launched verrou using these exclude libraries:

sym lib

But I don't think this message comes from these exclude, as many other function are excluded and do not throw any warning.

hbrunie commented 4 years ago

Also this is not in the same topic, but if you accept to answer here, Verrou shows me that there is a NaN created from log10(a), because a<0. This happens only when Verrou applies random rounding. But this does not happen in "normal" execution. A part from the fact that the program is sensitive to small perturbations, what can I learn from this? How should I investigate this? Thank you for this tool!

HadrienG2 commented 4 years ago

Regarding Sqrt64Fx2, I cannot answer with my current Verrou knowledge (I'm only a user and occasional patch submitter, not one of the main developers), so you'll need to wait for @ffevotte or @lathuili to pass by.


Regarding log10 NaNing out with random rounding, such "edge effects" can either be a false positive of verrou's analysis or an indication of a genuine fragility in your computation. You need to investigate further in order to understand which one it is.

One reason why it could indicate a genuine problem is that it suggests that your computation is passing a zero, or something close to zero, to log10. This is suspicious, because log(0) is mathematically undefined in the set of reals (which floating-point computations usually try to emulate, and which some HPC compilers assume floating-point numbers always contain), and logarithms of very small numbers can be less precise than those of "normal" numbers and easily diverge into -inf under very small changes to initial conditions or parallel computation scheduling.

As a first debugging step, I would walk up the backtrace emitted by Verrou when the NaN is created, to try to understand which part of your codebase is (indirectly) leading to a negative value being passed into log10 under random rounding. Note that if the failure is spurious, you can use --vr-seed together with the RNG seed printed by Verrou at the beginning of execution in order to stabilize the pseudorandom rounding into a failing configuration.

If this investigation leads you to believe that this is a false-positive of verrou, you have several options:

hbrunie commented 4 years ago

Thank you for such precise answer!

I'll leave the issue open for the Sqrt64Fx2 question. I will investigate in depth the source of the close to zero values passed to my log10. Thanks a lot, Hugo

lathuili commented 4 years ago

I'm back after a long break: sorry for the late answer.

Verrou is not yet able to instrument all floating point operations :

Amoung these uncounted operations there are the square root, cosinus and sinus when done with hardware instructions. When the compiler decides to use inlined hardware instructions instead of the dynanic library, the libm.ex exclusion file is not useful. In your case you are simply ignoring the rounding error implied by the square root. If the evaluation of the accuracy is not sufficient for your application, you have problem in the rest of your code and so ignoring the square root is not a problem for the localization step. If you want to hide the warnings you can use the suppression mecanism of valgrind. If you want to instrument the square root, for now I can only propose a workaround. You need to recompile your application to use the libm dynamic library and use the Interlibmath via LD_PRELOAD (short documentation Interlimath/README.md and one exemple in unitTest/check-libM/Makefile). The use of interlibmath is not yet straightforward.

To make it work without this workaround will be possible but requires a verrou developpment which is not (yet) high ranked in your roadmap.

I'm curious to see the mathematical expression f(x) which compute a. If we can prove that f(x)> 0 over the definition set, the problem is in the floating point implementation of f(x) or is a verrou false positive (delta debug can pinpoint both). If we can not prove it, the problem is the algorithm (or at least our algorithm understanding).