edf-hpc / verrou

floating-point errors checker
https://edf-hpc.github.io/verrou/vr-manual.html
GNU General Public License v2.0
49 stars 13 forks source link

verrou_dd usability issues #7

Closed HadrienG2 closed 6 years ago

HadrienG2 commented 6 years ago

Besides #6 , another issue which I have with verrou_dd is that it is a bit difficult to use correctly. Here are some suggestions of quality-of-life improvements:

HadrienG2 commented 6 years ago

Note that these ideas are based on my experience with verrou v1.1.0, but looking at the upcoming release notes, it looks like some of them are already in the process of being addressed.

ffevotte commented 6 years ago

Thanks a lot for your insight. For future reference, here are a few thoughts/comments about the issues you mention.

  1. should be fixed by 8155d3d.

  2. probably won't be addressed. One of the reasons why is related to point 6.: running all scripts in the same initial directory would make it difficult for a lot of programs to meet the "independent runs" requirements.

  3. good idea. Although not planned yet, it will probably be addressed in a future release. Do you have any specific idea / use case in mind? There are a few detectable reasons why the cache should be invalidated (such as changes in the run_script or cmp_script). But there are also a lot of undetectable reasons: bad luck in random perturbations that wrongly makes a configuration valid, changes in any program or data file used in the run_script of cmp_script, etc. I don't see any way to handle such cases without asking the user to manually empty the cache.

  4. verrou_dd should probably not be responsible for running verrou, since (i) the user should be able to specify any command-line option they want, and (ii) not all commands in the run_script should be executed within valgrind/verrou. If I understand your proposition correctly, exporting a $VERROU environment variable containing the correct command to run would be doable, but this would always contain something like  valgrind --tool=verrou, so that I'm not sure it would be very useful.

  5. good idea. It is already in the TODO list for a future release (probably the release after v2.0.0).

  6. should be at least improved by 8155d3d (but there might still be some ambiguous error reporting).

ffevotte commented 6 years ago

Regarding point 5 (verrou_dd parallelization), parallelization at the scale of the global DD algorithm is not yet available (but is on the TODO list as mentioned above). However, commit 536569ff introduced a feature to enable parallelism at the scale of the test of a given configuration.

When the VERROU_DD_NUM_THREADS environment variable is set, up to VERROU_DD_NUM_THREADS tests are run in parallel to determine if a given configuration is stable or not. This may speed up the test for stable configurations (since VERROU_DD_NRUNS runs must be performed in this case to validate the configuration). For unstable configurations, it does not save as much time: if a given configuration fails to validate the cmp_script at its first run, there would have been no need to launch many runs in parallel to get the same result.

Of course, if you have enough computational power, the optimal parallel setting is obtained when VERROU_DD_NUM_THREADS=VERROU_DD_NRUNS.

HadrienG2 commented 6 years ago
  1. Thanks a lot for this!
  2. Ah, yes, this is true, I did not think about the fact that many applications may leave cached state around in the working directory. Scratch this idea then.
  3. I can think of two paths towards improving the cache invalidation situation in the face of e.g. binary recompilation or configuration changes. One that is sane but unsatisfying, and one that is a little bit insane, but has more potential :)
    • Sane-but-meh option: Invalidate the cache by default when a top-level verrou_dd command is run. If the command calls itself recursively, make sure that the recursive invocations do not invalidate the cache. This can be done by adding an optional "--reuse-cache" script option, which will also allow users who know what they are doing to reuse a cache from a previous run.
    • Insane option: Well, in principle, one could strace run-script or cmp-script calls in order to tell which files the run_script and cmp_script recursively depend on, and then compute a hash of them to detect dependency changes... ^^
  4. You're right that this one was probably not such a good idea either.
  5. I have quickly played with VERROU_DD_NUM_THREADS on the long_name branch. At the time, I noticed two things:
    • As you explain, the current implementation effectively puts a processing barrier after each configuration is tested. It would be nice to be able to run tests of multiple configurations in parallel, although I can imagine this will require quite a bit of changes in the task scheduling and logging infrastructure (on the scheduling side you need to schedule runs more lazily and deal with concurrent configs, on the printing side you would need advanced VT-100 hackery to preserve the current output).
    • I observed a couple of transient issues where functions were mis-tagged as numerically unstable. At the time I attributed this to my code, but since then I have done a sequential run on the same program without issues. Maybe there are some race conditions around in the current task scheduling code, maybe there was a bug which has since been fixed, I will try investigate this further and tell you what I find.
  6. There is still an "internal error" message in the DDsym.allDeltaFailedMsg() function of verrou_dd. But I have not explored what error is being reported by this function yet, maybe that's an appropriate error message in this case.
HadrienG2 commented 6 years ago

These remaining two issues start to stray a bit apart from the original ergonomics focus, so maybe it would be better to split them into two new tickets. Shall I do that?

ffevotte commented 6 years ago

Yes, please create new issues for points 3. and 5.

I believe 6525492 fixes the last ambiguous error message in verrou_dd, so that we might be able to close this issue after that.