debbiemarkslab / EVcouplings

Evolutionary couplings from protein and RNA sequence alignments
http://evcouplings.org

plmc runtime issues: single vs. double precision #200

Open thomashopf opened 5 years ago

thomashopf commented 5 years ago

According to Frank, the single-precision binary (meant to be faster) we use for the pipeline on O2 takes longer than the double-precision version. I think this is particularly interesting for you @aggreen and @kpgbrock.

It needs to be evaluated whether this speed regression is generally the case and whether the issue persists after a single-precision recompile on O2 (it could be a vector instruction set or library issue after all; the binary currently in use was compiled on O1 way back). If single precision still doesn't give any speed gains, we should probably switch to the double-precision version.

At least for N=1, Frank saw quite substantial differences in EC precision (and non-deterministic results between multi-CPU reruns with the double-precision version), so that is also something to keep in mind and test.

@jingraham do you have any ideas what might be going on here?

jingraham commented 5 years ago

I just did a quick test to try and reproduce the performance differences. tl;dr The results are at least in a sensible order:

| make target | precision | time 1 (s) | time 2 (s) |
| --- | --- | --- | --- |
| all-openmp | double | 128.0 | 131.2 |
| all-openmp32 | single | 95.8 | 93.2 |

To produce these, I ran an interactive session with `srun --pty -p interactive --mem 4GB -c 4 -t 0-06:00 /bin/bash`, compiled from scratch off of master, and then ran the example usage for the DHFR alignment: `bin/plmc -o example/protein/DHFR.params -le 16.0 -lh 0.01 -m 100 -g -f DYR_ECOLI example/protein/DHFR.a2m`. Both methods ran for 100 iterations.

@thomashopf There may be a few possible explanations, including infrastructure changes as you mentioned, or potentially that the single- and double-precision runs are converging differently. For example, on unusual alignments with strong curvature, the double-precision optimizer may be more stable or converge faster. If Frank, you, @aggreen, or @kpgbrock can share alignments plus usage for reproducing, I'd be happy to take a look.

poelwijk commented 5 years ago

For me, the 32-bit compiled version runs quite a bit slower. The binaries I used are: 32-bit: bin/plmc (the standard pipeline binary); 64-bit: <Benni's home>/plmc/bin/plmc (compiled by Benni). The 64-bit version ran in 42 hrs, while the 32-bit version took 67 hrs. Everything else, apart from the paths to plmc, is the same. Both runs are copied to my directory <Frank's home>/Precision Issue/, under 32bit/ and 64bit/.

jingraham commented 5 years ago

Thanks @poelwijk. Indeed it seems like the pipeline plmc binary is slow! I repeated the DHFR test from above with the binaries you mentioned:

| binary | precision | time (s) |
| --- | --- | --- |
| benni's | "double" | 128.3 |
| pipeline | "single" | 232.9 |

So the double-precision binary behaves as expected, but the pipeline binary is considerably slower. It looks like it was compiled in January 2017, and there haven't been any commits that change performance since then, so I bet @thomashopf is spot on that it's a compilation / optimization / linking issue on O2 hardware.

Pipeline folks - @thomashopf, @aggreen, @kpgbrock - Would one of you be able to do a fresh pull and compile (`make all-openmp32`) of plmc master in the pipeline? I can do it, too, if you let me know the best way to go about it without breaking current runs or usage.

(PS: it's worth updating the plmc source to get the latest EVzoom focus-mode format. Nevermind, see below.)

aggreen commented 5 years ago

Thanks guys. I just went to /groups/marks/pipelines/evcouplings/software/plmc, fetched the updates from origin, then ran `make all-openmp32`.

Looks like this created a new executable, so we should be all set. Thanks @poelwijk for finding this; it will speed things up a lot for me as well.


jingraham commented 5 years ago

Great. Since my interactive session was still open, I reran the pipeline command and can happily report:

| binary | precision | time (s) |
| --- | --- | --- |
| benni's | double | 128.3 |
| pipeline | single | 232.9 |
| pipeline, new | single | 91.5 |

thomashopf commented 5 years ago

Wow. Thanks a lot for looking into this, @jingraham, and for updating, @aggreen.

I have two more questions:

> (PS it's worth updating the plmc source to get the latest EVzoom focus-mode format).

Were there any changes to the output format as compared to the one output by the pipeline?

And is there any non-deterministic part in the optimization? Even running with the same double-precision binary (Benni's), two runs may give sets of ECs with different precision. I had a quick look at the plmc stdout iteration table, and up to iteration 100 the numbers (target, log likelihood, ...) agree completely but then slowly start to diverge.

Thanks!

poelwijk commented 5 years ago

A little elaboration regarding the latter question by @thomashopf: I sometimes see differences as large as 5% in true-positive ECs just by running evcouplings multiple times with the exact same input.

jingraham commented 5 years ago

@thomashopf, I was mistaken in reading some commits; everything should actually be fine regarding the EVzoom output. It was only the bundled MATLAB script for JSON export that I had updated to match the evcouplings Python function.

@thomashopf, @poelwijk, I ran some experiments to check determinism on the DHFR example run. It looks like it is specifically happening for multithreaded versions of the code:

| platform | target | md5 of DHFR.params, run 1 | md5 of DHFR.params, run 2 |
| --- | --- | --- | --- |
| macOS 10.14.2 | all32 | 067e135941fd72ce4fdf60b3539f1710 | 067e135941fd72ce4fdf60b3539f1710 |
| macOS 10.14.2 | all-openmp32 | 1d0542b761c196d500240a3b2acc0fa7 | 067e135941fd72ce4fdf60b3539f1710 |
| HMS O2 | all32 | 067e135941fd72ce4fdf60b3539f1710 | 067e135941fd72ce4fdf60b3539f1710 |

This would seem to indicate thread-safety issues, but just in case I also ran valgrind on the openmp32 version. I was not able to identify any obvious explanations such as memory leaks or reads of uninitialized values.

My current working hypothesis is non-thread-safe computation. For example, it might be that some of the OpenMP critical sections are not doing safe reductions as intended, as in the sketch below.
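
To illustrate (a minimal sketch, not plmc's actual code), here is the difference in C between an unsafe shared accumulation and an OpenMP `reduction` clause. Note that even the safe version is only race-free: the order in which per-thread partial sums are combined is left to the runtime, so float results need not be bitwise identical across runs.

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N];
    for (int i = 0; i < N; i++)
        x[i] = 1.0f / (float)(i + 1);

    /* BROKEN: concurrent read-modify-write of a shared variable is a
       data race and can silently drop contributions. */
    float racy = 0.0f;
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        racy += x[i];

    /* SAFE: each thread accumulates a private partial sum, which
       OpenMP combines at the end. Race-free, but the combination
       order is unspecified. */
    float safe = 0.0f;
    #pragma omp parallel for reduction(+:safe)
    for (int i = 0; i < N; i++)
        safe += x[i];

    printf("racy = %.8f, safe = %.8f\n", racy, safe);
    return 0;
}
```

(Compile with e.g. `gcc -fopenmp -O2 sum.c`.)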

It seems as though `make all32` should be a safe patch in the meantime, though it would give up all multithreading speedups. I will try to isolate the thread-safety issues.

UPDATE: A quick test suggests that simply running the OpenMP code in single-threaded mode with `-n 1` also gives the same checksums as above.

jingraham commented 5 years ago

Looking into this further, I am not sure if complete floating-point determinism will be feasible with the current multithreading.

This is because floating-point arithmetic is non-associative, and the particular order in which OpenMP sums sets of numbers together may change every time a program runs. For a Fortran example of this issue, see here.

In plmc, the main parallel operation is computing the per-column contributions to the pseudolikelihood. Different orderings of this summation can have subtly different roundoff errors, which may accumulate into a discernible difference in the output.
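
To make the non-associativity concrete, here is a tiny C example (with made-up values, nothing to do with plmc's data) where two groupings of the same three-term float sum give different answers:

```c
#include <stdio.h>

int main(void) {
    /* Adjacent floats near 1e8 are 8.0 apart, so a contribution of
       1.0 either survives or is rounded away depending on when it
       is added. */
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    printf("(a + b) + c = %g\n", (a + b) + c); /* 1: a and b cancel first  */
    printf("a + (b + c) = %g\n", a + (b + c)); /* 0: c is lost inside b + c */
    return 0;
}
```

A parallel reduction that regroups its partial sums differently from run to run can therefore change the low-order bits of the result, even though every individual operation is correctly rounded.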

Aside from disabling multithreading with `make all32` or the `-n 1` option, another option to consider is switching to double precision. This will also have non-determinism issues, but the roundoff error should be considerably smaller. (In a quick test of 64-bit DHFR runs, the param-file checksums differed but the EC-file checksums were identical.)
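
As a rough illustration of why (a sketch with illustrative values, not a plmc benchmark), accumulating the same series in float and in double shows how much smaller the drift is at 64 bits:

```c
#include <stdio.h>

int main(void) {
    /* Add 0.1 ten million times; the exact answer is 1e6. */
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < 10000000; i++) {
        fsum += 0.1f;
        dsum += 0.1;
    }
    /* The float total drifts visibly away from 1000000, while the
       double total stays very close; reordering errors shrink by a
       similar factor when moving to double precision. */
    printf("float:  %f\n", fsum);
    printf("double: %f\n", dsum);
    return 0;
}
```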

@poelwijk, it sounds like you may have seen some pretty significant non-determinism in your runs. Was this for "typical alignments" or for some of your more unusual recent tasks? Would you be able to share the runs? It would be helpful to (1) know how much variation there was in the raw EC (CN) scores and (2) know how much switching to double precision changes that result.

thomashopf commented 5 years ago

This is really interesting... I had the non-associativity hypothesis at one point but could not imagine that it would have such discernible effects at the EC level (the EC precision difference @poelwijk observed was with the double-precision version).

I do wonder whether it would be a safer bet to use the double-precision version for the pipeline... after all, we still get almost a 2x speed-up compared to the old single-precision binary ;)

jingraham commented 5 years ago

Yes, maybe double precision is the safer bet for now!

Regarding the 5% variation, I would really like to understand it... @poelwijk, @thomashopf, do you happen to have the raw CN scores or param files handy? It would be helpful to figure out whether these values also exhibit 5% variation, or whether it might be a smaller parameter/CN variation that still has significant re-ordering effects on precision (e.g. if many of the CN values near the cutoff are close in magnitude).

poelwijk commented 5 years ago

I copied the triplicate runs to <Frank's directory>/non-determinism/, under th092/ and the reruns. The directory gen_plmc/ holds the single-precision run with otherwise the same parameters. The runs th092/ and th092_rerun2/ are similar in TP; th092_rerun/ is worse by about 4-5%. To get the highest TP for now, I'll either avoid multithreading or do several runs and pick the best (since limited-precision issues can only degrade the result, never improve it, this should be OK).