slava77 closed this issue 5 years ago
assign reconstruction
New categories assigned: reconstruction
@slava77,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @slava77 Slava Krutelyov.
@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
@slava77 Are both `ref` and `new` using OpenBLAS in these cases? Or is `ref` from before the OpenBLAS change?
OK, I just figured out what `ref` is, so I think both are using OpenBLAS. One question is: do the machines running the tests all support the same CPU instruction sets? We now build OpenBLAS with `DYNAMIC_ARCH=1`, so different instructions may be used on a machine supporting e.g. AVX.
I think that we are using some generic "intel" machines, which most likely all come with AVX.
@smuzaffar please check/clarify.
@slava77
I ran 136.7611 on two machines, w/ and w/o AVX, and indeed got different results -- it looks like `ref` ran w/o AVX and `new` w/ AVX. The difference is at the numerical-precision level (~1e-7).
What were the machine specs/flag lists for the two cases?
It sounds like we should update our lib dependencies to something stable.
@slava77 , yes, we are using generic Intel CPUs. It is not guaranteed that all of them will support the same instruction sets: most of them have AVX2, but some have only AVX. I think enabling `DYNAMIC_ARCH=1` might cause issues, especially if we run on a grid node without AVX.
@smuzaffar please remind me what our default is (and where it is set) for CMSSW.
I thought it was core2, but now I only find it in the tensorflow .spec.
There is no single place to set it; every spec file sets it. For tensorflow we set it to core2: https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_11_0_X/gcc700/tensorflow-sources.spec#L29
For openblas we just set `DYNAMIC_ARCH=1`, which selects the target based on the machine it builds on.
For CMSSW, we set it to `-msse3`: https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_11_0_X/gcc700/gcc-toolfile.spec#L185
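As a quick way to see which of these instruction sets a given machine actually advertises, one can inspect the kernel's CPU feature flags. A minimal stdlib-only sketch (Linux-specific, not any official CMSSW tooling); these are the same flags that determine which kernel a `DYNAMIC_ARCH` build picks at runtime:

```python
# Report which common x86 SIMD extensions the local CPU advertises
# (Linux-only sketch: reads the "flags" line of /proc/cpuinfo).
flags = set()
try:
    with open('/proc/cpuinfo') as f:
        for line in f:
            if line.startswith('flags'):
                flags.update(line.split(':', 1)[1].split())
                break
except OSError:
    pass  # not Linux / no procfs available

for isa in ('sse3', 'ssse3', 'sse4_1', 'avx', 'avx2', 'avx512f'):
    print(isa, 'yes' if isa in flags else 'no')
```

Running this on a grid node w/o AVX vs. a newer build machine would show exactly the mismatch discussed above.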
@hqucms please check whether the speedup in openblas, and the request to use it that eventually led to this issue, was a result of compilation flags. Perhaps the easiest is to take openblas, recompile it with `-march=core2` or `-msse3`, and compare to the default where you showed the large speedup.
@slava77
The 4-5x speed-up reported in https://github.com/cms-sw/cmssw/issues/25230#issuecomment-506422754 was using the previous OpenBLAS build in CMSSW, w/ `TARGET=PENRYN DYNAMIC_ARCH=0`. Enabling `DYNAMIC_ARCH` can further reduce the runtime of DeepAK8 by ~25% on machines w/ AVX/AVX2, as mentioned in https://github.com/cms-sw/cmssw/issues/25230#issuecomment-507617237.
I re-ran 136.7611 with the previous version of OpenBLAS (so `DYNAMIC_ARCH=0`) on machines w/ and w/o AVX+AVX2, and got consistent results. So it seems that rolling back to `DYNAMIC_ARCH=0` should solve this stability issue and still give us a reasonable speed-up.
To explain a bit further: with `DYNAMIC_ARCH=1`, OpenBLAS builds optimized kernels for a number of architectures and selects the best one at runtime based on the available instructions.
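In pseudocode, the dispatch idea is roughly the following (illustrative names only, not OpenBLAS's actual internals, whose kernels are written in assembly):

```python
# Sketch of runtime kernel dispatch a la DYNAMIC_ARCH: several variants of
# the same routine are compiled into one library, and the most advanced one
# the running CPU supports is selected once at startup.
def gemm_generic():
    return "generic"   # baseline kernel, runs anywhere (e.g. core2/SSE3)

def gemm_avx():
    return "avx"       # kernel requiring AVX

def gemm_avx2():
    return "avx2"      # kernel requiring AVX2

def select_kernel(cpu_flags):
    """Pick the best kernel allowed by the CPU's feature flags."""
    if 'avx2' in cpu_flags:
        return gemm_avx2
    if 'avx' in cpu_flags:
        return gemm_avx
    return gemm_generic
```

So `select_kernel({'sse3', 'avx'})` returns the AVX variant, while on an SSE3-only grid node the generic kernel is used -- which is why results can differ across machines even with identical binaries.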
@hqucms
thank you for checking the static/fixed target performance.
I propose to switch to `DYNAMIC_ARCH=0`.
IIUC, PENRYN is more advanced than the core2/sse3 we use in the rest of the builds: it includes instructions up to SSE4.1.
To be consistent we should probably get it in core2/sse3 (at least until we centrally change the settings).
@slava77
I also tested `TARGET=CORE2 DYNAMIC_ARCH=0`. It seems to be a little bit slower, but not by much [0], so it looks like we could use that.
A more general question is how strictly we need to stick to such exact reproducibility -- the outputs of the DNNs are probably needed only to 3 or 4 decimal places (certainly not as many as 6 or 7). On the other hand, the speed-up from exploiting more advanced instructions (AVX/AVX2) can be substantial. The same situation applies not just to OpenBLAS but also to onnxruntime, which uses an approach similar to `DYNAMIC_ARCH`.
[0]
gslblas: 20ms
openblas (CORE2): 5.8ms
openblas (PENRYN): 5.4ms
openblas (DYNAMIC_ARCH=1): 3.9ms [cpu supports avx2]
I think that we cannot tolerate bitwise non-reproducibility at the incremental validation step (PR tests).
If bitwise reproducibility can be maintained with different architecture builds, then the dynamic choice is trivially acceptable. Do you understand why we get the differences? Is e.g. the order of operands changing in the floating-point math? Are the relevant/available flags used in compilation to preserve the order?
For the production context, we typically require validation with high stat samples for fixed arch builds. I guess that once the variants are individually validated, it's OK to mix and match them.
OTOH, we've semi-silently accepted the cross-platform differences for the same arch builds (blamed on e.g. the fast reciprocal implementation difference between intel and AMD).
@slava77
> I think that we cannot tolerate bitwise non-reproducibility at the incremental validation step (PR tests). If bitwise reproducibility can be maintained with different architecture builds, then the dynamic choice is trivially acceptable.
So do you mean something like one build for architectures supporting newer instructions like AVX and AVX2, and another one not supporting them? Then I think within each of them bitwise reproducibility will be preserved -- the difference we have seen happens only when comparing across them.
> Do you understand why we get the differences? Is e.g. the order of operands changing in the floating-point math? Are the relevant/available flags used in compilation to preserve the order?
In BLAS libraries the kernels are often written in assembly using different implementations for different instruction sets to maximize performance. I don't know if it's possible to ensure bitwise reproducibility across architectures in that case...
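The ~1e-7 differences reported earlier are exactly the scale of single-precision rounding: when a different kernel reassociates a float32 reduction (scalar vs. vectorized accumulation), different intermediate results get rounded. A minimal stdlib-only illustration, emulating float32 rounding with `struct` (this is a constructed example, not taken from the OpenBLAS kernels):

```python
import struct

def f32(x):
    """Round a Python float (binary64) to binary32, emulating a float32 op."""
    return struct.unpack('f', struct.pack('f', x))[0]

# 2**-24 is half a ULP of 1.0 in float32, so the two associations round
# differently even though real-number arithmetic would give the same answer.
a, b, c = 1.0, 2.0**-24, 2.0**-24

left  = f32(f32(a + b) + c)   # (a + b) + c -> each tiny addend rounds away
right = f32(a + f32(b + c))   # a + (b + c) -> the addends combine first

print(left, right, abs(left - right))  # differ by 2**-23 ~ 1.2e-7
```

Here `left` is exactly 1.0 while `right` is 1.0 + 2**-23, i.e. a discrepancy of ~1.2e-7 from nothing but evaluation order -- consistent with the differences seen between the AVX and non-AVX runs.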
BTW what fraction of CPUs on the grid actually support AVX/AVX2? If it's still only a small fraction then it's probably not worth bothering anyhow...
On 7/12/19 2:33 PM, Huilin Qu wrote:
> BTW what fraction of CPUs on the grid actually support AVX/AVX2? If it's still only a small fraction then it's probably not worth bothering anyhow...
AVX is probably available on most.
I don't know detailed stats; my response was from the general perspective of CMSSW development with alternative arches. The present state requires a fixed best-available common set (sse3 or core2).
@slava77
OK -- so then for the specific problem with OpenBLAS, the solution would be to build it with `TARGET=CORE2 DYNAMIC_ARCH=0`, I suppose?
On 7/12/19 2:53 PM, Huilin Qu wrote:
> OK -- so then for the specific problem with OpenBLAS, the solution would be to build it with `TARGET=CORE2 DYNAMIC_ARCH=0`, I suppose?
Yes, at least for now.
@slava77 OK, I made the changes in https://github.com/cms-sw/cmsdist/pull/5091.
@slava77 @hqucms thank you. I agree with Slava that we should avoid architecture-dependent non-reproducibility as much as possible, especially while the gain looks relatively limited (in comparison with the original version of blas).
+1
Resolved in the build specs of OpenBLAS.
This issue is fully signed and ready to be closed.
This was first noticed as a difference in cms-sw/cmsdist#5063, but since then changes have occasionally been appearing in workflows that are not clearly related to the DeepAK8 tagger inputs or code.
@hqucms please take a look. I suggest running valgrind in a recent IB after the switch to openblas happened.