slava77 closed this issue 5 years ago
assign reconstruction
New categories assigned: reconstruction
@slava77,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @slava77 Slava Krutelyov.
@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
@slava77 Are both `ref` and `new` using OpenBLAS in these cases? Or is `ref` from before the OpenBLAS change?
OK, I just figured out what `ref` is, so I think both are using OpenBLAS. One question is: do the machines running the tests all support the same CPU instruction sets? We now build OpenBLAS with `DYNAMIC_ARCH=1`, so different instructions may be used on a machine supporting e.g. AVX.
I think that we are using some generic "intel" machines, which most likely all come with AVX.
@smuzaffar please check/clarify.
@slava77
I ran 136.7611 on two machines, w/ and w/o AVX, and indeed got different results -- it looks like `ref` ran w/o AVX and `new` w/ AVX. The difference is at the numerical-precision level (~1e-7).
What were the machine specs/flag lists for the two cases?
It sounds like we should update our lib dependencies to something stable.
@slava77 , yes, we are using generic Intel CPUs. It is not guaranteed that all of them will support the same instruction sets: most of them have AVX2, but some have only AVX. I think enabling `DYNAMIC_ARCH=1` might cause issues, especially if we run on a grid node without AVX.
@smuzaffar please remind me what our default is (and where it is set) for CMSSW.
I thought it was core2, but now I only find it in the tensorflow .spec.
There is no single place to set it; every spec file sets it. For tensorflow we set it to core2: https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_11_0_X/gcc700/tensorflow-sources.spec#L29
For openblas we just set `DYNAMIC_ARCH=1`, which selects the target based on the machine it builds on.
For CMSSW, we set it to `-msse3`: https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_11_0_X/gcc700/gcc-toolfile.spec#L185
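As a quick way to see which of these instruction sets a given machine actually advertises, one can inspect the kernel's CPU feature flags. A minimal stdlib-only sketch (Linux-specific, not any official CMSSW tooling); these are the same flags that determine which kernel a `DYNAMIC_ARCH` build picks at runtime:

```python
# Report which common x86 SIMD extensions the local CPU advertises
# (Linux-only sketch: reads the "flags" line of /proc/cpuinfo).
flags = set()
try:
    with open('/proc/cpuinfo') as f:
        for line in f:
            if line.startswith('flags'):
                flags.update(line.split(':', 1)[1].split())
                break
except OSError:
    pass  # not Linux / no procfs available

for isa in ('sse3', 'ssse3', 'sse4_1', 'avx', 'avx2', 'avx512f'):
    print(isa, 'yes' if isa in flags else 'no')
```

Running this on a grid node w/o AVX vs. a newer build machine would show exactly the mismatch discussed above.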
@hqucms please check whether the speedup in openblas, and the request to use it that eventually led to this issue, was a result of compilation flags. Perhaps the easiest is to take openblas, recompile it with `-march=core2` or `-msse3`, and compare to the default where you showed the large speedup.
@slava77
The 4-5x speed-up reported in https://github.com/cms-sw/cmssw/issues/25230#issuecomment-506422754 was using the previous OpenBLAS build in CMSSW, w/ `TARGET=PENRYN DYNAMIC_ARCH=0`. Enabling `DYNAMIC_ARCH` can further reduce the runtime of DeepAK8 by ~25% on machines w/ AVX/AVX2, as mentioned in https://github.com/cms-sw/cmssw/issues/25230#issuecomment-507617237.
I re-ran 136.7611 with the previous version of OpenBLAS (so `DYNAMIC_ARCH=0`) on machines w/ and w/o AVX+AVX2, and got consistent results. So it seems that rolling back to `DYNAMIC_ARCH=0` should solve this stability issue and still give us a reasonable speed-up.
To explain a bit further: with `DYNAMIC_ARCH=1`, OpenBLAS builds optimized kernels for a number of architectures and selects the best one at runtime based on the available instructions.
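In pseudocode, the dispatch idea is roughly the following (illustrative names only, not OpenBLAS's actual internals, whose kernels are written in assembly):

```python
# Sketch of runtime kernel dispatch a la DYNAMIC_ARCH: several variants of
# the same routine are compiled into one library, and the most advanced one
# the running CPU supports is selected once at startup.
def gemm_generic():
    return "generic"   # baseline kernel, runs anywhere (e.g. core2/SSE3)

def gemm_avx():
    return "avx"       # kernel requiring AVX

def gemm_avx2():
    return "avx2"      # kernel requiring AVX2

def select_kernel(cpu_flags):
    """Pick the best kernel allowed by the CPU's feature flags."""
    if 'avx2' in cpu_flags:
        return gemm_avx2
    if 'avx' in cpu_flags:
        return gemm_avx
    return gemm_generic
```

So `select_kernel({'sse3', 'avx'})` returns the AVX variant, while on an SSE3-only grid node the generic kernel is used -- which is why results can differ across machines even with identical binaries.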
@hqucms
thank you for checking the static/fixed target performance.
I propose to switch to `DYNAMIC_ARCH=0`.
IIUC, PENRYN is more advanced than the core2/sse3 we use in the rest of the builds: it includes instructions up to SSE4.1.
To be consistent we should probably get it in core2/sse3 (at least until we centrally change the settings).
@slava77
I also tested `TARGET=CORE2 DYNAMIC_ARCH=0`. It seems to be a little bit slower, but not by much [0], so it looks like we could use that.
A more general question is how strictly we need to stick to such exact reproducibility -- the outputs of the DNNs are probably needed only to 3 or 4 decimal places (certainly not as many as 6 or 7). On the other hand, the speed-up from exploiting more advanced instructions (AVX/AVX2) can be substantial. The same situation applies not just to OpenBLAS but also to onnxruntime, which uses an approach similar to `DYNAMIC_ARCH`.
[0]
gslblas: 20ms
openblas (CORE2): 5.8ms
openblas (PENRYN): 5.4ms
openblas (DYNAMIC_ARCH=1): 3.9ms [cpu supports avx2]
I think that we cannot tolerate bitwise non-reproducibility at the incremental validation step (PR tests).
If bitwise reproducibility can be maintained with different architecture builds, then the dynamic choice is trivially acceptable. Do you understand why we get the differences? Is e.g. the order of operands changing in the floating-point math? Are the relevant/available flags used in compilation to preserve the order?
For the production context, we typically require validation with high stat samples for fixed arch builds. I guess that once the variants are individually validated, it's OK to mix and match them.
OTOH, we've semi-silently accepted the cross-platform differences for the same arch builds (blamed on e.g. the fast reciprocal implementation difference between intel and AMD).
@slava77
> I think that we cannot tolerate bitwise non-reproducibility at the incremental validation step (PR tests). If bitwise reproducibility can be maintained with different architecture builds, then the dynamic choice is trivially acceptable.
So do you mean something like one build for architectures supporting newer instructions like AVX and AVX2, and another one not supporting them? Then I think within each of them bitwise reproducibility will be preserved -- the difference we have seen happens only when comparing across them.
> Do you understand why we get the differences? Is e.g. the order of operands changing in the floating-point math? Are the relevant/available flags used in compilation to preserve the order?
In BLAS libraries the kernels are often written in assembly using different implementations for different instruction sets to maximize performance. I don't know if it's possible to ensure bitwise reproducibility across architectures in that case...
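The ~1e-7 differences reported earlier are exactly the scale of single-precision rounding: when a different kernel reassociates a float32 reduction (scalar vs. vectorized accumulation), different intermediate results get rounded. A minimal stdlib-only illustration, emulating float32 rounding with `struct` (this is a constructed example, not taken from the OpenBLAS kernels):

```python
import struct

def f32(x):
    """Round a Python float (binary64) to binary32, emulating a float32 op."""
    return struct.unpack('f', struct.pack('f', x))[0]

# 2**-24 is half a ULP of 1.0 in float32, so the two associations round
# differently even though real-number arithmetic would give the same answer.
a, b, c = 1.0, 2.0**-24, 2.0**-24

left  = f32(f32(a + b) + c)   # (a + b) + c -> each tiny addend rounds away
right = f32(a + f32(b + c))   # a + (b + c) -> the addends combine first

print(left, right, abs(left - right))  # differ by 2**-23 ~ 1.2e-7
```

Here `left` is exactly 1.0 while `right` is 1.0 + 2**-23, i.e. a discrepancy of ~1.2e-7 from nothing but evaluation order -- consistent with the differences seen between the AVX and non-AVX runs.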
BTW what fraction of CPUs on the grid actually support AVX/AVX2? If it's still only a small fraction then it's probably not worth bothering anyhow...
On 7/12/19 2:33 PM, Huilin Qu wrote:
> BTW what fraction of CPUs on the grid actually support AVX/AVX2? If it's still only a small fraction then it's probably not worth bothering anyhow...
AVX is probably available on most.
I don't know detailed stats; my response was from the general perspective of CMSSW development with alternative arches. The present state requires a fixed best-available common set (sse3 or core2).
@slava77
OK -- so then for the specific problem with OpenBLAS, the solution would be to build it with `TARGET=CORE2 DYNAMIC_ARCH=0`, I suppose?
On 7/12/19 2:53 PM, Huilin Qu wrote:
> OK -- so then for the specific problem with OpenBLAS, the solution would be to build it with `TARGET=CORE2 DYNAMIC_ARCH=0`, I suppose?
Yes, at least for now.
@slava77 OK, I made the changes in https://github.com/cms-sw/cmsdist/pull/5091.
@slava77 @hqucms thank you. I agree with Slava that we should avoid architecture-dependent non-reproducibility as much as possible, especially while the gain looks relatively limited (in comparison with the original version of blas).
+1
Resolved in the build specs of OpenBLAS.
This issue is fully signed and ready to be closed.
This was first noticed as a difference in cms-sw/cmsdist#5063, but since then changes have occasionally been appearing in workflows that are not clearly related to the DeepAK8 tagger inputs or code.
@hqucms please take a look. I suggest running valgrind in a recent IB after the switch to openblas happened.