Open klust opened 7 years ago
@klust the attached zip is empty
I hope this one works better...
@klust Thank you for the excellent feedback!
The problem with mpirun has been fixed now in https://github.com/hpcugent/easybuild-framework/pull/2221, which will be part of EasyBuild v3.3.0.
I'll try and find time to look into your other suggestions/fixes, unless someone else beats me to it, of course.
It's good to know that you can come up with a setup in which all CP2K regression tests pass, we sort of always blamed the tests themselves for partially failing...
Note to self: the problem with -I -heap-arrays 64 (i.e. a missing argument to -I) occurs because no value is set for $(INTEL_INCF) used by the Makefile when FFTW is also being used together with Intel MKL. I'll look into fixing that.
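To illustrate the note above, here is a minimal Python sketch of how an empty include variable leaves a dangling -I on the command line, and one way to guard against it. The template string is an assumption modelled on the reported flags, not the actual Makefile rule, and the function names and the example path are hypothetical.

```python
def naive_flags(intel_incf):
    # Mimics a template like "-I$(INTEL_INCF) -heap-arrays 64" (assumed form):
    # with an empty value, -I swallows the next flag as its argument.
    return f"-I{intel_incf} -heap-arrays 64"


def guarded_flags(intel_incf):
    # Only emit -I when there is a directory to pass to it.
    flags = []
    if intel_incf:
        flags.append(f"-I{intel_incf}")
    flags.extend(["-heap-arrays", "64"])
    return " ".join(flags)


print(naive_flags(""))                       # reproduces the broken flag string
print(guarded_flags(""))                     # no dangling -I
print(guarded_flags("/opt/fftw/include"))    # hypothetical include directory
```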
@klust With the easyconfigs you provided I indeed also get 0 failed/wrong tests when installing on CentOS 7 & Intel Sandy Bridge, but on Intel Haswell I'm seeing 3 wrong tests (see below).
--------------------------------------------------------------------------
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
/tmp/build/CP2K/4.1/intel-2017a-FFTW/cp2k-4.1/regtesting/Linux-x86-64-intel-host/popt/TEST-Linux-x86-64-intel-host-popt-2017-10-25_10-41-59/Fist/regtest-1-4/water_atprop_ewald.inp.out :
POTENTIAL ENERGY : ref = 0.375664704477E-02 new = 0.375664704476E-02
relative error : 2.66189859e-12 > numerical tolerance = 1.0E-14
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
/tmp/build/CP2K/4.1/intel-2017a-FFTW/cp2k-4.1/regtesting/Linux-x86-64-intel-host/popt/TEST-Linux-x86-64-intel-host-popt-2017-10-25_10-41-59/Fist/regtest-4/ethene-no-restraint.inp.out :
POTENTIAL ENERGY : ref = 0.00080042617086399995 new = 0.800426170861E-03
relative error : 3.74798766e-12 > numerical tolerance = 2e-12
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
/tmp/build/CP2K/4.1/intel-2017a-FFTW/cp2k-4.1/regtesting/Linux-x86-64-intel-host/popt/TEST-Linux-x86-64-intel-host-popt-2017-10-25_10-41-59/Fist/regtest-4/ethene-ck-restraint.inp.out :
POTENTIAL ENERGY : ref = 0.00080042617086399995 new = 0.800426170861E-03
relative error : 3.74798766e-12 > numerical tolerance = 2e-12
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
--------------------------------------------------------------------------
...
--------------------------------- Summary --------------------------------
Number of FAILED tests 0
Number of WRONG tests 3
Number of CORRECT tests 2852
Number of NEW tests 10
Total number of tests 2865
Maybe the numeric tolerance is a bit strict for those, I can't really tell.
Can you provide some more details about the system(s) you tested on, for context?
I'm looking into the necessary changes to also achieve this type of test result through the CP2K easyblock, your feedback is really helpful there, thanks again!
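For context, the relative errors in the log above can be reproduced with a short Python sketch. It assumes the relative error is defined as |new - ref| / |ref|, which is consistent with the reported numbers but is not taken from the CP2K regtest script itself; the function names are hypothetical.

```python
def relative_error(ref, new):
    # Assumed definition: absolute difference scaled by the reference value.
    return abs(new - ref) / abs(ref)


def classify(ref, new, tolerance):
    # A test counts as WRONG when the relative error exceeds its tolerance.
    return "WRONG" if relative_error(ref, new) > tolerance else "CORRECT"


# (test name, reference energy, new energy, tolerance), copied from the log
cases = [
    ("water_atprop_ewald", 0.375664704477E-02, 0.375664704476E-02, 1.0E-14),
    ("ethene-no-restraint", 0.00080042617086399995, 0.800426170861E-03, 2e-12),
]

for name, ref, new, tol in cases:
    err = relative_error(ref, new)
    print(f"{name}: relative error {err:.3e}, tolerance {tol:g} -> {classify(ref, new, tol)}")
```

Both energies agree with their references to about 11 significant digits, so whether they count as wrong depends entirely on the per-test tolerance.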
@boegel I checked our results. When I submitted the request, I had only compiled on Hopper, an Ivy Bridge-based machine. Both on Scientific Linux 6 and CentOS 7 I get 0 wrong results. But when I tried to recompile on Leibniz, which was only after submitting this report as we have had lots of difficulties getting the cluster fully up, I actually got exactly the same 3 errors as you mentioned, and even exactly the same numbers in the error reports as you show above.
Since there were only three errors I haven't looked further into them so far, but my guess is that they may be due to higher round-off errors after some Haswell/Broadwell-specific optimizations by the compiler or some CPU-specific code in libraries.
So you have a 100% perfect reproduction of what we got.
It is probably worth testing if the problem still persists with less aggressive compiler optimization flags. In any case, I would not worry about these small errors.
The CP2K tests are not absolute. They only compare results between two compilations. If there is a difference, you don't know which is wrong or right. Even if all the numbers are the same, both could be wrong. Nobody ever checked if the tests produce numbers comparable to other simulation codes or simple analytic calculations (where possible). They are basically only an aid to detect issues when making a small change to the code or to the way it is compiled.
This could be worth a try: https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr
TL;DR: export MKL_CBWR=COMPATIBLE before running the tests.
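When the tests are launched from a script rather than an interactive shell, the same setting can be applied to the child process only. A minimal Python sketch, where the probe command is just a stand-in for the real regression-test command and the helper name is hypothetical:

```python
import os
import subprocess
import sys


def run_with_cbwr(cmd, mode="COMPATIBLE"):
    # Copy the current environment and set MKL_CBWR for the child process only,
    # so the calling shell or script is left untouched.
    env = dict(os.environ)
    env["MKL_CBWR"] = mode
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Stand-in command that just echoes the variable; replace it with the actual
# test invocation.
probe = [sys.executable, "-c", "import os; print(os.environ['MKL_CBWR'])"]
result = run_with_cbwr(probe)
print(result.stdout.strip())
```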
@tovrstra I just gave that a try, it does indeed dramatically decrease the number of failed tests on Intel Haswell (for CP2K 4.1 with intel/2017a, after some other fixes to the CP2K easyblock; now also trying with CP2K 5.1 with intel/2017b).
However, since that basically disables the use of AVX & beyond by Intel MKL, shouldn't this always be set when using CP2K with Intel MKL 11.x or more recent?
And if so, how does that affect performance?
This really boils down to performance vs. accuracy, and we need to be careful which one we sacrifice (most).
This is indeed a difficult question and I don't have a straight answer. I just noticed also the following:
# === MKL < 11.3 has an interface bug (use this as a workaround) ===
LIBS += -Wl,--start-group \
$(INTEL_MKL_LIB)/libmkl_intel_lp64.a \
$(INTEL_MKL_LIB)/libmkl_core.a \
$(INTEL_MKL_LIB)/libmkl_sequential.a \
-Wl,--end-group \
-lpthread -lm -ldl
# === MKL < 11.3 'Link Line Advisor' ===
#LIBS += -Wl,--start-group \
# $(INTEL_MKL_LIB)/libmkl_gf_lp64.a \
# $(INTEL_MKL_LIB)/libmkl_core.a \
# $(INTEL_MKL_LIB)/libmkl_sequential.a \
# -Wl,--end-group \
# -lpthread -lm
on https://dashboard.cp2k.org/archive/gcc492-mkl1121-sopt/rev_18111.txt (linked from https://dashboard.cp2k.org/archive/gcc492-mkl1121-sopt/index.html). Could this be related?
@tovrstra EasyBuild correctly uses the combination of -lmkl_intel_lp64 -lmkl_sequential -lmkl_core already (and I suspect we would run into hard compilation failures if we weren't).
OK. The MKL_CBWR=COMPATIBLE setting is probably only useful for extremely detailed comparisons. If it really matters for a calculation, I would say there is a bug in the code. Useful algorithms should be robust enough to handle small numerical noise issues. That said, CP2K may contain a few of these bugs. :(
One goal is to validate important/complete HPC codes for every compiler release. I am working on it, but until then I can only recommend compiler releases that are known to work: http://xconfigure.readthedocs.io/cp2k/README/#sanity-check, i.e. I did not (successfully) validate the Intel 2018 suite.
TODO: enhance the CP2K easyblock to automatically pick the right DFLAGS based on how Libint was configured...
There are several issues with CP2K as compiled using the cp2k.py easyblock and associated easyconfigs.
This made me switch to manual compiles, using an arch file that I adapted based on information I got from Andreas Glöss, which made me bump into two other issues:
To help in reconstructing what I have done, I include:
I have neither enough knowledge of Python and the innards of EasyBuild nor the time to implement these ideas in the CP2K and Libint easyblocks and easyconfigs myself, beyond what I have done above.
CP2K-experiments-UA.zip