easybuilders / easybuild-easyblocks

Collection of easyblocks that implement support for building and installing software with EasyBuild.
https://easybuild.io
GNU General Public License v2.0

CP2K fails with recent Intel compilers. #1174

Open klust opened 7 years ago

klust commented 7 years ago

There are several issues with CP2K as compiled using the cp2k.py easyblock and associated easyconfigs.

This made me switch to manual compiles, using an arch file that I adapted based on information I got from Andreas Glöss, which in turn made me run into two other issues:

To help in reconstructing what I have done, I include:

I have neither enough knowledge of Python and the innards of EasyBuild nor the time to implement these ideas in the CP2K and Libint easyblocks and easyconfigs myself, beyond what I have done above.

CP2K-experiments-UA.zip

wpoely86 commented 7 years ago

@klust the attached zip is empty

klust commented 7 years ago

I hope this one works better...

CP2K-experiments-UA.zip

boegel commented 7 years ago

@klust Thank you for the excellent feedback!

The problem with mpirun has been fixed now in https://github.com/hpcugent/easybuild-framework/pull/2221, which will be part of EasyBuild v3.3.0.

I'll try and find time to look into your other suggestions/fixes, unless someone else beats me to it, of course.

It's good to know that you can come up with a setup in which all CP2K regression tests pass; we sort of always blamed the tests themselves for partially failing...

boegel commented 6 years ago

Note to self: the problem with -I -heap-arrays 64 (i.e. a missing argument to -I) occurs because no value is set for $(INTEL_INCF) used by the Makefile when FFTW is also being used together with Intel MKL. I'll look into fixing that.
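To illustrate how an unset variable can produce a dangling `-I`, here is a minimal sketch. It assumes the flag is written as `-I$(INTEL_INCF)` ahead of `-heap-arrays 64`; the exact Makefile line is not shown in this thread, and the function names and the include path below are hypothetical, not part of the CP2K easyblock.

```python
def naive_flags(intel_incf):
    """Mimic a plain textual expansion of '-I$(INTEL_INCF) -heap-arrays 64'."""
    return "-I%s -heap-arrays 64" % intel_incf


def guarded_flags(intel_incf):
    """Only emit '-I<dir>' when a directory is actually set."""
    parts = []
    inc_dir = intel_incf.strip()
    if inc_dir:
        parts.append("-I" + inc_dir)
    parts.extend(["-heap-arrays", "64"])
    return " ".join(parts)


if __name__ == "__main__":
    # With an empty value, '-I' is left without an argument.
    print(naive_flags(""))
    # The guarded variant drops the option altogether.
    print(guarded_flags(""))
    # Hypothetical include directory, for illustration only.
    print(guarded_flags("/path/to/mkl/include/fftw"))
```

Either setting a value for the variable or omitting the option when it is empty would avoid the broken command line; which of the two fits the easyblock better is left open here.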

boegel commented 6 years ago

@klust With the easyconfigs you provided I indeed also get 0 failed/wrong tests when installing on CentOS 7 & Intel Sandy Bridge, but on Intel Haswell I'm seeing 3 wrong tests (see below).

 --------------------------------------------------------------------------
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 /tmp/build/CP2K/4.1/intel-2017a-FFTW/cp2k-4.1/regtesting/Linux-x86-64-intel-host/popt/TEST-Linux-x86-64-intel-host-popt-2017-10-25_10-41-59/Fist/regtest-1-4/water_atprop_ewald.inp.out :
  POTENTIAL ENERGY : ref = 0.375664704477E-02 new = 0.375664704476E-02
  relative error :   2.66189859e-12 >  numerical tolerance = 1.0E-14
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 /tmp/build/CP2K/4.1/intel-2017a-FFTW/cp2k-4.1/regtesting/Linux-x86-64-intel-host/popt/TEST-Linux-x86-64-intel-host-popt-2017-10-25_10-41-59/Fist/regtest-4/ethene-no-restraint.inp.out :
  POTENTIAL ENERGY : ref = 0.00080042617086399995 new = 0.800426170861E-03
  relative error :   3.74798766e-12 >  numerical tolerance = 2e-12
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 /tmp/build/CP2K/4.1/intel-2017a-FFTW/cp2k-4.1/regtesting/Linux-x86-64-intel-host/popt/TEST-Linux-x86-64-intel-host-popt-2017-10-25_10-41-59/Fist/regtest-4/ethene-ck-restraint.inp.out :
  POTENTIAL ENERGY : ref = 0.00080042617086399995 new = 0.800426170861E-03
  relative error :   3.74798766e-12 >  numerical tolerance = 2e-12
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 --------------------------------------------------------------------------
  ...
 --------------------------------- Summary --------------------------------
 Number of FAILED  tests 0
 Number of WRONG   tests 3
 Number of CORRECT tests 2852
 Number of NEW     tests 10
 Total number of   tests 2865

Maybe the numeric tolerance is a bit strict for those, I can't really tell.
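For reference, the numbers in the report above are consistent with a plain relative-error check. The sketch below assumes the error is computed as |ref - new| / |ref|; the actual formula in the CP2K regtest script may differ, and the function names are hypothetical.

```python
def relative_error(ref, new):
    """Relative deviation of 'new' from 'ref' (assumed formula, see lead-in)."""
    if ref == 0.0:
        return abs(new)
    return abs(ref - new) / abs(ref)


def is_wrong(ref, new, tol):
    """A test counts as WRONG when the relative error exceeds its tolerance."""
    return relative_error(ref, new) > tol


if __name__ == "__main__":
    # Values copied from the water_atprop_ewald.inp report above.
    ref, new, tol = 0.375664704477e-02, 0.375664704476e-02, 1.0e-14
    print(relative_error(ref, new), is_wrong(ref, new, tol))

    # Values copied from the ethene-no-restraint.inp report above.
    ref, new, tol = 0.00080042617086399995, 0.800426170861e-03, 2e-12
    print(relative_error(ref, new), is_wrong(ref, new, tol))
```

In the first case the two values differ only in the last printed digit, so with a tolerance of 1.0E-14 any change in that digit is flagged.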

Can you provide some more details about the system(s) you tested on, for context?

I'm looking into the necessary changes to also achieve this type of test result through the CP2K easyblock; your feedback is really helpful there, thanks again!

klust commented 6 years ago

@boegel I checked our results. When I submitted the request, I had only compiled on Hopper, an Ivy Bridge-based machine. Both on Scientific Linux 6 and CentOS 7 I get 0 wrong results. But when I tried to recompile on Leibniz, which was only after submitting this report as we have had lots of difficulties getting the cluster fully up, I also got exactly the same 3 errors as you mentioned, and even exactly the same numbers in the error reports as you show above.

Since there were only three errors I haven't looked further into them so far, but my guess is that they may be due to higher round-off errors after some haswell/broadwell-specific optimizations by the compiler or some CPU-specific code in libraries.

So you have a 100% perfect reproduction of what we got.

tovrstra commented 6 years ago

It is probably worth testing if the problem still persists with less aggressive compiler optimization flags. In any case, I would not worry about these small errors.

The CP2K tests are not absolute. They only compare results between two compilations. If there is a difference, you don't know which is wrong or right. Even if all the numbers are the same, both could be wrong. Nobody ever checked if the tests produce numbers comparable to other simulation codes or simple analytic calculations (where possible). They are basically only an aid to detect issues when making a small change to the code or to the way it is compiled.

tovrstra commented 6 years ago

This could be worth a try: https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr

TL;DR export MKL_CBWR=COMPATIBLE before running the tests.
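When the tests are launched from a script rather than an interactive shell, the same setting can be passed through the child process environment. This is only a sketch: the helper names are hypothetical, and the actual regression test command is not shown in this thread, so a trivial stand-in command is used.

```python
import os
import subprocess
import sys


def cnr_environment(base_env=None, mode="COMPATIBLE"):
    """Return a copy of the environment with MKL_CBWR set to 'mode'."""
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_CBWR"] = mode
    return env


def run_with_cnr(cmd):
    """Run 'cmd' (a list of arguments) with MKL_CBWR=COMPATIBLE in its environment."""
    return subprocess.run(cmd, env=cnr_environment(), check=True,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in for the real test command: print the variable as the child sees it.
    result = run_with_cnr([sys.executable, "-c",
                           "import os; print(os.environ['MKL_CBWR'])"])
    print(result.stdout.strip())
```

Setting the variable only for the test run, instead of exporting it globally, keeps it from affecting other jobs started from the same shell.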

boegel commented 6 years ago

@tovrstra I just gave that a try, it does indeed dramatically decrease the number of failed tests on Intel Haswell (for CP2K 4.1 with intel/2017a, after some other fixes to the CP2K easyblock, now also trying with CP2K 5.1 with intel/2017b).

However, since that basically disables the use of AVX & beyond by Intel MKL, shouldn't this always be set when using CP2K with Intel MKL 11.x or more recent? And if so, how does that affect performance?

This really boils down to performance vs. accuracy, and we need to be careful which one we sacrifice (most).

tovrstra commented 6 years ago

This is indeed a difficult question and I don't have a straight answer. I also just noticed the following:

# === MKL < 11.3 has an interface bug (use this as a workaround) ===
LIBS    += -Wl,--start-group \
             $(INTEL_MKL_LIB)/libmkl_intel_lp64.a \
             $(INTEL_MKL_LIB)/libmkl_core.a \
             $(INTEL_MKL_LIB)/libmkl_sequential.a \
           -Wl,--end-group \
           -lpthread -lm -ldl
# === MKL < 11.3 'Link Line Advisor' ===
#LIBS    += -Wl,--start-group \
#             $(INTEL_MKL_LIB)/libmkl_gf_lp64.a \
#             $(INTEL_MKL_LIB)/libmkl_core.a \
#             $(INTEL_MKL_LIB)/libmkl_sequential.a \
#           -Wl,--end-group \
#           -lpthread -lm

on https://dashboard.cp2k.org/archive/gcc492-mkl1121-sopt/rev_18111.txt (linked from https://dashboard.cp2k.org/archive/gcc492-mkl1121-sopt/index.html) Could this be related?

boegel commented 6 years ago

@tovrstra EasyBuild correctly uses the combination of -lmkl_intel_lp64 -lmkl_sequential -lmkl_core already (and I suspect we would run into hard compilation failures if we weren't).

tovrstra commented 6 years ago

ok. The MKL_CBWR=COMPATIBLE setting is probably only useful for extremely detailed comparisons. If it really matters for a calculation, I would say there is a bug in the code. Useful algorithms should be robust enough to handle small numerical noise issues. That said, CP2K may contain a few of these bugs. :(

hfp commented 6 years ago

One goal is to validate important/complete HPC codes for every compiler release. I am working on it, but until then I can only recommend compiler releases that are known to work: http://xconfigure.readthedocs.io/cp2k/README/#sanity-check, i.e. I did not (successfully) validate the Intel 2018 suite.

boegel commented 6 years ago

TODO: enhance CP2K easyblock to automatically pick right DFLAGS based on how Libint was configured...