CP2K fails with recent Intel compilers.

klust commented 7 years ago

There are several issues with CP2K as compiled using the cp2k.py easyblock and associated easyconfigs.

Recent Intel compilers have an issue with FFT in MKL that prevents CP2K from running properly in many cases as soon as more than one MPI rank is used. This issue has been confirmed by Andreas Glöss, one of the CP2K authors (see https://groups.google.com/d/msg/cp2k/A2Rf79443D4/wjYK6lBbAQAJ), who suggests using FFTW3 instead. The issue seems to have existed since the 2016b Intel toolchain.
After adding the FFTW module that I compiled with a pretty standard easyconfig to an otherwise pretty standard CP2K eb file, I almost immediately get an error when building CP2K. The compile command contains "-I -heap-arrays 64" (copied directly from the log file), i.e. a -I without a matching directory. As a result, -heap-arrays is taken as the directory for -I and the compiler complains that it cannot find the file 64. I suspect a test on line 444 of the cp2k.py easyblock is at fault, but I am not sure about this. (Note: this is definitely not the same issue as the closed issue #1042; I also checked with EasyBuild 3.2.0.)

This made me switch to manual compiles, using an arch file that I adapted based on information I got from Andreas Glöss. That made me run into two other issues:

The standard regression test script uses mpiexec, but on our cluster this should be mpirun or mpiexec.hydra, as MPD does not run anymore. This explains why sometimes all regression tests fail, something I have also seen on at least one other VSC cluster, so it is not linked to our setup alone. It turns out that the script that performs the regression tests has a command line option (-mpiexec) to set the MPI starter, and this can also be passed via the TESTOPTS variable of the makefile that triggers the testing. I haven't checked whether I can do that through an easyconfig for the EB_CP2K easyblock, but it definitely did the trick and significantly improved the test results for my manual builds.
Adding additional components one by one (I used libxsmm, libxc and Libint), I learned that the Libint that I compiled with a pretty standard EasyConfig did not work properly with CP2K, causing several failed regression tests. I recompiled Libint according to instructions I found in the Linux-x86-64-Intel-mic.psmp arch file that I had modified for my experiments with FFTW, and that solved all problems. The result was a popt executable containing FFTW, libxsmm, libxc 3 and Libint 1.1.6 that passed all regression tests (2855 correct and 10 new tests, no failed or wrong tests). So it looks like other options are needed in the easyconfig and/or easyblock of Libint to work properly with CP2K.
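
As an illustration of the first point, the MPI starter override via TESTOPTS could be sketched in Python as below. This is only a sketch: the helper name build_regtest_command is hypothetical, the ARCH and VERSION values are simply the ones that appear elsewhere in this thread, and the exact make invocation should be checked against the CP2K makefile actually in use.

```python
import shlex


def build_regtest_command(arch, version, mpi_starter):
    """Return a 'make test' argument list that hands the MPI starter to the
    regression test script via TESTOPTS (hypothetical helper)."""
    # -mpiexec is the option of the regression test script mentioned above.
    testopts = "-mpiexec " + shlex.quote(mpi_starter)
    return [
        "make",
        "ARCH=" + arch,
        "VERSION=" + version,
        "TESTOPTS=" + testopts,
        "test",
    ]


cmd = build_regtest_command("Linux-x86-64-intel-host", "popt", "mpirun")
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list keeps the whole TESTOPTS value in a single argument, so the space inside it does not need extra shell quoting when the list is passed to a process launcher.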

To help in reconstructing what I have done, I include:

The arch file that I modified: Linux-x86-64-intel-mic.psmp. I used this file via Linux-x86-64-intel-host.popt which includes this file. As I only tested a few limited popt-based configurations, I do not claim that this file is without errors!
A MakeCp-based easyconfig that reconstructs the compile process I did for CP2K.
A ConfigureMake-based easyconfig that automates the way I compiled Libint.

I have neither enough knowledge of Python and the innards of EasyBuild nor the time to implement these ideas in the CP2K and Libint easyblocks and easyconfigs myself beyond what I have done above.

CP2K-experiments-UA.zip

wpoely86 commented 7 years ago

@klust the attached zip is empty

klust commented 7 years ago

I hope this one works better...

CP2K-experiments-UA.zip

boegel commented 7 years ago

@klust Thank you for the excellent feedback!

The problem with mpirun has been fixed now in https://github.com/hpcugent/easybuild-framework/pull/2221, which will be part of EasyBuild v3.3.0.

I'll try and find time to look into your other suggestions/fixes, unless someone else beats me to it, of course.

It's good to know that you can come up with a setup in which all CP2K regression tests pass; we sort of always blamed the tests themselves for partially failing...

boegel commented 6 years ago

Note to self: the problem with -I -heap-arrays 64 (i.e. a missing argument to -I) occurs because no value is set for $(INTEL_INCF) used by the Makefile when FFTW is also being used together with Intel MKL. I'll look into fixing that.
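
The failure mode can be reproduced with a small Python sketch. It is not the easyblock's actual code and not necessarily the eventual fix: the function names and the example directory are hypothetical, and only the "-I -heap-arrays 64" string comes from the build log quoted in this issue.

```python
def naive_fcflags(intel_incf):
    """Mimic plain variable substitution: an empty value leaves a bare -I."""
    return "-I{} -heap-arrays 64".format(intel_incf)


def guarded_fcflags(intel_incf):
    """Only emit -I when there is a directory to go with it."""
    flags = []
    include_dir = intel_incf.strip()
    if include_dir:
        flags.append("-I" + include_dir)
    flags.extend(["-heap-arrays", "64"])
    return " ".join(flags)


# With an unset (empty) value, the naive form swallows the next option.
print(naive_fcflags(""))
print(guarded_fcflags(""))
# Hypothetical include directory, for illustration only.
print(guarded_fcflags("/path/to/fftw/include"))
```

An alternative to dropping the flag would be to always give the variable a valid directory; which of the two is appropriate depends on how the Makefile uses the value.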

boegel commented 6 years ago

@klust With the easyconfigs you provided I indeed also get 0 failed/wrong tests when installing on CentOS 7 & Intel Sandy Bridge, but on Intel Haswell I'm seeing 3 wrong tests (see below).

 --------------------------------------------------------------------------
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 /tmp/build/CP2K/4.1/intel-2017a-FFTW/cp2k-4.1/regtesting/Linux-x86-64-intel-host/popt/TEST-Linux-x86-64-intel-host-popt-2017-10-25_10-41-59/Fist/regtest-1-4/water_atprop_ewald.inp.out :
  POTENTIAL ENERGY : ref = 0.375664704477E-02 new = 0.375664704476E-02
  relative error :   2.66189859e-12 >  numerical tolerance = 1.0E-14
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 /tmp/build/CP2K/4.1/intel-2017a-FFTW/cp2k-4.1/regtesting/Linux-x86-64-intel-host/popt/TEST-Linux-x86-64-intel-host-popt-2017-10-25_10-41-59/Fist/regtest-4/ethene-no-restraint.inp.out :
  POTENTIAL ENERGY : ref = 0.00080042617086399995 new = 0.800426170861E-03
  relative error :   3.74798766e-12 >  numerical tolerance = 2e-12
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 /tmp/build/CP2K/4.1/intel-2017a-FFTW/cp2k-4.1/regtesting/Linux-x86-64-intel-host/popt/TEST-Linux-x86-64-intel-host-popt-2017-10-25_10-41-59/Fist/regtest-4/ethene-ck-restraint.inp.out :
  POTENTIAL ENERGY : ref = 0.00080042617086399995 new = 0.800426170861E-03
  relative error :   3.74798766e-12 >  numerical tolerance = 2e-12
 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 --------------------------------------------------------------------------
  ...
 --------------------------------- Summary --------------------------------
 Number of FAILED  tests 0
 Number of WRONG   tests 3
 Number of CORRECT tests 2852
 Number of NEW     tests 10
 Total number of   tests 2865

Maybe the numeric tolerance is a bit strict for those, I can't really tell.
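
To put the three WRONG results in perspective, the relative errors can be recomputed from the reference and new values in the report. The sketch below assumes the usual definition |new - ref| / |ref|, which may not be exactly what the regression test script does, and uses decimal arithmetic so the printed digits are taken at face value. The results can therefore differ slightly in the trailing digits from those in the log.

```python
from decimal import Decimal


def relative_error(ref, new):
    """Return |new - ref| / |ref| for two values given as decimal strings."""
    ref_d, new_d = Decimal(ref), Decimal(new)
    return abs(new_d - ref_d) / abs(ref_d)


# (reference, new value, tolerance) copied from the report above; the two
# ethene tests share the same numbers, so they are listed once.
cases = [
    ("0.375664704477E-02", "0.375664704476E-02", "1.0E-14"),
    ("0.00080042617086399995", "0.800426170861E-03", "2e-12"),
]

for ref, new, tol in cases:
    err = relative_error(ref, new)
    verdict = "WRONG" if err > Decimal(tol) else "CORRECT"
    print("{:.3e} vs tolerance {} -> {}".format(err, tol, verdict))
```

In both cases the values differ only in the last printed digits of the potential energy, yet the first test is held to a tolerance of 1.0E-14, which is tighter than the precision of the printed reference value itself.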

Can you provide some more details about the system(s) you tested on, for context?

I'm looking into the necessary changes to also achieve this type of test results through the CP2K easyblock; your feedback is really helpful there, thanks again!

klust commented 6 years ago

@boegel I checked our results. When I submitted the request, I had only compiled on Hopper, an Ivy Bridge-based machine. Both on Scientific Linux 6 and CentOS 7 I get 0 wrong results. But when I tried to recompile on Leibniz, which was only after submitting this report as we have had lots of difficulties getting the cluster fully up, I got exactly the same 3 errors as you mentioned, with exactly the same numbers in the error reports as you show above.

Since there were only three errors I haven't looked further into them so far, but my guess is that they may be due to higher round-off errors after some Haswell/Broadwell-specific optimizations by the compiler or some CPU-specific code in libraries.

So you have a 100% perfect reproduction of what we got.

tovrstra commented 6 years ago

It is probably worth testing if the problem still persists with less aggressive compiler optimization flags. In any case, I would not worry about these small errors.

The CP2K tests are not absolute. They only compare results between two compilations. If there is a difference, you don't know which is wrong or right. Even if all the numbers are the same, both could be wrong. Nobody ever checked if the tests produce numbers comparable to other simulation codes or simple analytic calculations (where possible). They are basically only an aid to detect issues when making a small change to the code or to the way it is compiled.

tovrstra commented 6 years ago

This could be worth a try: https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr

TL;DR export MKL_CBWR=COMPATIBLE before running the tests.
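
When the tests are launched from a script rather than an interactive shell, the same setting can be applied to the child process only. The following is a minimal Python sketch: the helper name run_with_cnr is hypothetical, and the probe command merely stands in for the real test invocation.

```python
import os
import subprocess
import sys


def run_with_cnr(command):
    """Run a command with MKL_CBWR=COMPATIBLE added to a copy of the
    current environment, leaving the parent environment untouched."""
    env = dict(os.environ, MKL_CBWR="COMPATIBLE")
    return subprocess.run(
        command,
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )


# Stand-in for the real test command: a child process that echoes the setting.
probe = [sys.executable, "-c", "import os; print(os.environ['MKL_CBWR'])"]
result = run_with_cnr(probe)
print(result.stdout.strip())
```

Setting the variable per process makes it easy to run the same tests with and without the setting and compare both the results and the run time.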

boegel commented 6 years ago

@tovrstra I just gave that a try, it does indeed dramatically decrease the number of failed tests on Intel Haswell (for CP2K 4.1 with intel/2017a, after some other fixes to the CP2K easyblock, now also trying with CP2K 5.1 with intel/2017b).

However, since that basically disables the use of AVX & beyond by Intel MKL, shouldn't this always be set when using CP2K with Intel MKL 11.x or more recent? And if so, how does that affect performance?

This really boils down to performance vs. accuracy, and we need to be careful which one we sacrifice (most).

tovrstra commented 6 years ago

This is indeed a difficult question and I don't have a straight answer. I just noticed also the following:

# === MKL < 11.3 has an interface bug (use this as a workaround) ===
LIBS    += -Wl,--start-group \
             $(INTEL_MKL_LIB)/libmkl_intel_lp64.a \
             $(INTEL_MKL_LIB)/libmkl_core.a \
             $(INTEL_MKL_LIB)/libmkl_sequential.a \
           -Wl,--end-group \
           -lpthread -lm -ldl
# === MKL < 11.3 'Link Line Advisor' ===
#LIBS    += -Wl,--start-group \
#             $(INTEL_MKL_LIB)/libmkl_gf_lp64.a \
#             $(INTEL_MKL_LIB)/libmkl_core.a \
#             $(INTEL_MKL_LIB)/libmkl_sequential.a \
#           -Wl,--end-group \
#           -lpthread -lm

on https://dashboard.cp2k.org/archive/gcc492-mkl1121-sopt/rev_18111.txt (linked from https://dashboard.cp2k.org/archive/gcc492-mkl1121-sopt/index.html) Could this be related?

boegel commented 6 years ago

@tovrstra EasyBuild correctly uses the combination of -lmkl_intel_lp64 -lmkl_sequential -lmkl_core already (and I suspect we would run into hard compilation failures if we weren't).

tovrstra commented 6 years ago

ok. The MKL_CBWR=COMPATIBLE setting is probably only useful for extremely detailed comparisons. If it really matters for a calculation, I would say there is a bug in the code. Useful algorithms should be robust enough to handle small numerical noise issues. That said, CP2K may contain a few of these bugs. :(

hfp commented 6 years ago

One goal is to validate important/complete HPC codes for every compiler release. I am working on it, but until then I can only recommend compiler releases that are known to work: http://xconfigure.readthedocs.io/cp2k/README/#sanity-check, i.e. I did not (successfully) validate the Intel 2018 suite.

boegel commented 6 years ago

TODO: enhance CP2K easyblock to automatically pick right DFLAGS based on how Libint was configured...

easybuilders / easybuild-easyblocks

CP2K fails with recent Intel compilers. #1174