UCL-RITS / rcps-buildscripts

Scripts to automate package builds on RC Platforms
MIT License

Update request: PLUMED 2.4.0 on GROMACS 2016.4 (IN:02766079) #136

owainkenwayucl closed 6 years ago

owainkenwayucl commented 6 years ago

A user has requested an install of the new version of PLUMED (2.4.0) and a new version of GROMACS (2016.4) on Legion, Grace, Thomas.

heatherkellyucl commented 6 years ago

Flags look the same for PLUMED 2.4.0 as for 2.3.1, so starting a test build with the existing script:

VERSION=2.4.0 SHA1=25242eb66f4a8fbb4bff66745fbf927b7f4cd32e ./plumed-2.3.1_install
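For reference, a minimal sketch of the kind of checksum check that a SHA1 override like this implies. The function name verify_sha1 and the archive name in the usage comment are assumptions for illustration, not taken from the install script, and the sketch assumes GNU coreutils' sha1sum is on the PATH.

```shell
# Verify a downloaded archive against an expected SHA-1 before building.
# Usage: verify_sha1 <file> <expected-sha1>
# (verify_sha1 is a hypothetical helper, not part of the build scripts.)
verify_sha1() {
    file=$1
    expected=$2
    # sha1sum prints "<hash>  <file>"; keep only the hash field.
    actual=$(sha1sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        return 0
    fi
    echo "SHA1 mismatch for $file: expected $expected, got $actual" >&2
    return 1
}

# Example call (the archive name is hypothetical):
#   verify_sha1 plumed-2.4.0.tgz 25242eb66f4a8fbb4bff66745fbf927b7f4cd32e || exit 1
```

Failing early on a mismatch avoids spending a long build and test run on a corrupted or wrong-version download.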

heatherkellyucl commented 6 years ago

+ ERROR in test analysis/rt-pca/
+ check file analysis/rt-pca/report.txt for more information
+ ERROR in test basic/rt63c/
+ check file basic/rt63c/report.txt for more information
+ ERROR in test basic/rt63c-mpi/
+ check file basic/rt63c-mpi/report.txt for more information
+ ERROR in test basic/rt63d/
+ check file basic/rt63d/report.txt for more information
+ ERROR in test basic/rt64-pca/
+ check file basic/rt64-pca/report.txt for more information
+ ERROR in test basic/rt65-rmsd2/
+ check file basic/rt65-rmsd2/report.txt for more information
+ ERROR in test basic/rt-close-structure/
+ check file basic/rt-close-structure/report.txt for more information
+ ERROR in test basic/rt-multi-1/
+ check file basic/rt-multi-1/report.txt for more information
+ ERROR in test crystallization/rt-sean-marks/
+ check file crystallization/rt-sean-marks/report.txt for more information
+ ERROR in test isdb/rt-emmi/
+ check file isdb/rt-emmi/report.txt for more information
+ ERROR in test isdb/rt-jcouplings/
+ check file isdb/rt-jcouplings/report.txt for more information
+ ERROR in test isdb/rt-jcouplings-mi/
+ check file isdb/rt-jcouplings-mi/report.txt for more information
+ ERROR in test mapping/rt-pathtools-2/
+ check file mapping/rt-pathtools-2/report.txt for more information
+ ERROR in test mapping/rt-pathtools-3/
+ check file mapping/rt-pathtools-3/report.txt for more information
+ ERROR in test mapping/rt-pca/
+ check file mapping/rt-pca/report.txt for more information
+ ERROR in test mapping/rt-pca-multi/
+ check file mapping/rt-pca-multi/report.txt for more information
+ ERROR in test mapping/rt-tpath/
+ check file mapping/rt-tpath/report.txt for more information
+++++++++++++++++++++++++++++++++++++++++++++++++++++
+ Final report:
+ 279 tests performed, 122 tests not applicable
+ 17 errors found
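
With this many failures it is easier to pull the test names out of the saved output than to read them off by eye. A minimal sketch, assuming the regtest output was saved to a file (the name regtest.log is hypothetical) and that failures are reported on lines of the form shown above:

```shell
# Print the names of failed regtests from a saved regtest log, one per line.
# Matches lines such as "+ ERROR in test basic/rt63c/" and tolerates
# trailing whitespace after the final slash.
# (list_failed_tests is a hypothetical helper name.)
list_failed_tests() {
    sed -n 's|^+ ERROR in test \(.*\)/[[:space:]]*$|\1|p' "$1" | sort
}

# Example call (regtest.log is a hypothetical file name):
#   list_failed_tests regtest.log
```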

Most of the errors were smallish numeric differences, but this one (mapping/rt-tpath) had NaNs:

Thu 18 Jan 12:14:20 GMT 2018
Running regtest in /dev/shm/tmp.e44QyjOPnG/plumed2-2.4.0/regtest/mapping/rt-tpath
++ Test type: driver
++ Arguments: --plumed plumed.dat --trajectory-stride 50 --timestep 0.005 --ixyz diala_traj_nm.xyz --dump-forces forces --dump-forces-fmt=%10.6f
++ Processors: 0
/dev/shm/tmp.e44QyjOPnG/plumed2-2.4.0/regtest/mapping/rt-tpath/tmp
FAILURE
Diff for colvar:
2,547c2,547
<  0.000000  21.4988   0.0807   1.9191   0.0115
<  0.250000  20.9438   0.0226   0.2895   0.0084
<  0.500000  20.6238   0.0050   0.7852   0.0072
<  0.750000  20.5725   0.0067   0.6334   0.0051
<  1.000000  21.6318   0.0343   1.3683   0.0083

...

<  135.750000  39.9775   0.0226  41.7694   0.0055
<  136.000000  40.1872   0.0060  42.4231   0.0080
<  136.250000  39.2643   0.0889  42.6489   0.0073
---
>  0.000000  21.4988   0.0807   1.5201     -nan
>  0.250000  20.9438   0.0226   1.4708     -nan
>  0.500000  20.6238   0.0050  -2.5346     -nan
>  0.750000  20.5725   0.0067  -0.4449     -nan
>  1.000000  21.6318   0.0343   1.4984     -nan

...

>  135.750000  39.9775   0.0226  41.5095     -nan
>  136.000000  40.1872   0.0060  50.2911     -nan
>  136.250000  39.2643   0.0889  52.6817     -nan

Will start by adding -fp-model strict.
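
To separate the reports that contain NaNs from those with only small numeric differences, something like the following could be run against the regtest directory. This is a sketch only: reports_with_nan is a made-up helper name, and it assumes each test writes a report.txt as in the output above and that grep supports -w (GNU and BSD grep do).

```shell
# List report.txt files under a regtest directory that contain NaN values.
# Usage: reports_with_nan <regtest-dir>
# (reports_with_nan is a hypothetical helper name.)
reports_with_nan() {
    # -l prints only file names, -i matches nan/NaN/-nan, and -w avoids
    # matching "nan" inside a longer word.
    find "$1" -type f -name report.txt -exec grep -liw 'nan' {} + | sort
}

# Example call, using the build directory from the log above:
#   reports_with_nan /dev/shm/tmp.e44QyjOPnG/plumed2-2.4.0/regtest
```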

heatherkellyucl commented 6 years ago

That only fixed 4 errors (17 down to 13).

+ ERROR in test analysis/rt-pca/
+ check file analysis/rt-pca/report.txt for more information
+ ERROR in test basic/rt63c/
+ check file basic/rt63c/report.txt for more information
+ ERROR in test basic/rt63c-mpi/
+ check file basic/rt63c-mpi/report.txt for more information
+ ERROR in test basic/rt63d/
+ check file basic/rt63d/report.txt for more information
+ ERROR in test basic/rt64-pca/
+ check file basic/rt64-pca/report.txt for more information
+ ERROR in test basic/rt65-rmsd2/
+ check file basic/rt65-rmsd2/report.txt for more information
+ ERROR in test basic/rt-close-structure/
+ check file basic/rt-close-structure/report.txt for more information
+ ERROR in test isdb/rt-emmi/
+ check file isdb/rt-emmi/report.txt for more information
+ ERROR in test mapping/rt-pathtools-2/
+ check file mapping/rt-pathtools-2/report.txt for more information
+ ERROR in test mapping/rt-pathtools-3/
+ check file mapping/rt-pathtools-3/report.txt for more information
+ ERROR in test mapping/rt-pca/
+ check file mapping/rt-pca/report.txt for more information
+ ERROR in test mapping/rt-pca-multi/
+ check file mapping/rt-pca-multi/report.txt for more information
+ ERROR in test mapping/rt-tpath/
+ check file mapping/rt-tpath/report.txt for more information
+++++++++++++++++++++++++++++++++++++++++++++++++++++
+ Final report:
+ 279 tests performed, 122 tests not applicable
+ 13 errors found

mapping/rt-tpath/report.txt still has NaNs.
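
To see exactly which tests a flag change fixed, the failure lists from two runs can be compared mechanically. A minimal sketch, assuming each run's output was saved to its own file; the helper names and the log file names in the usage comment are hypothetical.

```shell
# Extract the sorted, de-duplicated names of failed tests from a regtest log.
failed_names() {
    sed -n 's|^+ ERROR in test \(.*\)/[[:space:]]*$|\1|p' "$1" | sort -u
}

# Print tests that failed in the first log but not in the second.
# Usage: fixed_between <before.log> <after.log>
fixed_between() {
    before=$(mktemp) && after=$(mktemp) || return 1
    failed_names "$1" > "$before"
    failed_names "$2" > "$after"
    # comm -23 keeps only lines unique to the first (sorted) input.
    comm -23 "$before" "$after"
    rm -f "$before" "$after"
}

# Example call (file names are hypothetical):
#   fixed_between regtest-default.log regtest-fp-strict.log
```

Applied to the two reports in this thread, this would list basic/rt-multi-1, crystallization/rt-sean-marks, isdb/rt-jcouplings and isdb/rt-jcouplings-mi.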

heatherkellyucl commented 6 years ago

2.4.0 uses C++11. I'm wondering if our gcc is too old.

heatherkellyucl commented 6 years ago

It thinks gcc 4.8.1 and intel 15 and up should be sufficient.
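
A quick way to check a loaded compiler against that minimum is sketched below. It assumes GNU sort with -V (version sort); version_ge is a made-up helper name, and the 4.8.1 threshold is the one quoted above.

```shell
# Return success if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
# (version_ge is a hypothetical helper name.)
version_ge() {
    # The smaller of the two versions sorts first; $1 >= $2 exactly when
    # that smallest value is $2.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example call:
#   version_ge "$(gcc -dumpversion)" 4.8.1 || echo "gcc too old for C++11" >&2
```

Note that newer gcc releases print only the major version for -dumpversion, which still compares correctly against 4.8.1 here.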

heatherkellyucl commented 6 years ago

Out of curiosity, I'm trying it on Grace to see if the results are the same (AVX).

heatherkellyucl commented 6 years ago

Also trying one on Legion with compilers/intel/2017/update1 and mpi/intel/2017/update1/intel as that's what plumed 2.3.1 was built with.

heatherkellyucl commented 6 years ago

Same 13 failed on Grace.

heatherkellyucl commented 6 years ago

A little way into the tests with the first Intel 2017 build on Legion and still getting NaNs. (And the same 13 tests failed.)

heatherkellyucl commented 6 years ago

There are newer versions now - try those.

heatherkellyucl commented 6 years ago

Ah, this is the problem I was getting in the tests: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/plumed-users/pXU_PGjkF1I

This is just to notify that there is a bug in the calculation of RMSD and all RMSD derived quantities in PLUMED v2.4.0, appearing when using recent intel compiler (and perhaps other recent compilers). The bug basically appears when using SIMD instructions, which are enabled by compilers implementing OpenMP 4.0

https://github.com/plumed/plumed2/pull/343

I discovered the problem running the regression tests on a machine where I compiled with a recent intel compiler. Unfortunately (or fortunately?) at SISSA we have very old compilers, so I had no way to detect the problem in the past. Now I have access to a machine in CINECA where I compiled PLUMED 2.4. There is a collection of tests that reproducibly give incorrect results with intel 17. Notice that when using intel 17 several tests give slightly incorrect results (just because we store too many digits in the reference files), this is not a problem. But some regtests were reporting nan or other strange values. I noticed that all of them were involving some form of alignment, tracked down the problem to RMSD calculation, had a look at the simd instructions that we added recently, and discovered the bug.

Notice that not all the tests using RMSD are failing (actually, only a minority of them). In particular, those that crashed were using one of the following keywords: PCA, FIT_TO_TEMPLATE, PATH (no problem detected with PATHMSD)

E.g., test basic/rt63c computes the RMSD correctly, then applies FIT_TO_TEMPLATE and computes the RMSD again, which is now incorrect. None of the tests using PATHMSD is reporting errors.

Fixed in 2.4.1, so I'll try that.

heatherkellyucl commented 6 years ago

+ ERROR in test basic/rt-multi-1/
+ check file basic/rt-multi-1/report.txt for more information
+ ERROR in test crystallization/rt-sean-marks/
+ check file crystallization/rt-sean-marks/report.txt for more information
+ ERROR in test isdb/rt-emmi/
+ check file isdb/rt-emmi/report.txt for more information
+ ERROR in test isdb/rt-jcouplings/
+ check file isdb/rt-jcouplings/report.txt for more information
+ ERROR in test isdb/rt-jcouplings-mi/
+ check file isdb/rt-jcouplings-mi/report.txt for more information
+++++++++++++++++++++++++++++++++++++++++++++++++++++
+ Final report:
+ 279 tests performed, 128 tests not applicable
+ 5 errors found

Checked isdb/rt-emmi/report.txt - those are last-decimal-place differences all the way through until the last bit, where they compound. Will see if -fp-model strict clears them all up.
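
For triaging last-digit differences like these, a tolerance-based comparison can be more telling than a plain diff. The following is a sketch, not what the PLUMED regtest harness does: within_tolerance is a made-up helper, the tolerance is absolute, and it assumes the files being compared contain no tab characters.

```shell
# Return success if two files of whitespace-separated fields match,
# allowing numeric fields to differ by at most an absolute tolerance.
# Usage: within_tolerance <tol> <reference-file> <new-file>
# (within_tolerance is a hypothetical helper name.)
within_tolerance() {
    # paste joins line N of both files with a tab; awk splits each half
    # back into fields and compares them pairwise.
    paste "$2" "$3" | awk -F '\t' -v tol="$1" '
        BEGIN { tol = tol + 0; num = "^[-+]?[0-9]*[.]?[0-9]+([eE][-+]?[0-9]+)?$" }
        {
            n = split($1, a, " "); m = split($2, b, " ")
            if (n != m) exit 1
            for (i = 1; i <= n; i++) {
                if (a[i] == b[i]) continue
                # Anything non-numeric that differs (including nan) fails.
                if (a[i] !~ num || b[i] !~ num) exit 1
                d = a[i] - b[i]; if (d < 0) d = -d
                if (d > tol) exit 1
            }
        }'
}

# Example call (file names are hypothetical):
#   within_tolerance 1e-6 colvar.reference colvar.new && echo "within tolerance"
```

Because fields are split on whitespace, differences in column spacing alone are ignored, while a NaN on either side always counts as a failure.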

Checked basic/rt-multi-1/report.txt - these look like the "storing too many digits" issue mentioned above:

FAILURE
Diff for ff.0:
8,16c8,16
<  6 726.794919 0.000000
<  7 726.794919 0.000000
<  8 726.794919 0.000000
<  9 726.794919 0.000000
<  10 726.794919 0.000000
<  11 726.794919 0.000000
<  12 726.794919 0.000000
<  13 726.794919 0.000000
<  14 726.794919 0.000000
---
>  6 726.794919  0.000000
>  7 726.794919  0.000000
>  8 726.794919  0.000000
>  9 726.794919  0.000000
>  10 726.794919  0.000000
>  11 726.794919  0.000000
>  12 726.794919  0.000000
>  13 726.794919  0.000000
>  14 726.794919  0.000000
heatherkellyucl commented 6 years ago

The other tests are either single decimal differences or appear the same as the above. isdb/rt-jcouplings-mi/report.txt has a sign flip.

Diff for force.new:
473c473
<  0.012000 66    0.0018309683
---
>  0.012000 66    0.0018309682
476c476
<  0.012000 69   -0.0018309683
---
>  0.012000 69   -0.0018309682
heatherkellyucl commented 6 years ago

With -fp-model strict: only one error left.

+ ERROR in test isdb/rt-emmi/
+ check file isdb/rt-emmi/report.txt for more information
+++++++++++++++++++++++++++++++++++++++++++++++++++++
+ Final report:
+ 279 tests performed, 128 tests not applicable
+ 1 errors found

All the differences are in the last digit until the file ends with:

1813,1815c1813,1815
<  0.000000 1811 374.7142 374.7253
<  0.000000 1812 433.4478 433.4106
<  0.000000 1813 1873.6269 1873.5840
---
>  0.000000 1811 374.7142 374.7192
>  0.000000 1812 433.4478 433.3984
>  0.000000 1813 1873.6269 1873.5901
heatherkellyucl commented 6 years ago

plumed 2.4.1 installs:

heatherkellyucl commented 6 years ago

Tell you what, the results are consistent across machines, even from the old Legion node.

heatherkellyucl commented 6 years ago

When it comes to GROMACS, 2018.1 was released March 21 and there isn't a patch for it in a release of PLUMED yet (there is in github master, alongside the patch for 2016.5).

It may make sense to wait until after Easter for this one, and either patch 2018.1 or if there is no new plumed release, then patch 2016.4.

heatherkellyucl commented 6 years ago

No new plumed, so building gromacs 2016.4 patched with plumed 2.4.1 (containing hrex).

heatherkellyucl commented 6 years ago

module unload compilers mpi
module load compilers/intel/2017/update4
module load mpi/intel/2017/update3/intel
module load libmatheval
module load flex
module load openblas/0.2.14/intel-2015-update2
module load plumed/2.4.1/intel-2017-update4
module load gromacs/2016.4/plumed/intel-2017

Running gmx_mpi_d mdrun -h shows at the bottom:

Other options:

 -deffnm <string>
           Set the default filename for all file options
...
 -[no]hrex                  (no)
           Enable hamiltonian replica exchange
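
The same check can be scripted so that a build without the PLUMED patch is caught automatically. A minimal sketch, assuming the help text lists the option exactly as shown above; has_hrex_option is a made-up helper name.

```shell
# Return success if mdrun help text read from stdin lists the -hrex option.
# -F treats the pattern as a fixed string, so the brackets need no escaping;
# -e stops grep from parsing the leading dash as an option.
# (has_hrex_option is a hypothetical helper name.)
has_hrex_option() {
    grep -F -q -e '-[no]hrex'
}

# Example call, with the modules above loaded:
#   gmx_mpi_d mdrun -h 2>&1 | has_hrex_option && echo "hrex available"
```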
heatherkellyucl commented 6 years ago

Informed IN:02912232 and 02766079.