The local ECP kernel is one that is known to not be reproducible between runs, i.e., it is buggy; something to do with walker and GPU thread/block counts. Previously the differences have been small enough to ignore; this problem indicates the kernel must be fixed. There are a couple of open issues on this.
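For context, here is a minimal CUDA sketch (not the actual QMCPACK kernel) of why a per-walker energy accumulation can differ between otherwise identical runs: floating-point addition is not associative, and the order in which blocks commit atomic updates depends on scheduling, which varies with walker and thread/block counts.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical accumulation loosely resembling a local-ECP sum: each thread
// adds one contribution to a shared total with atomicAdd. The commit order
// of the atomics is nondeterministic, and since floating-point addition is
// not associative, the rounded total can change from run to run even with
// bitwise-identical inputs.
__global__ void accumulate(const float* contrib, int n, float* total)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    atomicAdd(total, contrib[i]);
}

int main()
{
  const int n = 1 << 20;
  float *contrib, *total;
  cudaMallocManaged(&contrib, n * sizeof(float));
  cudaMallocManaged(&total, sizeof(float));
  for (int i = 0; i < n; ++i)
    contrib[i] = 1.0f / (1.0f + i); // fixed, reproducible inputs
  *total = 0.0f;
  accumulate<<<(n + 255) / 256, 256>>>(contrib, n, total);
  cudaDeviceSynchronize();
  printf("%.9g\n", *total); // the last digits typically vary between runs
  cudaFree(contrib);
  cudaFree(total);
  return 0;
}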
You don't state explicitly, but is the non-local ECP term correct?
The non-local ECP term appears to be correct.
To save time debugging this, the necessary pwscf file is available for the next 3 weeks at https://ftp.ornl.gov/filedownload?ftp=e;dir=WATER (replace WATER with uP24qpBh6M3N).
I did some VMC experimentation. On a single Kepler GPU with a fixed seed and either 1 or 320 walkers, I was able to reproduce the previously noticed non-determinism within just a few moves, i.e., multiple runs of the executable generate slightly different results. From this short run and my current inputs we can't say whether the energies are "bad", but the local electron-ion and electron-electron terms are not repeatable. The much harder to compute kinetic energy and non-local electron-ion terms are repeatable (?!).
VMC runs with 320 walkers are essentially the same, i.e. no 0.3 Ha shift.
All inputs and outputs from the test, including the wavefunction: https://ftp.ornl.gov/filedownload?ftp=e;dir=ICE (replace ICE with uP21fJWh6csV).
<qmc method="vmc" move="pbyp" gpu="yes">
<parameter name="blocks"> 40 </parameter>
<parameter name="substeps"> 1 </parameter>
<parameter name="steps"> 100 </parameter>
<parameter name="warmupSteps"> 500 </parameter>
<parameter name="usedrift"> no </parameter>
<parameter name="timestep"> 0.3 </parameter>
<parameter name="walkers"> 320 </parameter>
</qmc>
qmca -e 0 vmc*.dat
vmc_cuda series 1
LocalEnergy = -17.1638 +/- 0.0011
Variance = 0.4991 +/- 0.0063
Kinetic = 13.508 +/- 0.018
LocalPotential = -30.672 +/- 0.018
ElecElec = 11.1265 +/- 0.0097
LocalECP = -41.409 +/- 0.019
NonLocalECP = -1.3970 +/- 0.0095
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 295.097 +/- 0.036
BlockWeight = 32000.00 +/- 0.00
BlockCPU = 1.248 +/- 0.018
AcceptRatio = 0.47567 +/- 0.00017
Efficiency = 1908.34 +/- 0.00
TotalTime = 49.91 +/- 0.00
TotalSamples = 1280000 +/- 0
vmc_omp series 1
LocalEnergy = -17.1718 +/- 0.0012
Variance = 0.5031 +/- 0.0092
Kinetic = 13.510 +/- 0.016
LocalPotential = -30.682 +/- 0.016
ElecElec = 11.1155 +/- 0.0087
LocalECP = -41.408 +/- 0.017
NonLocalECP = -1.3964 +/- 0.0094
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 295.375 +/- 0.039
BlockWeight = 32000.00 +/- 0.00
BlockCPU = 1.0728 +/- 0.0024
AcceptRatio = 0.47613 +/- 0.00015
Efficiency = 1885.79 +/- 0.00
TotalTime = 42.91 +/- 0.00
TotalSamples = 1280000 +/- 0
@jtkrogel Where and how were you able to produce the CPU-GPU energy shift? Machine, QMCPACK version, software versions, node/MPI/thread counts, etc.
In my DMC tests so far I have not found such a sizable shift.
The results are from runs performed by Andrea Zen (@zenandrea) on Titan with QMCPACK 3.6.0 on 4 nodes, 1 mpi task per node, 1 thread per mpi task (see files job_qmcpack_gpu-titan, input_dmcgpu.xml, and out_dmcgpu in TEST_DMC.zip).
The build details, as far as I know, follow our build_olcf_titan.sh script, but with changes to the boost and fftw libraries: boost/1.62.0, fftw/3.3.4.11. Presumably with the real AoS code.
@zenandrea, please check if I have missed something.
Dear @jtkrogel and @prckent, almost everything is as you said, but I used fftw/3.3.4.8, which is loaded by default. I confirm that I compiled the real AoS code.
In particular, this is my compilation script:
export CRAYPE_LINK_TYPE=dynamic
module swap PrgEnv-pgi PrgEnv-gnu
module load cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
module load cray-hdf5-parallel
module load cmake3
module load fftw
export FFTW_HOME=$FFTW_DIR/..
module load boost/1.67.0
export CC=cc
export CXX=CC
mkdir build_titan_gpu
cd build_titan_gpu
cmake -DQMC_CUDA=1 ..
cmake -DQMC_CUDA=1 ..
make -j 8
ls -l bin/qmcpack
Thanks. Nothing unreasonable in the above. It should work without problems.
FFTW would not cause the failures. If FFTW were wrong - and I don't recall a single case ever where it has been - the kinetic energy and Monte Carlo walk in general would also be wrong.
I have reproduced this problem using the current develop version, with builds that pass the unit tests and the diamond and LiH integration tests. I used the updated build script (#1472), i.e., nothing out of the ordinary.
Using 1 MPI task, 16 OpenMP threads, and 0/1 GPUs, I see a 0.6 Hartree (!) difference in the DMC energies (series 2 & 3 below), while the VMC energies agree. The difference is in the local part of the pseudopotential. The analysis below is not done carefully, but it is interesting that the kinetic energy and acceptance ratio appear to match between CPU and GPU.
A 4 node run shows a slightly smaller disagreement between the codes.
qmca -q ev ../titan_orig*/*.scalar.dat
LocalEnergy Variance ratio
../titan_orig_1mpi/qmc_cpu series 1 -17.176063 +/- 0.016221 0.595062 +/- 0.154097 0.0346
../titan_orig_1mpi/qmc_cpu series 2 -17.219573 +/- 0.002273 0.461457 +/- 0.003292 0.0268
../titan_orig_1mpi/qmc_cpu series 3 -17.220429 +/- 0.001601 0.490561 +/- 0.007181 0.0285
../titan_orig_1mpi/qmc_gpu series 1 -17.155363 +/- 0.025336 0.467373 +/- 0.056839 0.0272
../titan_orig_1mpi/qmc_gpu series 2 -16.647208 +/- 0.000720 1.010610 +/- 0.005110 0.0607
../titan_orig_1mpi/qmc_gpu series 3 -16.639882 +/- 0.001205 1.026227 +/- 0.007102 0.0617
pk7@titan-ext4:/lustre/atlas/ ... /Zen_water_problem/titan_orig_1mpi> qmca ../titan_orig_1mpi/qmc_cpu.s003.scalar.dat
../titan_orig_1mpi/qmc_cpu series 3
LocalEnergy = -17.2187 +/- 0.0020
Variance = 0.4878 +/- 0.0063
Kinetic = 13.587 +/- 0.024
LocalPotential = -30.805 +/- 0.025
ElecElec = 11.115 +/- 0.015
LocalECP = -41.502 +/- 0.031
NonLocalECP = -1.425 +/- 0.016
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 296.972 +/- 0.073
BlockWeight = 634774.40 +/- 1923.92
BlockCPU = 302.38 +/- 1.12
AcceptRatio = 0.993562 +/- 0.000029
Efficiency = 0.93 +/- 0.00
TotalTime = 1511.88 +/- 0.00
TotalSamples = 3173872 +/- 0
pk7@titan-ext4:/lustre/atlas/ ... /Zen_water_problem/titan_orig_1mpi> qmca ../titan_orig_1mpi/qmc_gpu.s003.scalar.dat
../titan_orig_1mpi/qmc_gpu series 3
LocalEnergy = -16.6399 +/- 0.0012
Variance = 1.0262 +/- 0.0071
Kinetic = 13.533 +/- 0.019
LocalPotential = -30.173 +/- 0.019
ElecElec = 11.032 +/- 0.012
LocalECP = -40.787 +/- 0.025
NonLocalECP = -1.4246 +/- 0.0066
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 277.912 +/- 0.042
BlockWeight = 638124.30 +/- 1124.31
BlockCPU = 26.026 +/- 0.039
AcceptRatio = 0.993609 +/- 0.000016
Efficiency = 14.94 +/- 0.00
TotalTime = 260.26 +/- 0.00
TotalSamples = 6381243 +/- 0
Also worth noting that the DMC energy is above the VMC one...
Attempting to bracket the problem: the discrepancy persists in the simplified runs below (no Jastrow, BFD pseudopotential, per the directory names). Still puzzling is why our existing carbon diamond or LiH tests don't trigger this bug.
LocalEnergy Variance ratio
../titan_orig_1mpi_noj_bfd/qmc_cpu series 1 -17.017532 +/- 0.053990 3.475117 +/- 0.377453 0.2042
../titan_orig_1mpi_noj_bfd/qmc_cpu series 2 -17.257461 +/- 0.003199 3.439663 +/- 0.020524 0.1993
../titan_orig_1mpi_noj_bfd/qmc_cpu series 3 -17.271529 +/- 0.003633 3.671973 +/- 0.031433 0.2126
../titan_orig_1mpi_noj_bfd/qmc_gpu series 1 -16.898081 +/- 0.064148 3.766366 +/- 0.306030 0.2229
../titan_orig_1mpi_noj_bfd/qmc_gpu series 2 -16.694704 +/- 0.005017 4.001500 +/- 0.038960 0.2397
../titan_orig_1mpi_noj_bfd/qmc_gpu series 3 -16.687953 +/- 0.002943 4.170178 +/- 0.020878 0.2499
By varying the number of walkers I was able to break VMC (good suggestion by @jtkrogel). The bug is back to looking like a bad kernel.
The linked VMC test gives incorrect results on Titan: titan_vmc_only.zip (146.46 MB), https://ftp.ornl.gov/filedownload?ftp=e;dir=FRUIT (replace FRUIT with uP10HwMh8qGU).
Puzzlingly, these same files give correct results on oxygen (currently Intel Xeon + Kepler + clang 6 + CUDA 10.0). A naively incorrect kernel would give reproducible errors.
@prckent I can reproduce your numbers on Titan.
@prckent When I go back to CUDA 7.5 (using GCC 4.9.3 and an older version of QMCPACK), I get the correct results:
qmc_gpu series 1
LocalEnergy = -17.1716 +/- 0.0021
Variance = 0.490 +/- 0.017
Kinetic = 13.481 +/- 0.025
LocalPotential = -30.652 +/- 0.025
ElecElec = 11.129 +/- 0.013
LocalECP = -41.424 +/- 0.029
NonLocalECP = -1.364 +/- 0.014
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 295.354 +/- 0.074
BlockWeight = 2560.00 +/- 0.00
BlockCPU = 0.310562 +/- 0.000093
AcceptRatio = 0.47525 +/- 0.00029
Efficiency = 16660.91 +/- 0.00
TotalTime = 19.57 +/- 0.00
TotalSamples = 161280 +/- 0
So this could be an issue with the Cuda installation on Titan...
@atillack Interesting. If you are using a standalone workstation with CUDA 7.5 (!), the question is whether you can break VMC by e.g. varying the number of walkers, or if running Andrea's original DMC case still breaks.
@atillack Is there a specific build config + QMCPACK version you can recommend that does not display the problem on Titan? This may represent a practical way @zenandrea can get correct production runs sooner.
@jtkrogel QMCPACK 3.5.0
Here are the modules I have loaded (to get gcc/4.9.3, running "module unload gcc; module load gcc/4.9.3" after "module swap PrgEnv-pgi PrgEnv-gnu" works):
Currently Loaded Modulefiles:
 1) eswrap/1.3.3-1.020200.1280.0
 2) craype-network-gemini
 3) craype/2.5.13
 4) cray-mpich/7.6.3
 5) craype-interlagos
 6) lustredu/1.4
 7) xalt/0.7.5
 8) git/2.13.0
 9) module_msg/0.1
10) modulator/1.2.0
11) hsi/5.0.2.p1
12) DefApps
13) cray-libsci/16.11.1
14) udreg/2.3.2-1.0502.10518.2.17.gem
15) ugni/6.0-1.0502.10863.8.28.gem
16) pmi/5.0.12
17) dmapp/7.0.1-1.0502.11080.8.74.gem
18) gni-headers/4.0-1.0502.10859.7.8.gem
19) xpmem/0.1-2.0502.64982.5.3.gem
20) dvs/2.5_0.9.0-1.0502.2188.1.113.gem
21) alps/5.2.4-2.0502.9774.31.12.gem
22) rca/1.0.0-2.0502.60530.1.63.gem
23) atp/2.1.1
24) PrgEnv-gnu/5.2.82
25) cray-hdf5/1.10.0.3
26) cmake3/3.9.0
27) fftw/3.3.4.8
28) boost/1.62.0
29) subversion/1.9.3
30) cudatoolkit/7.5.18-1.0502.10743.2.1
31) gcc/4.9.3
@prckent @jtkrogel I just looked into the Cuda 9 changelog and found this wonderful snippet:
The compiler has transitioned to a new code-generation back end for Kepler GPUs. PTXAS now includes a new option --new-sm3x-opt=false that allows developers to continue using the legacy back end. Use ptxas --help to get more information about these command-line options.
This at least may explain what is going on. I am not sure how to pass down this parameter to ptxas though ...
Edit: Testing now.
@prckent @jtkrogel CUDA 7.5 is still the temporary solution. The ptxas flag (-Xptxas --new-sm3x-opt=false, which can be put in CUDA_NVCC_FLAGS) only gets the results halfway to the correct number with CUDA 9.1 on Titan:
qmc_gpu series 1
LocalEnergy = -16.9815 +/- 0.0021
Variance = 0.797 +/- 0.015
Kinetic = 13.483 +/- 0.022
LocalPotential = -30.465 +/- 0.022
ElecElec = 11.125 +/- 0.012
LocalECP = -41.235 +/- 0.025
NonLocalECP = -1.362 +/- 0.013
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 289.167 +/- 0.073
BlockWeight = 2560.00 +/- 0.00
BlockCPU = 0.302379 +/- 0.000059
AcceptRatio = 0.47550 +/- 0.00025
Efficiency = 12570.17 +/- 0.00
TotalTime = 24.49 +/- 0.00
TotalSamples = 207360 +/- 0
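For reference, a sketch of how the flag could be passed through the build, assuming the FindCUDA-based CMake setup of that era (CUDA_NVCC_FLAGS is a semicolon-separated CMake list):

# From a clean build directory; the semicolon separates the two nvcc arguments.
cmake -DQMC_CUDA=1 -DCUDA_NVCC_FLAGS="-Xptxas;--new-sm3x-opt=false" ..
make -j 8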
@prckent @jtkrogel After talking with our Nvidia representatives: there is a code generation regression in 9.1 that is fixed in 9.2. So on Titan, it seems the only workaround is to use 7.5 for the time being.
If a version newer than QMCPACK 3.5.0 is needed, some minor code changes are required in order to compile with CUDA 7.5.
@prckent @jtkrogel Another data point: I also get correct results if the CUDA 9.1 toolkit is loaded when executing a QMCPACK binary that was compiled with CUDA 7.5. This does seem to point to code generation being the issue.
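For reference, a sketch of that check on Titan (module versions copied from the lists above; the input file name is hypothetical):

# Binary built against CUDA 7.5; swap in the 9.1 runtime before executing.
module swap cudatoolkit/7.5.18-1.0502.10743.2.1 cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
aprun -n 1 ./bin/qmcpack vmc.in.xml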
@prckent @jtkrogel On Summit using Cuda 9.2 the correct results are also obtained:
qmc_gpu series 1
LocalEnergy = -17.1707 +/- 0.0020
Variance = 0.489 +/- 0.016
Kinetic = 13.480 +/- 0.025
LocalPotential = -30.651 +/- 0.025
ElecElec = 11.128 +/- 0.013
LocalECP = -41.421 +/- 0.029
NonLocalECP = -1.364 +/- 0.014
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 295.321 +/- 0.073
BlockWeight = 2560.00 +/- 0.00
BlockCPU = 0.179291 +/- 0.000020
AcceptRatio = 0.47529 +/- 0.00029
Efficiency = 29197.52 +/- 0.00
TotalTime = 11.47 +/- 0.00
TotalSamples = 163840 +/- 0
Dear @atillack @prckent @jtkrogel, it seems very likely that the source of the issue is cudatoolkit version 9.1. Shall we ask the OLCF system administrators if they can install the 9.2 version?
Other packages besides QMCPACK might be affected by this kind of problem!
@zenandrea Please ask - I am not sure that 9.2 will be installed given that Titan has only a few more months of accessibility, but other packages are certainly at risk. Are you able to move to Summit or is your time only on Titan?
This is a scary problem and I am not keen on recommending use of older software.
@prckent I have half my resources on Titan and half on Summit. I'm going to ask straight away.
@prckent @zenandrea Since CUDA 9.1's change was seen as mostly a performance regression, the Nvidia folks are looking at our kernel, which gives bad numbers under 9.1, to see if there is a possible workaround.
@zenandrea It's a good idea to ask, but like Paul I am uncertain whether this will happen in time to be useful. In the interim, with small code changes (see post above) it is possible to compile a current version of QMCPACK on Titan with CUDA 7.5, but this only works with GCC 4.9.3, as otherwise modules are missing.
I am still open to the idea that we have illegal/buggy code, and that different CUDA versions, GPUs, etc. expose the problem in different ways. However, "bad generated code" is the best explanation given the established facts. What is still so strange is that all the difficult and costly parts of the calculation involving the wavefunction are correct.
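One way to probe the illegal-code hypothesis is cuda-memcheck, which shipped with the CUDA toolkits in question (the input file name is hypothetical):

cuda-memcheck ./bin/qmcpack vmc.in.xml                  # flags out-of-bounds and misaligned device accesses
cuda-memcheck --tool initcheck ./bin/qmcpack vmc.in.xml # flags reads of uninitialized device memory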
I have a solution to use 7.5 with the current QMCPACK. Will PR soon.
@ye-luo Thanks!
I failed to find a clean solution through the source because I need to hack CMake. To meet our production needs, I'm making all the build variants and will put them in a place anyone can access.
I'll note that an initialization bug similar to #1518 could explain these problems.
I checked. Unfortunately #1518 is not related to this bug.
@prckent The problem seems confined to Titan. CUDA 9.1 on Summit also gives the correct results:
qmc_gpu series 1
LocalEnergy = -17.1703 +/- 0.0020
Variance = 0.496 +/- 0.017
Kinetic = 13.479 +/- 0.024
LocalPotential = -30.650 +/- 0.024
ElecElec = 11.128 +/- 0.012
LocalECP = -41.420 +/- 0.028
NonLocalECP = -1.365 +/- 0.013
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 295.316 +/- 0.071
BlockWeight = 2560.00 +/- 0.00
BlockCPU = 0.182407 +/- 0.000021
AcceptRatio = 0.47538 +/- 0.00028
Efficiency = 27168.89 +/- 0.00
TotalTime = 12.40 +/- 0.00
TotalSamples = 174080 +/- 0
I put both v3.6 and v3.7 binaries at /lustre/atlas/world-shared/mat189/qmcpack_binaries_titan. They should last until the retirement of Titan.
To work around the bug in CUDA 9.1 that gives wrong results, the following steps compile CudaCoulomb.cu with CUDA 7.5. After building the QMCPACK CUDA version:
1) From the build folder, cd src/QMCHamiltonians
2) find -name qmcham_generated_CudaCoulomb.cu.o.RELEASE.cmake and open it with an editor.
3) touch ./CMakeFiles/qmcham.dir/qmcham_generated_CudaCoulomb.cu.o.RELEASE.cmake
4) Modify CUDA_HOST_COMPILER from /opt/cray/craype/2.5.13/bin/cc to /opt/gcc/4.9.3/bin/gcc
5) Replace all occurrences of cudatoolkit9.1/9.1.85_3.10-1.0502.df1cc54.3.1 with cudatoolkit7.5/7.5.18-1.0502.10743.2.1
6) Type make -j32 and you should see "Built target qmcham". If CMake is re-triggered, repeat steps 2-4, because CMake overwrites qmcham_generated_CudaCoulomb.cu.o.RELEASE.cmake.
7) cd ../QMCApp ; sh CMakeFiles/qmcpack.dir/link.txt
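A possible one-liner for steps 4 and 5 above (a sketch; the paths are copied verbatim from those steps):

# Run from build/src/QMCHamiltonians; edits the generated .cmake file in place.
sed -i -e 's|/opt/cray/craype/2.5.13/bin/cc|/opt/gcc/4.9.3/bin/gcc|g' \
    -e 's|cudatoolkit9.1/9.1.85_3.10-1.0502.df1cc54.3.1|cudatoolkit7.5/7.5.18-1.0502.10743.2.1|g' \
    CMakeFiles/qmcham.dir/qmcham_generated_CudaCoulomb.cu.o.RELEASE.cmake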
I modified the title for posterity to record the actual determined problem. Should we close this issue? @zenandrea are the new binaries working for you?
Dear @prckent, the new binaries seem to work well on the cases I have tested so far.
Disagreement between CPU and GPU DMC total energies was observed for a water molecule in periodic boundary conditions (8 A cubic cell, CASINO pseudopotentials, Titan at OLCF, QMCPACK 3.6.0). Issue originally reported by Andrea Zen. Original inputs and outputs: TEST_DMC.zip
From the attached outputs, the VMC energies agree, while the DMC energies differ by about 0.3 Ha. The difference is entirely attributable to the local part of the ECP.
Note: the DMC error bars are not statistically meaningful here (10 blocks), but the difference is large enough to support this conclusion.
The oddity here is that the error is only seen in DMC and is limited to a single potential energy term. This may indicate a bug in LocalECP that surfaces with increased walker count on the GPU (1 walker/GPU in VMC, 320 walkers/GPU in DMC). A series of VMC runs with an increasing number of walkers will likely show this, as in the sketch below.
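A minimal sketch of such a scan (file and directory names are hypothetical; assumes a VMC input template with an NWALKERS placeholder in <parameter name="walkers">):

# Run the same VMC input with increasing walker counts, each in its own
# directory so the scalar.dat files do not overwrite each other.
for nw in 1 8 32 128 320; do
  mkdir -p w${nw}
  sed "s/NWALKERS/${nw}/" vmc.in.xml.template > w${nw}/vmc.in.xml
  (cd w${nw} && qmcpack vmc.in.xml > vmc.log)
done
qmca -q ev w*/*.scalar.dat   # watch whether LocalECP drifts with walker count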