The local ECP kernel is one that is known to not be reproducible between runs, i.e., it is buggy; something to do with walker and GPU thread/block counts. Previously the differences have been small enough to ignore; this problem indicates the kernel must be fixed. There are a couple of open issues on this.
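For context, here is a minimal CUDA sketch (not the actual QMCPACK kernel) of why a per-walker energy accumulation can differ between otherwise identical runs: floating-point addition is not associative, and the order in which blocks commit atomic updates depends on scheduling, which varies with walker and thread/block counts.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical accumulation loosely resembling a local-ECP sum: each thread
// adds one contribution to a shared total with atomicAdd. The commit order
// of the atomics is nondeterministic, and since floating-point addition is
// not associative, the rounded total can change from run to run even with
// bitwise-identical inputs.
__global__ void accumulate(const float* contrib, int n, float* total)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    atomicAdd(total, contrib[i]);
}

int main()
{
  const int n = 1 << 20;
  float *contrib, *total;
  cudaMallocManaged(&contrib, n * sizeof(float));
  cudaMallocManaged(&total, sizeof(float));
  for (int i = 0; i < n; ++i)
    contrib[i] = 1.0f / (1.0f + i); // fixed, reproducible inputs
  *total = 0.0f;
  accumulate<<<(n + 255) / 256, 256>>>(contrib, n, total);
  cudaDeviceSynchronize();
  printf("%.9g\n", *total); // the last digits typically vary between runs
  cudaFree(contrib);
  cudaFree(total);
  return 0;
}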
You don't state explicitly, but is the non-local ECP term correct?
The non-local ECP term appears to be correct.
To save time debugging this, the necessary pwscf file is available for the next 3 weeks at https://ftp.ornl.gov/filedownload?ftp=e;dir=WATER (replace WATER with uP24qpBh6M3N).
I did some VMC experimentation. On a single Kepler GPU with a fixed seed and either 1 or 320 walkers, I was able to reproduce the previously noticed non-determinism within just a few moves, i.e., multiple runs of the executable generate slightly different results. From this short run and my current inputs we can't say whether the energies are "bad", but the local electron-ion and electron-electron terms are not repeatable. The much harder to compute kinetic energy and non-local electron-ion terms are repeatable (?!).
VMC runs with 320 walkers are essentially the same, i.e. no 0.3 Ha shift.
All inputs and outputs from the test, including the wavefunction: https://ftp.ornl.gov/filedownload?ftp=e;dir=ICE (replace ICE with uP21fJWh6csV).
<qmc method="vmc" move="pbyp" gpu="yes">
<parameter name="blocks"> 40 </parameter>
<parameter name="substeps"> 1 </parameter>
<parameter name="steps"> 100 </parameter>
<parameter name="warmupSteps"> 500 </parameter>
<parameter name="usedrift"> no </parameter>
<parameter name="timestep"> 0.3 </parameter>
<parameter name="walkers"> 320 </parameter>
</qmc>
qmca -e 0 vmc*.dat
vmc_cuda series 1
LocalEnergy = -17.1638 +/- 0.0011
Variance = 0.4991 +/- 0.0063
Kinetic = 13.508 +/- 0.018
LocalPotential = -30.672 +/- 0.018
ElecElec = 11.1265 +/- 0.0097
LocalECP = -41.409 +/- 0.019
NonLocalECP = -1.3970 +/- 0.0095
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 295.097 +/- 0.036
BlockWeight = 32000.00 +/- 0.00
BlockCPU = 1.248 +/- 0.018
AcceptRatio = 0.47567 +/- 0.00017
Efficiency = 1908.34 +/- 0.00
TotalTime = 49.91 +/- 0.00
TotalSamples = 1280000 +/- 0
vmc_omp series 1
LocalEnergy = -17.1718 +/- 0.0012
Variance = 0.5031 +/- 0.0092
Kinetic = 13.510 +/- 0.016
LocalPotential = -30.682 +/- 0.016
ElecElec = 11.1155 +/- 0.0087
LocalECP = -41.408 +/- 0.017
NonLocalECP = -1.3964 +/- 0.0094
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 295.375 +/- 0.039
BlockWeight = 32000.00 +/- 0.00
BlockCPU = 1.0728 +/- 0.0024
AcceptRatio = 0.47613 +/- 0.00015
Efficiency = 1885.79 +/- 0.00
TotalTime = 42.91 +/- 0.00
TotalSamples = 1280000 +/- 0
@jtkrogel Where and how were you able to produce the CPU-GPU energy shift? Machine, QMCPACK version, software versions, node/MPI/thread counts, etc.
In my DMC tests so far I have not found such a sizable shift.
The results are from runs performed by Andrea Zen (@zenandrea) on Titan with QMCPACK 3.6.0 on 4 nodes, 1 mpi task per node, 1 thread per mpi task (see files job_qmcpack_gpu-titan, input_dmcgpu.xml, and out_dmcgpu in TEST_DMC.zip).
The build details, as far as I know, follow our build_olcf_titan.sh script, but with changes to the boost and fftw libraries: boost/1.62.0, fftw/3.3.4.11. Presumably with the real AoS code.
@zenandrea, please check if I have missed something.
Dear @jtkrogel and @prckent, almost everything is as you said, but I used fftw/3.3.4.8, which is loaded by default. I confirm that I compiled the real AoS code.
In particular, this is my compilation script:
export CRAYPE_LINK_TYPE=dynamic
module swap PrgEnv-pgi PrgEnv-gnu
module load cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
module load cray-hdf5-parallel
module load cmake3
module load fftw
export FFTW_HOME=$FFTW_DIR/..
module load boost/1.67.0
export CC=cc
export CXX=CC
mkdir build_titan_gpu
cd build_titan_gpu
cmake -DQMC_CUDA=1 ..
cmake -DQMC_CUDA=1 ..
make -j 8
ls -l bin/qmcpack
Thanks. Nothing unreasonable in the above. It should work without problems.
FFTW would not cause the failures. If FFTW were wrong - and I don't recall a single case ever where it has been - the kinetic energy and Monte Carlo walk in general would also be wrong.
I have reproduced this problem using the current develop version, with builds that pass the unit tests and the diamond and LiH integration tests. I used the updated build script (#1472), i.e., nothing out of the ordinary.
Using 1 MPI task, 16 OpenMP threads, and 0/1 GPUs, I see a 0.6 Hartree (!) difference in the DMC energies (series 2 & 3 below), while the VMC energies agree. The difference is in the local part of the pseudopotential. The analysis below is not done carefully, but it is interesting that the kinetic energy and acceptance ratio appear to match between CPU and GPU.
A 4 node run shows a slightly smaller disagreement between the codes.
qmca -q ev ../titan_orig*/*.scalar.dat
LocalEnergy Variance ratio
../titan_orig_1mpi/qmc_cpu series 1 -17.176063 +/- 0.016221 0.595062 +/- 0.154097 0.0346
../titan_orig_1mpi/qmc_cpu series 2 -17.219573 +/- 0.002273 0.461457 +/- 0.003292 0.0268
../titan_orig_1mpi/qmc_cpu series 3 -17.220429 +/- 0.001601 0.490561 +/- 0.007181 0.0285
../titan_orig_1mpi/qmc_gpu series 1 -17.155363 +/- 0.025336 0.467373 +/- 0.056839 0.0272
../titan_orig_1mpi/qmc_gpu series 2 -16.647208 +/- 0.000720 1.010610 +/- 0.005110 0.0607
../titan_orig_1mpi/qmc_gpu series 3 -16.639882 +/- 0.001205 1.026227 +/- 0.007102 0.0617
pk7@titan-ext4:/lustre/atlas/ ... /Zen_water_problem/titan_orig_1mpi> qmca ../titan_orig_1mpi/qmc_cpu.s003.scalar.dat
../titan_orig_1mpi/qmc_cpu series 3
LocalEnergy = -17.2187 +/- 0.0020
Variance = 0.4878 +/- 0.0063
Kinetic = 13.587 +/- 0.024
LocalPotential = -30.805 +/- 0.025
ElecElec = 11.115 +/- 0.015
LocalECP = -41.502 +/- 0.031
NonLocalECP = -1.425 +/- 0.016
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 296.972 +/- 0.073
BlockWeight = 634774.40 +/- 1923.92
BlockCPU = 302.38 +/- 1.12
AcceptRatio = 0.993562 +/- 0.000029
Efficiency = 0.93 +/- 0.00
TotalTime = 1511.88 +/- 0.00
TotalSamples = 3173872 +/- 0
pk7@titan-ext4:/lustre/atlas/ ... /Zen_water_problem/titan_orig_1mpi> qmca ../titan_orig_1mpi/qmc_gpu.s003.scalar.dat
../titan_orig_1mpi/qmc_gpu series 3
LocalEnergy = -16.6399 +/- 0.0012
Variance = 1.0262 +/- 0.0071
Kinetic = 13.533 +/- 0.019
LocalPotential = -30.173 +/- 0.019
ElecElec = 11.032 +/- 0.012
LocalECP = -40.787 +/- 0.025
NonLocalECP = -1.4246 +/- 0.0066
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 277.912 +/- 0.042
BlockWeight = 638124.30 +/- 1124.31
BlockCPU = 26.026 +/- 0.039
AcceptRatio = 0.993609 +/- 0.000016
Efficiency = 14.94 +/- 0.00
TotalTime = 260.26 +/- 0.00
TotalSamples = 6381243 +/- 0
Also worth noting that the DMC energy is above the VMC one...
Attempting to bracket the problem: the discrepancy persists in the simplified runs below (no Jastrow, BFD pseudopotential, per the directory names). Still puzzling is why our existing carbon diamond or LiH tests don't trigger this bug.
LocalEnergy Variance ratio
../titan_orig_1mpi_noj_bfd/qmc_cpu series 1 -17.017532 +/- 0.053990 3.475117 +/- 0.377453 0.2042
../titan_orig_1mpi_noj_bfd/qmc_cpu series 2 -17.257461 +/- 0.003199 3.439663 +/- 0.020524 0.1993
../titan_orig_1mpi_noj_bfd/qmc_cpu series 3 -17.271529 +/- 0.003633 3.671973 +/- 0.031433 0.2126
../titan_orig_1mpi_noj_bfd/qmc_gpu series 1 -16.898081 +/- 0.064148 3.766366 +/- 0.306030 0.2229
../titan_orig_1mpi_noj_bfd/qmc_gpu series 2 -16.694704 +/- 0.005017 4.001500 +/- 0.038960 0.2397
../titan_orig_1mpi_noj_bfd/qmc_gpu series 3 -16.687953 +/- 0.002943 4.170178 +/- 0.020878 0.2499
By varying the number of walkers I was able to break VMC (good suggestion by @jtkrogel). The bug is back to looking like a bad kernel.
The linked VMC test gives incorrect results on Titan: titan_vmc_only.zip (146.46 MB), https://ftp.ornl.gov/filedownload?ftp=e;dir=FRUIT (replace FRUIT with uP10HwMh8qGU).
Puzzlingly, these same files give correct results on oxygen (currently Intel Xeon + Kepler + clang 6 + CUDA 10.0). A naively incorrect kernel would give reproducible errors.
@prckent I can reproduce your numbers on Titan.
@prckent When I go back to CUDA 7.5 (using GCC 4.9.3 and an older version of QMCPACK), I get the correct results:
qmc_gpu series 1
LocalEnergy = -17.1716 +/- 0.0021
Variance = 0.490 +/- 0.017
Kinetic = 13.481 +/- 0.025
LocalPotential = -30.652 +/- 0.025
ElecElec = 11.129 +/- 0.013
LocalECP = -41.424 +/- 0.029
NonLocalECP = -1.364 +/- 0.014
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 295.354 +/- 0.074
BlockWeight = 2560.00 +/- 0.00
BlockCPU = 0.310562 +/- 0.000093
AcceptRatio = 0.47525 +/- 0.00029
Efficiency = 16660.91 +/- 0.00
TotalTime = 19.57 +/- 0.00
TotalSamples = 161280 +/- 0
So this could be an issue with the Cuda installation on Titan...
@atillack Interesting. If you are using a standalone workstation with CUDA 7.5 (!), the question is whether you can break VMC by e.g. varying the number of walkers, or if running Andrea's original DMC case still breaks.
@atillack Is there a specific build config + QMCPACK version you can recommend that does not display the problem on Titan? This may represent a practical way @zenandrea can get correct production runs sooner.
@jtkrogel QMCPACK 3.5.0
Here are the modules I have loaded (to get gcc/4.9.3, running "module unload gcc; module load gcc/4.9.3" after "module swap PrgEnv-pgi PrgEnv-gnu" works):
Currently Loaded Modulefiles:
 1) eswrap/1.3.3-1.020200.1280.0
 2) craype-network-gemini
 3) craype/2.5.13
 4) cray-mpich/7.6.3
 5) craype-interlagos
 6) lustredu/1.4
 7) xalt/0.7.5
 8) git/2.13.0
 9) module_msg/0.1
10) modulator/1.2.0
11) hsi/5.0.2.p1
12) DefApps
13) cray-libsci/16.11.1
14) udreg/2.3.2-1.0502.10518.2.17.gem
15) ugni/6.0-1.0502.10863.8.28.gem
16) pmi/5.0.12
17) dmapp/7.0.1-1.0502.11080.8.74.gem
18) gni-headers/4.0-1.0502.10859.7.8.gem
19) xpmem/0.1-2.0502.64982.5.3.gem
20) dvs/2.5_0.9.0-1.0502.2188.1.113.gem
21) alps/5.2.4-2.0502.9774.31.12.gem
22) rca/1.0.0-2.0502.60530.1.63.gem
23) atp/2.1.1
24) PrgEnv-gnu/5.2.82
25) cray-hdf5/1.10.0.3
26) cmake3/3.9.0
27) fftw/3.3.4.8
28) boost/1.62.0
29) subversion/1.9.3
30) cudatoolkit/7.5.18-1.0502.10743.2.1
31) gcc/4.9.3
@prckent @jtkrogel I just looked into the Cuda 9 changelog and found this wonderful snippet:
The compiler has transitioned to a new code-generation back end for Kepler GPUs. PTXAS now includes a new option --new-sm3x-opt=false that allows developers to continue using the legacy back end. Use ptxas --help to get more information about these command-line options.
This at least may explain what is going on. I am not sure how to pass down this parameter to ptxas though ...
Edit: Testing now.
@prckent @jtkrogel CUDA 7.5 is still the temporary solution. The ptxas flag (-Xptxas --new-sm3x-opt=false, which can be put in CUDA_NVCC_FLAGS) only gets the results halfway to the correct number with CUDA 9.1 on Titan:
qmc_gpu series 1
LocalEnergy = -16.9815 +/- 0.0021
Variance = 0.797 +/- 0.015
Kinetic = 13.483 +/- 0.022
LocalPotential = -30.465 +/- 0.022
ElecElec = 11.125 +/- 0.012
LocalECP = -41.235 +/- 0.025
NonLocalECP = -1.362 +/- 0.013
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 289.167 +/- 0.073
BlockWeight = 2560.00 +/- 0.00
BlockCPU = 0.302379 +/- 0.000059
AcceptRatio = 0.47550 +/- 0.00025
Efficiency = 12570.17 +/- 0.00
TotalTime = 24.49 +/- 0.00
TotalSamples = 207360 +/- 0
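For reference, a sketch of how the flag could be passed through the build, assuming the FindCUDA-based CMake setup of that era (CUDA_NVCC_FLAGS is a semicolon-separated CMake list):

# From a clean build directory; the semicolon separates the two nvcc arguments.
cmake -DQMC_CUDA=1 -DCUDA_NVCC_FLAGS="-Xptxas;--new-sm3x-opt=false" ..
make -j 8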
@prckent @jtkrogel After talking with our Nvidia representatives: there is a code generation regression in 9.1 that is fixed in 9.2. So on Titan, it seems the only workaround is to use 7.5 for the time being.
If a version newer than QMCPACK 3.5.0 is needed, some minor code changes are required in order to compile with CUDA 7.5.
@prckent @jtkrogel Another data point: I also get correct results if the CUDA 9.1 toolkit is loaded when executing a QMCPACK binary that was compiled with CUDA 7.5. This does seem to point to code generation being the issue.
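For reference, a sketch of that check on Titan (module versions copied from the lists above; the input file name is hypothetical):

# Binary built against CUDA 7.5; swap in the 9.1 runtime before executing.
module swap cudatoolkit/7.5.18-1.0502.10743.2.1 cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
aprun -n 1 ./bin/qmcpack vmc.in.xml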
@prckent @jtkrogel On Summit using Cuda 9.2 the correct results are also obtained:
qmc_gpu series 1
LocalEnergy = -17.1707 +/- 0.0020
Variance = 0.489 +/- 0.016
Kinetic = 13.480 +/- 0.025
LocalPotential = -30.651 +/- 0.025
ElecElec = 11.128 +/- 0.013
LocalECP = -41.421 +/- 0.029
NonLocalECP = -1.364 +/- 0.014
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 295.321 +/- 0.073
BlockWeight = 2560.00 +/- 0.00
BlockCPU = 0.179291 +/- 0.000020
AcceptRatio = 0.47529 +/- 0.00029
Efficiency = 29197.52 +/- 0.00
TotalTime = 11.47 +/- 0.00
TotalSamples = 163840 +/- 0
Dear @atillack @prckent @jtkrogel, it seems very likely that the source of the issue is cudatoolkit version 9.1. Shall we ask the OLCF system administrators if they can install the 9.2 version?
Other packages besides QMCPACK might be affected by this kind of problem!
@zenandrea Please ask - I am not sure that 9.2 will be installed given that Titan has only a few more months of accessibility, but other packages are certainly at risk. Are you able to move to Summit or is your time only on Titan?
This is a scary problem and I am not keen on recommending use of older software.
@prckent I have half my resources on Titan and half on Summit. I'm going to ask straight away.
@prckent @zenandrea Since CUDA 9.1's change was seen as mostly a performance regression, the Nvidia folks are looking at our kernel, which gives bad numbers under 9.1, to see if there is a possible workaround.
@zenandrea It's a good idea to ask, but like Paul I am uncertain whether this will happen in time to be useful. In the interim, with small code changes (see post above) it is possible to compile a current version of QMCPACK on Titan with CUDA 7.5, but this only works with GCC 4.9.3, as otherwise modules are missing.
I am still open to the idea that we have illegal/buggy code, and that different CUDA versions, GPUs, etc. expose the problem in different ways. However, "bad generated code" is the best explanation given the established facts. What is still so strange is that all the difficult and costly parts of the calculation involving the wavefunction are correct.
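One way to probe the illegal-code hypothesis is cuda-memcheck, which shipped with the CUDA toolkits in question (the input file name is hypothetical):

cuda-memcheck ./bin/qmcpack vmc.in.xml                  # flags out-of-bounds and misaligned device accesses
cuda-memcheck --tool initcheck ./bin/qmcpack vmc.in.xml # flags reads of uninitialized device memory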
I have a solution to use 7.5 with the current QMCPACK. Will PR soon.
@ye-luo Thanks!
I failed to find a clean solution through the source because I need to hack CMake. To meet our production needs, I'm making all the build variants and will put them in a place anyone can access.
I'll note that an initialization bug similar to #1518 could explain these problems.
I checked. Unfortunately #1518 is not related to this bug.
@prckent The problem seems confined to Titan. CUDA 9.1 on Summit also gives the correct results:
qmc_gpu series 1
LocalEnergy = -17.1703 +/- 0.0020
Variance = 0.496 +/- 0.017
Kinetic = 13.479 +/- 0.024
LocalPotential = -30.650 +/- 0.024
ElecElec = 11.128 +/- 0.012
LocalECP = -41.420 +/- 0.028
NonLocalECP = -1.365 +/- 0.013
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 295.316 +/- 0.071
BlockWeight = 2560.00 +/- 0.00
BlockCPU = 0.182407 +/- 0.000021
AcceptRatio = 0.47538 +/- 0.00028
Efficiency = 27168.89 +/- 0.00
TotalTime = 12.40 +/- 0.00
TotalSamples = 174080 +/- 0
I put both v3.6 and v3.7 binaries at /lustre/atlas/world-shared/mat189/qmcpack_binaries_titan. They should last until the retirement of Titan.
To work around the bug in CUDA 9.1 that gives wrong results, the following steps compile CudaCoulomb.cu with CUDA 7.5. After building the QMCPACK CUDA version:
1) From the build folder, cd src/QMCHamiltonians
2) find -name qmcham_generated_CudaCoulomb.cu.o.RELEASE.cmake and open it with an editor.
3) touch ./CMakeFiles/qmcham.dir/qmcham_generated_CudaCoulomb.cu.o.RELEASE.cmake
4) Modify CUDA_HOST_COMPILER from /opt/cray/craype/2.5.13/bin/cc to /opt/gcc/4.9.3/bin/gcc
5) Replace all occurrences of cudatoolkit9.1/9.1.85_3.10-1.0502.df1cc54.3.1 with cudatoolkit7.5/7.5.18-1.0502.10743.2.1
6) Type make -j32 and you should see "Built target qmcham". If CMake is re-triggered, repeat steps 2-4, because CMake overwrites qmcham_generated_CudaCoulomb.cu.o.RELEASE.cmake.
7) cd ../QMCApp ; sh CMakeFiles/qmcpack.dir/link.txt
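A possible one-liner for steps 4 and 5 above (a sketch; the paths are copied verbatim from those steps):

# Run from build/src/QMCHamiltonians; edits the generated .cmake file in place.
sed -i -e 's|/opt/cray/craype/2.5.13/bin/cc|/opt/gcc/4.9.3/bin/gcc|g' \
    -e 's|cudatoolkit9.1/9.1.85_3.10-1.0502.df1cc54.3.1|cudatoolkit7.5/7.5.18-1.0502.10743.2.1|g' \
    CMakeFiles/qmcham.dir/qmcham_generated_CudaCoulomb.cu.o.RELEASE.cmake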
I modified the title for posterity to record the actual determined problem. Should we close this issue? @zenandrea are the new binaries working for you?
Dear @prckent, the new binaries seem to work well on the cases I have tested so far.
Disagreement between CPU and GPU DMC total energies was observed for a water molecule in periodic boundary conditions (8 A cubic cell, CASINO pseudopotentials, Titan at OLCF, QMCPACK 3.6.0). Issue originally reported by Andrea Zen. Original inputs and outputs: TEST_DMC.zip
From the attached outputs, the VMC energies agree, while the DMC energies differ by about 0.3 Ha. The difference is entirely attributable to the local part of the ECP.
Note: the DMC error bars are not statistically meaningful here (10 blocks), but the difference is large enough to support this conclusion.
The oddity here is that the error is only seen in DMC and is limited to a single potential energy term. This may indicate a bug in LocalECP that surfaces with increased walker count on the GPU (1 walker/GPU in VMC, 320 walkers/GPU in DMC). A series of VMC runs with an increasing number of walkers will likely show this, as in the sketch below.
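A minimal sketch of such a scan (file and directory names are hypothetical; assumes a VMC input template with an NWALKERS placeholder in <parameter name="walkers">):

# Run the same VMC input with increasing walker counts, each in its own
# directory so the scalar.dat files do not overwrite each other.
for nw in 1 8 32 128 320; do
  mkdir -p w${nw}
  sed "s/NWALKERS/${nw}/" vmc.in.xml.template > w${nw}/vmc.in.xml
  (cd w${nw} && qmcpack vmc.in.xml > vmc.log)
done
qmca -q ev w*/*.scalar.dat   # watch whether LocalECP drifts with walker count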