CPU and GPU energies differ

QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support

http://www.qmcpack.org

CPU and GPU energies differ #2348

Closed mcbennet closed 1 year ago

mcbennet commented 4 years ago

Describe the bug

I am seeing a discrepancy between VMC energies when using GPU and CPU code.

(GPU) -1792.069 +/- 0.080
(CPU) -1793.274 +/- 0.039

The system is 80-atom BM SCO.

To Reproduce

Steps to reproduce the behavior:

commit cf7e9bd4c1f3b27e162c402b5ce7c2e412347c9f
config/build_olcf_summit.sh is used to compile
Inputs and outputs for CPU and GPU runs are attached to this report: issue.tar.gz

System:

system name: summit

jtkrogel commented 4 years ago

I'll just note that Tomohiro has seen similar issues on Summit but for a different electronic system.

prckent commented 4 years ago

Thanks for reporting this. 80 atoms implies enough electrons that this would be a good run for GPU acceleration. Looking at cdash, where we test the CUDA version, I see more test failures than I would like for this version, but importantly the various solid-state carbon diamond, LiH, and NiO runs look to be OK (e.g. https://cdash.qmcpack.org/CDash/testDetails.php?test=7693162&build=108401 ). The problem could be Summit-related, related to the large system size (we only test small electron counts), or a more general problem due to the ongoing refactoring.

For convenience, could you please post the qmca output here? Since this is VMC, we can diagnose whether any single component is bad; in DMC, bad energies upset the sampled distribution.

mcbennet commented 4 years ago

                              CPU                           GPU
  LocalEnergy       =   -1793.274 +/-       0.039     -1792.069 +/-        0.080
  Variance          =       45.26 +/-        0.78         63.51 +/-         1.31
  Kinetic           =     1080.37 +/-        0.42       1077.99 +/-         1.39
  LocalPotential    =    -2873.64 +/-        0.43      -2870.05 +/-         1.43
  ElecElec          =      353.86 +/-        0.23        353.47 +/-         0.49
  LocalECP          =    -2289.09 +/-        0.76      -2283.74 +/-         2.06
  NonLocalECP       =      208.00 +/-        0.28        206.62 +/-         0.95
  IonIon            =    -1146.41 +/-        0.00      -1146.41 +/-         0.00
  LocalEnergy_sq    =  3215878.48 +/-      138.72    3211574.31 +/-       287.44
  MPC               =      354.17 +/-        0.23        353.78 +/-         0.50
  KEcorr            =        0.02 +/-        0.00          0.11 +/-         0.00
  BlockWeight       =     3360.00 +/-        0.00        480.00 +/-         0.00
  BlockCPU          =     20.2163 +/-      0.0079        15.857 +/-        0.020
  AcceptRatio       =    0.556250 +/-    0.000067       0.55647 +/-      0.00017
  Efficiency        =        3.54 +/-        0.00          7.39 +/-         0.00
  TotalTime         =      202.16 +/-        0.00        158.57 +/-         0.00
  TotalSamples      =       33600 +/-           0          4800 +/-            0
  ------------------------------------------------------------------------------
  CorrectedEnergy   =   -1792.944 +/-       0.038     -1791.648 +/-        0.073
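
One way to read the two columns side by side is to express each CPU/GPU difference in units of the combined error bar. The following is a minimal sketch, not part of qmca: it assumes the two runs are statistically independent and that the quoted error bars are reliable one-sigma estimates, which may not hold for short runs. The helper name `sigma_gap` is made up for illustration; the numbers are copied from the table above.

```python
import math

# (mean, error bar) pairs copied from the qmca output above, as (CPU, GPU).
qmca = {
    "LocalEnergy":    ((-1793.274, 0.039), (-1792.069, 0.080)),
    "Variance":       ((45.26, 0.78),      (63.51, 1.31)),
    "Kinetic":        ((1080.37, 0.42),    (1077.99, 1.39)),
    "LocalPotential": ((-2873.64, 0.43),   (-2870.05, 1.43)),
    "ElecElec":       ((353.86, 0.23),     (353.47, 0.49)),
    "LocalECP":       ((-2289.09, 0.76),   (-2283.74, 2.06)),
    "NonLocalECP":    ((208.00, 0.28),     (206.62, 0.95)),
}

def sigma_gap(cpu, gpu):
    """Return (GPU - CPU, combined error, difference / combined error).

    Assumes independent runs, so the error bars add in quadrature.
    """
    diff = gpu[0] - cpu[0]
    err = math.hypot(cpu[1], gpu[1])
    return diff, err, diff / err

gaps = {name: sigma_gap(cpu, gpu) for name, (cpu, gpu) in qmca.items()}

for name, (diff, err, z) in gaps.items():
    print(f"{name:15s} diff = {diff:9.3f} +/- {err:6.3f}  ({z:+6.1f} sigma)")
```

Under these assumptions, LocalEnergy and Variance differ by more than ten combined error bars. The individual components carry much larger error bars than the total, so each of them on its own is separated by a smaller multiple.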

prckent commented 4 years ago

Ouch. So the wavefunctions are bad/different in some way - the kinetic energy disagrees. This implies bad orbitals, updates, or Jastrow gradients.
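
To see how the total energy gap splits across components, the reported means can be summed and differenced directly. This is only a rough sketch: it assumes, as the numbers in the qmca table earlier in this thread suggest, that LocalEnergy is the sum of Kinetic, ElecElec, LocalECP, NonLocalECP, and IonIon (with MPC and KEcorr reported separately), and it ignores the error bars entirely. The names `TERMS`, `rebuilt_energy`, and `shifts` are illustrative only.

```python
# Means copied from the qmca table earlier in this thread.
cpu = {"Kinetic": 1080.37, "ElecElec": 353.86, "LocalECP": -2289.09,
       "NonLocalECP": 208.00, "IonIon": -1146.41, "LocalEnergy": -1793.274}
gpu = {"Kinetic": 1077.99, "ElecElec": 353.47, "LocalECP": -2283.74,
       "NonLocalECP": 206.62, "IonIon": -1146.41, "LocalEnergy": -1792.069}

# Assumed decomposition of LocalEnergy (an assumption checked below, not a
# statement about how QMCPACK defines it internally).
TERMS = ("Kinetic", "ElecElec", "LocalECP", "NonLocalECP", "IonIon")

def rebuilt_energy(run):
    """Sum the component means to compare against the reported LocalEnergy."""
    return sum(run[t] for t in TERMS)

# Per-component shift, GPU minus CPU.
shifts = {t: gpu[t] - cpu[t] for t in TERMS}
total_shift = gpu["LocalEnergy"] - cpu["LocalEnergy"]

for label, run in (("CPU", cpu), ("GPU", gpu)):
    print(f"{label}: components sum to {rebuilt_energy(run):.3f}, "
          f"reported LocalEnergy {run['LocalEnergy']:.3f}")

for t in TERMS:
    print(f"{t:12s} shift (GPU - CPU) = {shifts[t]:+7.2f}")
print(f"{'sum':12s} shift (GPU - CPU) = {sum(shifts.values()):+7.2f} "
      f"(reported total {total_shift:+.3f})")
```

The component sums reproduce the reported totals to within rounding for both runs, so the per-component shifts account for the whole energy gap. Whether any single shift is statistically meaningful depends on the error bars, which this sketch leaves out.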

prckent commented 4 years ago

Four months later, any new guesses on what might be causing this issue? Clearly it is a reason to not use the legacy CUDA code. It is interesting that this is a highly spin polarized setup - 192 up spin, 168 down spin electrons.

ye-luo commented 1 year ago

Legacy CUDA has been removed from the code base https://github.com/QMCPACK/qmcpack/pull/4431