Closed mcbennet closed 1 year ago
I'll just note that Tomohiro has seen similar issues on Summit but for a different electronic system.
Thanks for reporting this. With 80 atoms there are enough electrons that this should be a good run for GPU acceleration. Looking at CDash, where we test the CUDA version, I see more test failures than I would like for this version, but importantly the various solid-state carbon diamond, LiH, and NiO runs look OK, e.g. https://cdash.qmcpack.org/CDash/testDetails.php?test=7693162&build=108401 . The problem could be Summit-related, large-system-related (we only test small electron counts), or a more general problem due to the ongoing refactoring.
For convenience, could you please post the qmca outputs here? Since this is VMC, we can diagnose whether any single component is bad; in DMC, bad energies upset the sampled distribution.
                          CPU                           GPU
LocalEnergy      =   -1793.274 +/- 0.039           -1792.069 +/- 0.080
Variance         =       45.26 +/- 0.78                63.51 +/- 1.31
Kinetic          =     1080.37 +/- 0.42              1077.99 +/- 1.39
LocalPotential   =    -2873.64 +/- 0.43             -2870.05 +/- 1.43
ElecElec         =      353.86 +/- 0.23               353.47 +/- 0.49
LocalECP         =    -2289.09 +/- 0.76             -2283.74 +/- 2.06
NonLocalECP      =      208.00 +/- 0.28               206.62 +/- 0.95
IonIon           =    -1146.41 +/- 0.00             -1146.41 +/- 0.00
LocalEnergy_sq   =  3215878.48 +/- 138.72         3211574.31 +/- 287.44
MPC              =      354.17 +/- 0.23               353.78 +/- 0.50
KEcorr           =        0.02 +/- 0.00                 0.11 +/- 0.00
BlockWeight      =     3360.00 +/- 0.00               480.00 +/- 0.00
BlockCPU         =     20.2163 +/- 0.0079             15.857 +/- 0.020
AcceptRatio      =    0.556250 +/- 0.000067          0.55647 +/- 0.00017
Efficiency       =        3.54 +/- 0.00                 7.39 +/- 0.00
TotalTime        =      202.16 +/- 0.00               158.57 +/- 0.00
TotalSamples     =       33600 +/- 0                    4800 +/- 0
------------------------------------------------------------------------------
CorrectedEnergy  =   -1792.944 +/- 0.038           -1791.648 +/- 0.073
Ouch. So the wavefunctions are bad or different in some way: the kinetic energy disagrees. This implies bad orbitals, updates, or Jastrow gradients.
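One way to put numbers on which components disagree is to divide each GPU minus CPU difference by the combined error bar. The following is a minimal sketch, assuming the two runs are independent and their error bars are uncorrelated and roughly Gaussian. The means and errors are copied from the qmca output posted in this thread; the names `qmca_values` and `significance` are illustrative and are not part of qmca or QMCPACK.

```python
import math

# (mean, error) pairs copied from the qmca output in this thread.
# Each entry is ((cpu_mean, cpu_err), (gpu_mean, gpu_err)).
qmca_values = {
    "LocalEnergy":    ((-1793.274, 0.039), (-1792.069, 0.080)),
    "Kinetic":        ((1080.37, 0.42),    (1077.99, 1.39)),
    "LocalPotential": ((-2873.64, 0.43),   (-2870.05, 1.43)),
    "ElecElec":       ((353.86, 0.23),     (353.47, 0.49)),
    "LocalECP":       ((-2289.09, 0.76),   (-2283.74, 2.06)),
    "NonLocalECP":    ((208.00, 0.28),     (206.62, 0.95)),
}


def significance(cpu, gpu):
    """Return (difference, combined error, difference in units of that error).

    Assumes the two estimates are statistically independent, so the
    errors add in quadrature.
    """
    diff = gpu[0] - cpu[0]
    err = math.hypot(cpu[1], gpu[1])  # sqrt(cpu_err**2 + gpu_err**2)
    return diff, err, diff / err


results = {name: significance(cpu, gpu) for name, (cpu, gpu) in qmca_values.items()}

for name, (diff, err, z) in results.items():
    print(f"{name:15s} GPU-CPU = {diff:9.3f} +/- {err:6.3f}  ({z:+.1f} sigma)")
```

The individual components carry much larger error bars than the total energy, so a component can look only mildly off here while the total is clearly inconsistent. Longer GPU runs would tighten the component error bars.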
Four months later, are there any new guesses about what might be causing this issue? Clearly it is a reason not to use the legacy CUDA code. It is interesting that this is a highly spin-polarized setup: 192 up-spin and 168 down-spin electrons.
Legacy CUDA has been removed from the code base: https://github.com/QMCPACK/qmcpack/pull/4431
Describe the bug
I am seeing a discrepancy between VMC energies when using the GPU and CPU code.
(GPU) -1792.069 +/- 0.080
(CPU) -1793.274 +/- 0.039
The system is 80-atom BM SCO.
To Reproduce
Steps to reproduce the behavior:
System: