QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org
Other
293 stars 137 forks source link

randomize_grid does not randomize the PP grid as expected #4362

Closed markdewing closed 1 year ago

markdewing commented 1 year ago

I would expect the function NonLocalECPComponent::randomize_grid to randomize the grid origin across the entire sphere. But it does not - it's missing some coverage around the "poles". See the attached graph.

The root cause is the initialization of cth from a random number. It is currently cth = myRNG() - 0.5. I think it should be cth = 1.0 - 2*myRNG().

I'm not sure this issue has a practical impact, since the rotation is only meant to change the origin of the spherical integration grid and not cover the sphere by itself. If the grid is sufficiently dense, the other points will cover the missing regions from the origin. (But I didn't actually check this)

There is another minor annoyance in that when initializing the current code with all zeros for the rng values, the result is not the identity rotation. With the corrected initialization, using zeros does result in the identity rotation.

I tried to find the origin of the formula in randomize_grid. Nothing online seemed to match, and I couldn't recreate it through multiplying 2x2 rotation matrices (though that search was not exhaustive).

Plot of random rotations applied an input vector (0,0, 1) random_rotation

The script used to generate the plot

check_grid_randomize.tar.gz

prckent commented 1 year ago

Thanks Mark. Indeed this seems like a bug, since we shouldn't rely on grid density or symmetry to get the correct result. It should be correct even if we use one point, which it will not be based on your plot.

The functions in question are at https://github.com/QMCPACK/qmcpack/blob/develop/src/QMCHamiltonians/NonLocalECPComponent.cpp#L868 (we have two variants)

It is also interesting that we use 3 random numbers for this.

jtkrogel commented 1 year ago

This is how the code looked when randomize_grid was introduced in Mar. 2005 (commit: a9b5aadbe8b37609c7f65222df8d1daec5f97f93, file NonLocalPPotential.cpp):

  void NonLocalPPotential::RadialPotentialSet::randomize_grid(){
    const double twopi=6.28318531;
    double phi,psi,sph,cph,sth,cth,sps,cps;
    Tensor<double,3> rmat;
    /* Rotation matrix with Euler angles defined as: 
       1) counterclockwise rotation around z (phi)
       2) clockwise rotation around y (theta)
       3) counterclockwise rotation around z (psi) */
    phi=twopi*Random(); sph=sin(phi); cph=cos(phi);
    psi=twopi*Random(); sps=sin(psi); cps=cos(psi);
    cth=2*(Random()-0.5); sth=sqrt(1-cth*cth);
    rmat(0,0)= cph*cth*cps-sph*sps;
    rmat(0,1)= sph*cth*cps+cph*sps;
    rmat(0,2)=-sth*cps;
    rmat(1,0)=-cph*cth*sps-sph*cps;
    rmat(1,1)=-sph*cth*sps+cph*cps;
    rmat(1,2)= sth*sps;
    rmat(2,0)= cph*sth;
    rmat(2,1)= sph*sth;
    rmat(2,2)= cth;
    for (vector<PosType>::iterator it=sgridxyz_m.begin(); 
    it != sgridxyz_m.end(); ++it) *it=dot(rmat,*it);
  }

Notice 1) more comments, 2) the additional factor of 2 in cth that is now missing.

jtkrogel commented 1 year ago

The factor of two was removed when randomize_grid was moved to the header file a week later in commit 26eaba3fb248894ba78e517012644b3bd80aaec9

jtkrogel commented 1 year ago

The overall method may be closely related to details contained in Murnahan (https://doi.org/10.1073%2Fpnas.36.11.670) and Miles (https://doi.org/10.2307/2333716).

jtkrogel commented 1 year ago

Is the density uniform if you simply propagate a single vector many times with the uncorrected code?

markdewing commented 1 year ago

The plot shown (and the script) applies many random rotations to a single vector (unit vector along the z-axis). Do you mean some other scheme - like applying random rotations to random vectors? I didn't measure the resulting distribution in any more detail than just visual inspection.

prckent commented 1 year ago

It is quite likely that this is a very small bias in practical calculations since we have large default integration grids and the points used in the integration grids (which are referenced to electron positions) will vary considerably unlike the fixed (0,0,1) in this test. However we should fix it since there are cases when there could be a bias, including for coarse grids. That large runs are OK can be tested since we will see if the statistical tests change meaningfully. Tests have also been done vs Hartree-Fock orbitals in several cases, and the problem did not show up.

The fix unfortunately likely requires updating a great many deterministic tests. Mark, can you please check the situation with your fix? Then we will know how much work is entailed. We also clearly need a unit test that calls this function 10^6 times and verifies that we can evaluate a simple numerical integral correctly. e.g. We would presumably fail to integrate the surface area of a sphere with the current code and single point grids.

markdewing commented 1 year ago

Using a real, full precision build with the corrected grid randomization:

(I had left some stats for the tests earlier, but deleted them after I discovered I had disabled the grid randomization entirely)

jtkrogel commented 1 year ago

I agree with Paul. That we haven't seen a discrepancy before now in reference testing suggests any bias is very small. Likely worthwhile to explicitly test with a long DMC run (longer than a "long" test) of a small molecule (e.g. H2O) before and after the fix.

prckent commented 1 year ago

To gather statics, I am currently running 10 sets of all the long tests, with and without the proposed fix. This will take ~a day to run.

Added 14 Dec: Had to restart due to not setting adequate test timeouts. Rerunning with a smaller ensemble. So far it looks like the main breakage is in the deterministic tests.

prckent commented 1 year ago

@markdewing put the matrix rotater in Numerics and remove all physics references? This is only maths/numerics.

markdewing commented 1 year ago

I had an idea for incremental fixing of the deterministic tests. One of the features I want for testing is to be able to turn off the grid randomization from the input file. Implementing that feature would not cause a breaking change. Input files and deterministic tests could be updated individually to have the grid randomization turned off. Then applying the fix to the randomization wouldn't break the tests. The downside is then the tests would need to be updated again if we want grid randomization to be included in the test results. But maybe only a subset of them would need to get this second update.

prckent commented 1 year ago

Under what circumstances is the grid reoriented today? Some QMC codes reorient every single evaluation.

camelto2 commented 1 year ago

Under what circumstances is the grid reoriented today? Some QMC codes reorient every single evaluation.

It looks like it is every evaluation...for example see NonLocalECPotential::evaluateImp. gets randomized a few lines into the funciton

prckent commented 1 year ago

Test results from 4 runs of a real CPU build gcc no mpi, ctest -R 'deterministic|long'. No short or converter (etc.) tests run.

$ grep "Test.*Failed" build_real/out_?
build_real/out_1:1019/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.04 sec
build_real/out_1:1021/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.04 sec
build_real/out_2:  35/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.04 sec
build_real/out_2:  36/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.05 sec
build_real/out_2: 978/1089 Test  #334: long-heg_14_gamma-sjb-1-16-totenergy ..............................................................***Failed    0.05 sec
build_real/out_3:  35/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.03 sec
build_real/out_3:  36/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.04 sec
build_real/out_3: 132/1089 Test #1389: long-diamondC_2x1x1_pp-vmc_sdj-1-16-flux ..........................................................***Failed    0.08 sec
build_real/out_3: 171/1089 Test #1014: long-diamondC_1x1x1_pp-vmc_sdj_excited-1-16-flux ..................................................***Failed    0.07 sec
build_real/out_3: 550/1089 Test  #999: long-diamondC_1x1x1_pp-vmc-dmc-allp_sdj-1-16-2-totenergy ..........................................***Failed    0.05 sec
build_real/out_3: 952/1089 Test #1844: long-monoO_1x1x1_pp-vmc_sdj3-1-16-flux ............................................................***Failed    0.05 sec
build_real/out_4: 786/1089 Test #1388: long-diamondC_2x1x1_pp-vmc_sdj-1-16-samples .......................................................***Failed    0.04 sec
build_real/out_4: 793/1089 Test  #972: long-diamondC_1x1x1_pp-vmc_sdj-1-16-flux ..........................................................***Failed    0.05 sec
build_real/out_4: 802/1089 Test #1839: long-monoO_1x1x1_pp-vmc_sdj-1-16-samples ..........................................................***Failed    0.04 sec
build_real/out_4: 809/1089 Test  #449: long-O_ae_uhf_pyscf-vmc_hf_noj-1-16-totenergy .....................................................***Failed    0.03 sec
build_real/out_4: 810/1089 Test  #448: long-O_ae_uhf_pyscf-vmc_hf_noj-1-16-kinetic .......................................................***Failed    0.03 sec
build_real/out_4: 811/1089 Test  #451: long-O_ae_uhf_pyscf-vmc_hf_noj-1-16-samples .......................................................***Failed    0.03 sec
build_real/out_4: 813/1089 Test  #450: long-O_ae_uhf_pyscf-vmc_hf_noj-1-16-eeenergy ......................................................***Failed    0.04 sec
build_real/out_4: 868/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.03 sec
build_real/out_4: 869/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.03 sec
$ grep "Test.*Failed" build_real/out_?|wc -l
20

So, an average of 5 failed tests per run, and only long run failures.

Now with the proposed fix, too many deterministic tests fail than should be listed here:

$ grep "Test.*Failed" build_fixed_real/out_?|head -30
build_fixed_real/out_1: 116/1089 Test  #384: deterministic-C2_pp_msdj-H5-param-grad-1-1-CIcoeff_4 ..............................................***Failed    0.04 sec
build_fixed_real/out_1: 117/1089 Test  #385: deterministic-C2_pp_msdj-H5-param-grad-1-1-ud_4 ...................................................***Failed    0.04 sec
build_fixed_real/out_1: 118/1089 Test  #386: deterministic-C2_pp_msdj-H5-param-grad-1-1-udC_25 .................................................***Failed    0.04 sec
build_fixed_real/out_1: 131/1089 Test  #498: deterministic-LiH_pp-vmc_hf_sdj_xml-1-1-totenergy .................................................***Failed    0.03 sec
build_fixed_real/out_1: 134/1089 Test  #501: deterministic-LiH_pp-vmc_hf_sdj_xml-1-1-potential .................................................***Failed    0.03 sec
build_fixed_real/out_1: 137/1089 Test  #504: deterministic-LiH_pp-vmc_hf_sdj_xml-1-1-nonlocalecp ...............................................***Failed    0.04 sec
build_fixed_real/out_1: 141/1089 Test  #508: deterministic-LiH_pp-vmc_hf_sdj_hdf5-1-1-totenergy ................................................***Failed    0.04 sec
build_fixed_real/out_1: 143/1089 Test  #511: deterministic-LiH_pp-vmc_hf_sdj_hdf5-1-1-potential ................................................***Failed    0.03 sec
build_fixed_real/out_1: 146/1089 Test  #514: deterministic-LiH_pp-vmc_hf_sdj_hdf5-1-1-nonlocalecp ..............................................***Failed    0.04 sec
build_fixed_real/out_1: 150/1089 Test  #576: deterministic-FeCO6_gms-vmc_noj-1-1-totenergy .....................................................***Failed    0.03 sec
build_fixed_real/out_1: 152/1089 Test  #579: deterministic-FeCO6_gms-vmc_noj-1-1-potential .....................................................***Failed    0.03 sec
build_fixed_real/out_1: 157/1089 Test  #582: deterministic-FeCO6_gms-vmc_noj-1-1-nonlocalecp ...................................................***Failed    0.03 sec
build_fixed_real/out_1: 159/1089 Test  #585: deterministic-FeCO6_gms-vmcbatch_noj-1-1-totenergy ................................................***Failed    0.04 sec
build_fixed_real/out_1: 162/1089 Test  #588: deterministic-FeCO6_gms-vmcbatch_noj-1-1-potential ................................................***Failed    0.03 sec
build_fixed_real/out_1: 164/1089 Test  #591: deterministic-FeCO6_gms-vmcbatch_noj-1-1-nonlocalecp ..............................................***Failed    0.03 sec
build_fixed_real/out_1: 167/1089 Test  #604: deterministic-FeCO6_pyscf-vmc_noj-1-1-totenergy ...................................................***Failed    0.04 sec
build_fixed_real/out_1: 169/1089 Test  #607: deterministic-FeCO6_pyscf-vmc_noj-1-1-potential ...................................................***Failed    0.03 sec
build_fixed_real/out_1: 172/1089 Test  #610: deterministic-FeCO6_pyscf-vmc_noj-1-1-nonlocalecp .................................................***Failed    0.04 sec
build_fixed_real/out_1: 176/1089 Test  #613: deterministic-FeCO6_pyscf-vmcbatch_noj-1-1-totenergy ..............................................***Failed    0.04 sec
build_fixed_real/out_1: 178/1089 Test  #616: deterministic-FeCO6_pyscf-vmcbatch_noj-1-1-potential ..............................................***Failed    0.04 sec
build_fixed_real/out_1: 185/1089 Test  #619: deterministic-FeCO6_pyscf-vmcbatch_noj-1-1-nonlocalecp ............................................***Failed    0.05 sec
build_fixed_real/out_1: 230/1089 Test #1026: deterministic-diamondC_1x1x1_pp-dmc-estimator-spindensity-1-1-check ...............................***Failed    0.50 sec
build_fixed_real/out_1: 232/1089 Test #1041: deterministic-diamondC_1x1x1_pp-vmc_sdj-1-1-totenergy .............................................***Failed    0.03 sec
build_fixed_real/out_1: 236/1089 Test #1020: deterministic-diamondC_1x1x1_pp-dmc-estimator-density-1-1-check ...................................***Failed    0.63 sec
build_fixed_real/out_1: 237/1089 Test #1034: deterministic-diamondC_1x1x1_pp-dmc-estimator-1rdm-1-1-check ......................................***Failed    0.49 sec
build_fixed_real/out_1: 240/1089 Test #1044: deterministic-diamondC_1x1x1_pp-vmc_sdj-1-1-potential .............................................***Failed    0.04 sec
build_fixed_real/out_1: 246/1089 Test #1047: deterministic-diamondC_1x1x1_pp-vmc_sdj-1-1-nonlocalecp ...........................................***Failed    0.05 sec
build_fixed_real/out_1: 247/1089 Test #1052: deterministic-diamondC_1x1x1_pp-vmc_sdj_excited-1-1-totenergy .....................................***Failed    0.04 sec
build_fixed_real/out_1: 250/1089 Test #1055: deterministic-diamondC_1x1x1_pp-vmc_sdj_excited-1-1-potential .....................................***Failed    0.03 sec
build_fixed_real/out_1: 257/1089 Test #1058: deterministic-diamondC_1x1x1_pp-vmc_sdj_excited-1-1-nonlocalecp ...................................***Failed    0.05 sec

$ grep "Test.*Failed" build_fixed_real/out_?|wc -l
1100

An average 275 failures per run.

$ grep "Test.*Failed" build_fixed_real/out_?| grep long
build_fixed_real/out_1: 828/1089 Test  #355: long-li2_sto-sj_dmc-1-16-variance .................................................................***Failed    0.06 sec
build_fixed_real/out_1:1016/1089 Test #1697: long-diamondC_1x1x1-Gaussian_pp_MSD-1-16-localecp .................................................***Failed    0.03 sec
build_fixed_real/out_1:1019/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.04 sec
build_fixed_real/out_1:1021/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.04 sec
build_fixed_real/out_2: 870/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.03 sec
build_fixed_real/out_2: 871/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.03 sec
build_fixed_real/out_3:  37/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.04 sec
build_fixed_real/out_3:  38/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.04 sec
build_fixed_real/out_3: 324/1089 Test #1389: long-diamondC_2x1x1_pp-vmc_sdj-1-16-flux ..........................................................***Failed    0.04 sec
build_fixed_real/out_3: 328/1089 Test #1013: long-diamondC_1x1x1_pp-vmc_sdj_excited-1-16-samples ...............................................***Failed    0.04 sec
build_fixed_real/out_3: 345/1089 Test  #976: long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16-samples .................................................***Failed    0.04 sec
build_fixed_real/out_3: 360/1089 Test  #357: long-li2_sto-sj_dmc-1-16-2-variance ...............................................................***Failed    0.04 sec
build_fixed_real/out_3: 367/1089 Test  #858: long-bccH_1x1x1_ae-vmc_sdj-1-16-samples ...........................................................***Failed    0.03 sec
build_fixed_real/out_4: 780/1089 Test  #977: long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16-flux ....................................................***Failed    0.07 sec
build_fixed_real/out_4: 874/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.03 sec
build_fixed_real/out_4: 875/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.04 sec
$ grep "Test.*Failed" build_fixed_real/out_?| grep long|wc -l
16

An average 4 failed long runs. This is actually one better than the unfixed code, and it seems that there were problems unrelated to the fix (SIGHUP?) with both long-diamondC_1x1x1_pp-vmc_sdj_excited-1-16 and long-diamondC_1x1x1_pp-vmc_sdj_meshf-1-16, explaining the reported samples failures. And none of the long nonlocal checks fails, so we know that the nonlocal pseudopotential energies are consistent.

Therefore for the elements and levels of statistical accuracies of our test set it is fair to conclude that the problem averages out but our deterministic tests are highly sensitive to it.

ye-luo commented 1 year ago

@prckent close?