Implement orbital evaluation

willwheelera commented 8 months ago

PySCF's eval_gto seems to be quite slow. Code for other methods doesn't call it frequently, so it hasn't been a bottleneck except in QMC. We can probably code it up in a faster way.

See https://pyscf.org/user/gto.html#basis-format for basis format definitions in PySCF.

I've gotten started trying this out -- we need tests to make sure the new functions are consistent with the old ones, make sure all cases are implemented, and we need profiling to see where we can get performance improvements. Then we also need to implement PBC evaluations.

lkwagner commented 8 months ago

I did a little bit of profiling using this setup:

import pyscf
import pyscf.pbc
import numpy as np 
import pyscf.pbc.dft
cell = pyscf.pbc.gto.Cell()
cell.atom = '''C     0.      0.      0.    
              C     0.8917  0.8917  0.8917
              C     1.7834  1.7834  0.    
              C     2.6751  2.6751  0.8917
              C     1.7834  0.      1.7834
              C     2.6751  0.8917  2.6751
              C     0.      1.7834  1.7834
              C     0.8917  2.6751  2.6751'''
cell.basis = 'ccecp-ccpvdz'
cell.ecp = 'ccecp'
cell.exp_to_discard = 0.2
cell.a = np.eye(3)*3.5668
cell.verbose=3
cell.build()
mf  = pyscf.pbc.dft.RKS(cell).density_fit()
mf.chkfile = 'diamond.chk'
mf.kernel()

And this run:

import pyqmc.api as pyq
import time

cell, mf = pyq.recover_pyscf("diamond.chk")
#cell.rcut=1
enacc = pyq.EnergyAccumulator(cell, threshold=10)
wf, to_opt = pyq.generate_wf(cell, mf)
configs = pyq.initial_guess(cell, 500)
start = time.perf_counter()
pyq.vmc(wf, configs, nblocks=1, accumulators={'energy':enacc})
print("time to run VMC", time.perf_counter()-start)

With everything default, the profile looked like:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1521/1    0.020    0.000   50.200   50.200 {built-in method builtins.exec}
        1    0.000    0.000   50.200   50.200 run_vmc2.py:1(<module>)
        1    0.000    0.000   47.875   47.875 mc.py:161(vmc)
        1    0.015    0.015   47.875   47.875 mc.py:89(vmc_worker)
       10    0.000    0.000   31.100    3.110 accumulators.py:53(avg)
       10    0.000    0.000   31.099    3.110 accumulators.py:30(__call__)
     3521    0.031    0.000   27.868    0.008 orbitals.py:321(aos)
     3521   27.326    0.008   27.617    0.008 eval_gto.py:31(eval_gto)
       10    0.003    0.000   15.370    1.537 energy.py:44(kinetic)
      320    0.003    0.000   15.364    0.048 multiplywf.py:107(gradient_laplacian)
      320    0.002    0.000   15.353    0.048 multiplywf.py:108(<listcomp>)
      640    0.002    0.000   15.326    0.024 multiplywf.py:102(gradient_value)
      640    0.002    0.000   15.311    0.024 multiplywf.py:103(<listcomp>)
      320    0.019    0.000   12.614    0.039 slater.py:382(gradient_laplacian)
      640    0.015    0.000   11.535    0.018 slater.py:353(gradient_value)
       10    0.012    0.001    9.714    0.971 eval_ecp.py:6(ecp)
     2560    0.086    0.000    9.703    0.004 eval_ecp.py:62(ecp_ea)
     2560    0.006    0.000    8.006    0.003 multiplywf.py:92(testvalue)
     2560    0.005    0.000    7.976    0.003 multiplywf.py:94(<listcomp>)

Setting rcut = 1 resulted in:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1521/1    0.019    0.000   38.562   38.562 {built-in method builtins.exec}
        1    0.000    0.000   38.562   38.562 run_vmc2.py:1(<module>)
        1    0.000    0.000   36.018   36.018 mc.py:161(vmc)
        1    0.016    0.016   36.018   36.018 mc.py:89(vmc_worker)
       10    0.000    0.000   24.368    2.437 accumulators.py:53(avg)
       10    0.000    0.000   24.368    2.437 accumulators.py:30(__call__)
     3521    0.030    0.000   16.374    0.005 orbitals.py:321(aos)
     3521   15.843    0.004   16.126    0.005 eval_gto.py:31(eval_gto)
       10    0.003    0.000   11.309    1.131 energy.py:44(kinetic)
      320    0.004    0.000   11.303    0.035 multiplywf.py:107(gradient_laplacian)
      320    0.002    0.000   11.292    0.035 multiplywf.py:108(<listcomp>)
      640    0.002    0.000   10.455    0.016 multiplywf.py:102(gradient_value)
      640    0.002    0.000   10.440    0.016 multiplywf.py:103(<listcomp>)
      320    0.020    0.000    8.585    0.027 slater.py:382(gradient_laplacian)

This is about a 20% improvement in runtime, and a factor of two in eval_gto(). Obviously the accuracy goes down but even with rcut=1, the L's include all the nearest neighbors, which should be roughly good enough. Or maybe it's good enough for optimization, and then for accurate calculations we crank up rcut?

Separately, you can verify that computing the second derivative matrix (as we do for the Laplacian) costs about twice the first derivatives. In QWalk, the laplacian costs about 1/3 of the first derivatives because we compute it directly. If we had a function that computed vgl (value, gradient, laplacian), we could save about a factor of 2 in the kinetic energy calculation. In the above calculation, the time breaks down as:

time for one MC sweep 1.1400545838987455
coulomb energy 0.5836180420592427
ECP energy 0.7497833750676364
Kinetic energy 1.1373411249369383

So reducing the kinetic energy by a factor of 2 would result in a pretty good savings in computational time. This can be done by computing the laplacian of the AOs directly rather than the full Hessian. All the formulas are already in QWalk's source code.

So just reducing rcut and implementing a dedicated Laplacian function in eval_gto.c in pyscf would result in pretty significant gains.

lkwagner commented 8 months ago

Some more data on rcut. The same example as above. This is the time taken per MC sweep for 500 walkers for different values of rcut. Looks like the energy does not really depend on it at least to a few mHa.

rcut

lkwagner commented 7 months ago

This has been partially resolved by the above PR. However, the laplacian optimization is still to do.

WagnerGroup / pyqmc

Implement orbital evaluation #407