Open willwheelera opened 8 months ago
I did a little bit of profiling using this setup:
import pyscf
import pyscf.pbc
import numpy as np
import pyscf.pbc.dft
cell = pyscf.pbc.gto.Cell()
cell.atom = '''C 0. 0. 0.
C 0.8917 0.8917 0.8917
C 1.7834 1.7834 0.
C 2.6751 2.6751 0.8917
C 1.7834 0. 1.7834
C 2.6751 0.8917 2.6751
C 0. 1.7834 1.7834
C 0.8917 2.6751 2.6751'''
cell.basis = 'ccecp-ccpvdz'
cell.ecp = 'ccecp'
cell.exp_to_discard = 0.2
cell.a = np.eye(3)*3.5668
cell.verbose=3
cell.build()
mf = pyscf.pbc.dft.RKS(cell).density_fit()
mf.chkfile = 'diamond.chk'
mf.kernel()
And this run:
import pyqmc.api as pyq
import time
cell, mf = pyq.recover_pyscf("diamond.chk")
#cell.rcut=1
enacc = pyq.EnergyAccumulator(cell, threshold=10)
wf, to_opt = pyq.generate_wf(cell, mf)
configs = pyq.initial_guess(cell, 500)
start = time.perf_counter()
pyq.vmc(wf, configs, nblocks=1, accumulators={'energy':enacc})
print("time to run VMC", time.perf_counter()-start)
With everything default, the profile looked like:
ncalls tottime percall cumtime percall filename:lineno(function)
1521/1 0.020 0.000 50.200 50.200 {built-in method builtins.exec}
1 0.000 0.000 50.200 50.200 run_vmc2.py:1(<module>)
1 0.000 0.000 47.875 47.875 mc.py:161(vmc)
1 0.015 0.015 47.875 47.875 mc.py:89(vmc_worker)
10 0.000 0.000 31.100 3.110 accumulators.py:53(avg)
10 0.000 0.000 31.099 3.110 accumulators.py:30(__call__)
3521 0.031 0.000 27.868 0.008 orbitals.py:321(aos)
3521 27.326 0.008 27.617 0.008 eval_gto.py:31(eval_gto)
10 0.003 0.000 15.370 1.537 energy.py:44(kinetic)
320 0.003 0.000 15.364 0.048 multiplywf.py:107(gradient_laplacian)
320 0.002 0.000 15.353 0.048 multiplywf.py:108(<listcomp>)
640 0.002 0.000 15.326 0.024 multiplywf.py:102(gradient_value)
640 0.002 0.000 15.311 0.024 multiplywf.py:103(<listcomp>)
320 0.019 0.000 12.614 0.039 slater.py:382(gradient_laplacian)
640 0.015 0.000 11.535 0.018 slater.py:353(gradient_value)
10 0.012 0.001 9.714 0.971 eval_ecp.py:6(ecp)
2560 0.086 0.000 9.703 0.004 eval_ecp.py:62(ecp_ea)
2560 0.006 0.000 8.006 0.003 multiplywf.py:92(testvalue)
2560 0.005 0.000 7.976 0.003 multiplywf.py:94(<listcomp>)
Setting rcut = 1 resulted in:
ncalls tottime percall cumtime percall filename:lineno(function)
1521/1 0.019 0.000 38.562 38.562 {built-in method builtins.exec}
1 0.000 0.000 38.562 38.562 run_vmc2.py:1(<module>)
1 0.000 0.000 36.018 36.018 mc.py:161(vmc)
1 0.016 0.016 36.018 36.018 mc.py:89(vmc_worker)
10 0.000 0.000 24.368 2.437 accumulators.py:53(avg)
10 0.000 0.000 24.368 2.437 accumulators.py:30(__call__)
3521 0.030 0.000 16.374 0.005 orbitals.py:321(aos)
3521 15.843 0.004 16.126 0.005 eval_gto.py:31(eval_gto)
10 0.003 0.000 11.309 1.131 energy.py:44(kinetic)
320 0.004 0.000 11.303 0.035 multiplywf.py:107(gradient_laplacian)
320 0.002 0.000 11.292 0.035 multiplywf.py:108(<listcomp>)
640 0.002 0.000 10.455 0.016 multiplywf.py:102(gradient_value)
640 0.002 0.000 10.440 0.016 multiplywf.py:103(<listcomp>)
320 0.020 0.000 8.585 0.027 slater.py:382(gradient_laplacian)
This is about a 20% improvement in runtime, and a factor of two in eval_gto(). Obviously the accuracy goes down but even with rcut=1, the L's include all the nearest neighbors, which should be roughly good enough. Or maybe it's good enough for optimization, and then for accurate calculations we crank up rcut?
Separately, you can verify that computing the second derivative matrix (as we do for the Laplacian) costs about twice the first derivatives. In QWalk, the laplacian costs about 1/3 of the first derivatives because we compute it directly. If we had a function that computed vgl (value, gradient, laplacian), we could save about a factor of 2 in the kinetic energy calculation. In the above calculation, the time breaks down as:
time for one MC sweep 1.1400545838987455
coulomb energy 0.5836180420592427
ECP energy 0.7497833750676364
Kinetic energy 1.1373411249369383
So reducing the kinetic energy by a factor of 2 would result in a pretty good savings in computational time. This can be done by computing the laplacian of the AOs directly rather than the full Hessian. All the formulas are already in QWalk's source code.
So just reducing rcut and implementing a dedicated Laplacian function in eval_gto.c in pyscf would result in pretty significant gains.
Some more data on rcut. The same example as above. This is the time taken per MC sweep for 500 walkers for different values of rcut. Looks like the energy does not really depend on it at least to a few mHa.
This has been partially resolved by the above PR. However, the laplacian optimization is still to do.
PySCF's eval_gto seems to be quite slow. Code for other methods doesn't call it frequently, so it hasn't been a bottleneck except in QMC. We can probably code it up in a faster way.
See https://pyscf.org/user/gto.html#basis-format for basis format definitions in PySCF.
I've gotten started trying this out -- we need tests to make sure the new functions are consistent with the old ones, make sure all cases are implemented, and we need profiling to see where we can get performance improvements. Then we also need to implement PBC evaluations.