MatthewRHermes / mrh

MRH's research code

FCI csf_solver memory usage #48

Open MatthewRHermes opened 10 months ago

MatthewRHermes commented 10 months ago

In two csf_solver functions (mrh.my_pyscf.fci.csf.pspace and mrh.my_pyscf.fci.csf.make_hdiag_csf), memory usage is problematic because Hamiltonian matrix elements in the CSF basis are evaluated in two steps, which requires temporarily constructing arrays whose size is quadratic in the number of corresponding determinants. Massively open-shell, low-spin wave functions become impossible in the current implementation around (16e,16o), because the corresponding determinants are too numerous for the block Hamiltonian to be stored in memory. The relevant arrays should be split into blocks and handled sequentially, as is done for DFT quadrature. In the meantime, segfaults and calculations abruptly killed by cluster daemon processes are symptoms of unmanaged memory usage, and indicate a region of the code where a memory-checking step should be added.
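A minimal sketch of the kind of blocking loop intended, roughly analogous to how DFT quadrature grids are batched. The helper `hamiltonian_block_csf`, the argument names, and the batching heuristic are all hypothetical stand-ins for illustration, not the actual mrh API:

```python
import numpy as np

def make_hdiag_csf_blocked(hamiltonian_block_csf, nconf, ndet_per_conf,
                           max_memory_mb=2000):
    """Illustrative blocking loop (not the existing mrh code): process
    configurations in batches small enough that the (ndet x ndet)
    determinant-basis intermediate fits in max_memory_mb."""
    hdiag_csf = []
    # Budget in units of 8-byte floats; the dominant temporary scales as ndet**2.
    budget = max_memory_mb * 1e6 / 8
    i = 0
    while i < nconf:
        # Grow the batch until the quadratic intermediate would exceed the budget.
        j, ndet = i, 0
        while j < nconf and (ndet + ndet_per_conf[j])**2 <= budget:
            ndet += ndet_per_conf[j]
            j += 1
        j = max(j, i + 1)  # always make progress, even if one conf alone overflows
        # hamiltonian_block_csf stands in for the two-step det -> CSF
        # contraction that is currently done for the whole array at once.
        hdiag_csf.append(hamiltonian_block_csf(i, j))
        i = j
    return np.concatenate(hdiag_csf)
```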

MatthewRHermes commented 10 months ago

Curiously, @valay1 had a segfault here,

2429 ******** <class 'mrh.my_pyscf.fci.csf.FCISolver'> ********
2430 max. cycles = 500
2431 conv_tol = 1e-10
2432 davidson only = False
2433 linear dependence = 1e-14
2434 level shift = 0.001
2435 max iter space = 12
2436 max_memory 512000 MB
2437 nroots = 1
2438 pspace_size = 200
2439 spin = None
2440 DFCASCI/DFCASSCF: density fitting for JK matrix and 2e integral transformation
2441 Start CASCI
2442     CPU time for integral transformation to CAS space     15.27 sec, wall time      0.96 sec
2443     CPU time for jk     17.37 sec, wall time      1.09 sec
2444     CPU time for jk     17.38 sec, wall time      1.09 sec
2445     CPU time for jk     17.29 sec, wall time      1.09 sec
2446     CPU time for jk     17.27 sec, wall time      1.09 sec
2447     CPU time for jk     17.29 sec, wall time      1.09 sec
2448     CPU time for jk     17.23 sec, wall time      1.08 sec
2449     CPU time for jk     17.36 sec, wall time      1.09 sec
2450     CPU time for jk     17.29 sec, wall time      1.09 sec
2451     CPU time for jk     17.29 sec, wall time      1.09 sec
2452     CPU time for jk     17.35 sec, wall time      1.09 sec
2453     CPU time for jk     17.41 sec, wall time      1.09 sec
2454     CPU time for jk     17.42 sec, wall time      1.10 sec
2455     CPU time for jk     17.31 sec, wall time      1.09 sec
2456     CPU time for jk     18.58 sec, wall time      1.26 sec
2457     CPU time for jk     17.38 sec, wall time      1.09 sec
2458     CPU time for jk     17.37 sec, wall time      1.09 sec
2459     CPU time for jk     17.32 sec, wall time      1.09 sec
2460     CPU time for jk     12.50 sec, wall time      0.79 sec
2461     CPU time for df vj and vk    308.40 sec, wall time     19.48 sec
2462 core energy = -4414.24210088297
2463     CPU time for effective h1e in CAS space    308.75 sec, wall time     19.51 sec
2464     CPU time for csf.kernel: throat-clearing      0.00 sec, wall time      0.00 sec
2465     CPU time for csf.kernel: hdiag_det      7.71 sec, wall time      0.75 sec
2466     CPU time for csf.kernel: hdiag_csf    731.38 sec, wall time     54.28 sec
2467     CPU time for csf.kernel: throat-clearing      2.89 sec, wall time      0.31 sec
2468 csf.pspace: Lowest-energy 200 CSFs correspond to 86 configurations which are spanned by 50388 determinants
2469 pspace_size of 200 CSFs -> 50388 determinants requires 24567.7991424 MB > 502162.878464 MB remaining memory
2470     CPU time for csf.pspace: index manipulation      1.85 sec, wall time      2.17 sec

which suggests that the segfault is occurring here: https://github.com/MatthewRHermes/mrh/blob/739a255533b7c03cab9001ca4f3cc6a1a0d2a915/my_pyscf/fci/csf.py#L284-L294 (both timer_debug lines should flush the buffer). That is a call to a standard PySCF library function which performs no allocations, so I am somewhat confused. A misallocated array could also cause a segfault here, but I can't for the life of me see how any of the input arrays could be misallocated.
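Regardless of where exactly the crash happens, a pre-flight memory check would turn it into an intelligible error. A minimal sketch of such a check, assuming the quadratic determinant-basis block dominates; the function name and the safety factor are hypothetical, but `lib.current_memory()` and the solver's `max_memory` attribute are standard PySCF:

```python
from pyscf import lib

def check_pspace_memory(fcisolver, ndet, safety=1.1):
    """Illustrative pre-flight check (not the existing mrh code): estimate the
    size of the ndet x ndet determinant-basis block and raise a MemoryError
    instead of letting a later allocation segfault or get OOM-killed."""
    required_mb = safety * ndet**2 * 8 / 1e6  # one float64 ndet x ndet array
    available_mb = fcisolver.max_memory - lib.current_memory()[0]
    if required_mb > available_mb:
        raise MemoryError('csf.pspace: %d determinants need ~%.0f MB but only '
                          '%.0f MB remain; reduce pspace_size'
                          % (ndet, required_mb, available_mb))
```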

ETA: it's worth pointing out that in this scenario we are demanding that the PySCF library function build a far larger matrix (50,388 determinants vs. the nominal pspace_size of 200) than it was likely ever tested on.
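For scale (this arithmetic is mine, not from the log): a single float64 block over 50,388 determinants is already about 20 GB, the same order of magnitude as the ~24.6 GB the memory check reports, versus well under a megabyte for the nominal 200 x 200 CSF pspace:

```python
ndet, ncsf = 50388, 200
det_gb = ndet**2 * 8 / 1e9   # one float64 ndet x ndet determinant-basis block
csf_gb = ncsf**2 * 8 / 1e9   # the nominal 200 x 200 CSF-basis pspace block
print('%.1f GB vs %.6f GB' % (det_gb, csf_gb))  # ~20.3 GB vs 0.000320 GB
```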