JoonhoLee-Group / ipie

ipie stands for Intelligent Python-based Imaginary-time Evolution with a focus on simplicity and speed.
Apache License 2.0

output returns nan #325

Open davidev886 opened 1 week ago

davidev886 commented 1 week ago

I am running ipie on GPU with the input from this repo. The simulation uses a single Slater determinant trial. After ~100 blocks the output becomes a list of nan.

What could be the cause of this?

Thanks!

@zohimchandani

jiangtong1000 commented 1 week ago

Hi, there may be multiple causes. To narrow it down, can you run this system on the CPU platform first and see if the issue still happens? The system is not big, so a CPU calculation is feasible.
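
For reference, forcing the CPU path in a driver script like the one attached further down only requires leaving ipie's GPU option off; a minimal sketch, assuming the same config mechanism used in that script:

from ipie.config import config

# keep the default CPU (numpy) backend; everything else in the driver stays the same
config.update_option("use_gpu", False)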

davidev886 commented 1 week ago

I ran the same input on CPU only and got past the block where I saw the nan before. Here is the output.

@zohimchandani

jiangtong1000 commented 1 week ago

@davidev886

Hi, I was not able to reproduce your issue on my machine with the newest code. Below is the output.

Maybe unrelated, but I noticed your log file shows 8 GPUs but only 1 MPI rank, and that 20+ GB of GPU memory was already in use before AFQMC started. I suspect the ipie calculation is an intermediate step of a larger workflow, so I am not sure whether you are running the up-to-date ipie code or whether it carries customized modifications. Maybe you can try running a pure ipie GPU calculation with the up-to-date ipie package; my output and the script I used are attached below, followed by a short standalone check of the rank-to-GPU mapping.

# random seed is 69963997
# Using pair_branch population control algorithm.
# target weight is 2000
# total weight is 2000
# ipie version: 0.7.1
# Git hash: 4b60ce0c923b90e8492b35ffb0e12600d1476ff4.
# Git branch: develop.
# Calculation uuid: 89f48e9e-9e51-11ef-a874-7c8ae1d2fe23.
# Approximate memory available per node: 503.0256 GB.
# Running on 1 MPI rank.
# Root processor name: holygpu7c26304.rc.fas.harvard.edu
# Python interpreter: 3.8.18 (default, Sep 11 2023, 13:40:15)  [GCC 11.2.0]
# Using numpy v1.24.4 from: /n/home06/tjiang/anaconda3/envs/ipie_cuda_awr/lib/python3.8/site-packages/numpy.
# - BLAS lib: openblas64_ openblas64_
# - BLAS dir: /usr/local/lib
# Using scipy v1.10.1 from: /n/home06/tjiang/anaconda3/envs/ipie_cuda_awr/lib/python3.8/site-packages/scipy.
# Using h5py v3.10.0 from: /n/home06/tjiang/anaconda3/envs/ipie_cuda_awr/lib/python3.8/site-packages/h5py.
# Using mpi4py v3.1.5 from: /n/home06/tjiang/anaconda3/envs/ipie_cuda_awr/lib/python3.8/site-packages/mpi4py.
# - mpicc: /n/sw/helmod-rocky8/apps/Comp/intel/23.0.0-fasrc01/openmpi/4.1.5-fasrc02/bin/mpicc
# Using cupy v10.6.0 from: /n/home06/tjiang/anaconda3/envs/ipie_cuda_awr/lib/python3.8/site-packages/cupy.
# - CUDA compute capability: 8.0
# - CUDA version: 11.03.0
# - GPU Type: 'NVIDIA A100-SXM4-40GB MIG 3g.20gb'
# - GPU Mem: 19.625 GB
# - Number of GPUs: 1
# MPI communicator : <class 'mpi4py.MPI.Intracomm'>
# Available memory on the node is 503.026 GB
# PhaselessGeneric: expected to allocate 0.0 GB
# PhaselessGeneric: using 2.8848876953125 GB out of 19.625 GB memory on GPU
# GenericRealChol: expected to allocate 0.13464972376823425 GB
# GenericRealChol: using 2.8848876953125 GB out of 19.625 GB memory on GPU
# SingleDet: expected to allocate 0.12498036026954651 GB
# SingleDet: using 2.8848876953125 GB out of 19.625 GB memory on GPU
# UHFWalkers: expected to allocate 0.6327738761901855 GB
# UHFWalkers: using 2.8848876953125 GB out of 19.625 GB memory on GPU
# Setting up estimator object.
# Writing estimator data to estimates.0.h5
# Finished settting up estimator object.
            Block                   Weight            WeightFactor            HybridEnergy                  ENumer                  EDenom                  ETotal                  E1Body                  E2Body
                0   2.0000000000000000e+03  2.0000000000000000e+03  0.0000000000000000e+00 -4.2474133767941929e+06  2.0000000000000000e+03 -2.1237066883970965e+03 -4.6825098198329779e+03  2.5588031314358814e+03
                1   1.1188629747355862e+12  2.6243216080320000e+13 -1.1616482970636268e+03 -4.2478920907418597e+06  2.0000000000000000e+03 -2.1239460453709298e+03 -4.6825140601661415e+03  2.5585680147952107e+03
.............
...........
               96   1.9918222841703948e+03  1.9922430847925423e+03 -1.1624895380279390e+03 -4.2495081613450758e+06  2.0000000000000000e+03 -2.1247540806725378e+03 -4.6828761881024957e+03  2.5581221074299579e+03
               97   1.9923872762499157e+03  1.9851204336501698e+03 -1.1624738695288606e+03 -4.2494981698897909e+06  2.0000000000000000e+03 -2.1247490849448955e+03 -4.6829300446362340e+03  2.5581809596913381e+03
               98   1.9914227335997296e+03  1.9876618050567181e+03 -1.1624803973736612e+03 -4.2494754285428245e+06  2.0000000000000000e+03 -2.1247377142714122e+03 -4.6828745832370314e+03  2.5581368689656192e+03
               99   1.9919000207066460e+03  1.9879594245308344e+03 -1.1624737401938430e+03 -4.2494945478917686e+06  2.0000000000000000e+03 -2.1247472739458844e+03 -4.6829494082471147e+03  2.5582021343012302e+03
              100   1.9920282215226082e+03  1.9909157125524450e+03 -1.1624692903201744e+03 -4.2494980338849565e+06  2.0000000000000000e+03 -2.1247490169424782e+03 -4.6829051661796984e+03  2.5581561492372202e+03
              101   1.9927240486379644e+03  1.9869659110822454e+03 -1.1624986256499319e+03 -4.2494899236370688e+06  2.0000000000000002e+03 -2.1247449618185342e+03 -4.6829569334110993e+03  2.5582119715925651e+03
              102   1.9921744324390111e+03  1.9921553172571541e+03 -1.1624782560979077e+03 -4.2494805103011113e+06  2.0000000000000000e+03 -2.1247402551505556e+03 -4.6829118792021036e+03  2.5581716240515475e+03
              103   1.9919194526291162e+03  1.9886685302439294e+03 -1.1624692767333779e+03 -4.2494663171149315e+06  2.0000000000000000e+03 -2.1247331585574657e+03 -4.6828729259196616e+03  2.5581397673621955e+03
....
....
              576   1.9914044919616053e+03  1.9843224008256011e+03 -1.1624650075555649e+03 -4.2494716990706921e+06  2.0000000000000000e+03 -2.1247358495353460e+03 -4.6832263916907086e+03  2.5584905421553631e+03
              577   1.9919773197172638e+03  1.9870035856242632e+03 -1.1624708853646089e+03 -4.2494465537481448e+06  2.0000000000000000e+03 -2.1247232768740723e+03 -4.6832044314074465e+03  2.5584811545333741e+03
              578   1.9913246259668103e+03  1.9880228756451565e+03 -1.1624664794876614e+03 -4.2494365723911617e+06  2.0000000000000000e+03 -2.1247182861955807e+03 -4.6832491078673120e+03  2.5585308216717308e+03
              579   1.9916341948312140e+03  1.9851045657565239e+03 -1.1624694264774987e+03 -4.2494449266646253e+06  2.0000000000000000e+03 -2.1247224633323126e+03 -4.6832948066610234e+03  2.5585723433287094e+03
              580   1.9905264583728278e+03  1.9836430561744271e+03 -1.1624414342396587e+03 -4.2494620954624563e+06  2.0000000000000000e+03 -2.1247310477312280e+03 -4.6833591506404955e+03  2.5586281029092675e+03
import os
import sys

import cupy
import h5py
import numpy
import numpy as np
from pyscf import cc, gto, scf

from ipie.config import config, MPI
from ipie.hamiltonians.generic import Generic as HamGeneric
from ipie.qmc.afqmc import AFQMC
from ipie.systems.generic import Generic
from ipie.trial_wavefunction.single_det import SingleDet
from ipie.utils.mpi import MPIHandler
from ipie.walkers.uhf_walkers import UHFWalkers

# Load Hamiltonian data
with h5py.File('tutorial_vqe/ipie_sd/hamiltonian.h5', 'r') as f:
    chol = f['LXmn'][:]
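    # 'LXmn' is stored as (nchol, M, M); reorder to (M, M, nchol) and flatten
    # to (M*M, nchol) before passing it to HamGeneric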
    chol = numpy.transpose(chol, (1, 2, 0))
    num_basis = chol.shape[0]
    nchol = chol.shape[-1]
    chol = chol.reshape((num_basis * num_basis, -1))
    e0 = f['e0'][()]
    hcore = f['hcore'][:]

# Load wavefunction data
with h5py.File('tutorial_vqe/ipie_sd/wavefunction.h5', 'r') as f:
    phi0a = f['phi0_alpha'][:]
    phi0b = f['phi0_beta'][:]
    psiT_a = f['psi_T_alpha'][:]
    psiT_b = f['psi_T_beta'][:]

# Configure GPU usage
config.update_option("use_gpu", True)

# Setup system parameters
num_basis = hcore.shape[-1]
nup, ndown = phi0a.shape[1], phi0b.shape[1]
mol_nelec = (nup, ndown)
system = Generic(nelec=mol_nelec)

# Setup Hamiltonian
h1e = numpy.array([hcore, hcore])
ham = HamGeneric(h1e, chol, e0)

# Setup trial wavefunction
handler = MPIHandler()
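# the single-determinant trial takes the alpha and beta occupied orbital blocks
# stacked side by side, i.e. shape (num_basis, nup + ndown)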
trial = SingleDet(numpy.hstack([psiT_a, psiT_b]), mol_nelec, num_basis, handler)
trial.build()
trial.half_rotate(ham)

# Setup AFQMC parameters
rng_seed = 69963997
num_walkers = 2000 // handler.size
nsteps = 25
nblocks = 2000
timestep = 0.005

# Setup walkers
walkers = UHFWalkers(numpy.hstack([phi0a, phi0b]), system.nup, system.ndown, ham.nbasis, num_walkers, mpi_handler=handler)
walkers.build(trial)

# Run AFQMC
afqmc = AFQMC.build(
    mol_nelec,
    ham,
    trial,
    walkers,
    num_walkers,
    rng_seed,
    nsteps,
    nblocks,
    timestep,
    mpi_handler=handler)

afqmc.run()
afqmc.finalise(verbose=True)
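
For the 8-GPU / 1-rank observation above, a quick standalone check (a sketch, not part of ipie; it only assumes mpi4py and cupy are available) can show how MPI ranks map to visible devices and how much GPU memory other processes already hold before AFQMC starts:

from mpi4py import MPI
import cupy

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

ngpu = cupy.cuda.runtime.getDeviceCount()
cupy.cuda.Device(rank % ngpu).use()  # one rank per device, round-robin
free_b, total_b = cupy.cuda.runtime.memGetInfo()
print(f"rank {rank}/{size}: device {rank % ngpu} of {ngpu}, "
      f"{(total_b - free_b) / 1024**3:.2f} GB already in use out of {total_b / 1024**3:.2f} GB")

If this reports memory in use before ipie allocates anything, another process in the workflow is holding it.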