block-hczhai / block2-preview

Efficient parallel quantum chemistry DMRG in MPO formalism
GNU General Public License v3.0

CPU usage on cluster #129

Closed Shovan-Physics closed 2 weeks ago

Shovan-Physics commented 3 weeks ago

I am finding unusually low CPU usage when running many block2 jobs at the same time on a cluster.

I have block2==0.5.3rc19 installed with Python 3.11 on CentOS 7. Each node is an HPE DL385 Gen10 rackmount server with 2 x 2nd-generation AMD EPYC Rome 7302 processors (16 cores, 3.0 GHz, 128 MB cache) and at least 256 GB RAM. I use PBS to submit an array of 1000 jobs with #PBS -l nodes=1:ppn=1 and #PBS -J 1-1000:1. This places 32 jobs on each node, each using a different scratch directory. However, logging into the compute nodes, I find that each job uses far less than 100% CPU, as shown below:

[Screenshot: per-job CPU usage of block2 jobs on a compute node]

The usage increases slowly but remains low for most of the runtime (~10 min of CPU time). However, if I submit just one job, the usage quickly reaches 100%. Even if I submit 32 jobs to a specific node, the usage is better than shown in the figure above.

I'm not sure where the problem lies. Memory should not be an issue, as the peak RAM usage for each job is 1 GB.

The job script runs DMRG for a Heisenberg model for all values of the total spin (see below; I have omitted some print statements and the calculation of correlations). I did not specify n_threads in DMRGDriver, assuming it would be set automatically based on ppn=1. The problem persists if I set n_threads=1 explicitly.

from pyblock2.driver.core import DMRGDriver, SymmetryTypes
import argparse
import pandas as pd
import json
import time
import csv
import resource

def hamil_Heisenberg_Spin(driver, bonds):
    # Build the Heisenberg MPO H = sum_<ij> S_i . S_j over the given bonds;
    # in block2 SU(2) notation each bond term is "(T+T)0" with coefficient -sqrt(3)/2.
    os = driver.expr_builder()
    numbonds = bonds.shape[0]
    for i in range(numbonds):
        os.add_term("(T+T)0", bonds[i].tolist(), - 3**0.5 / 2)
    mpo = driver.get_mpo(os.finalize(adjust_order=False))
    return mpo

.... Read inputs ....

bonds = pd.read_csv(graphname+'_bonds.csv', header=None).values - 1      # pairs of sites
L = max(bonds.flatten()) + 1
numbonds = bonds.shape[0]

print('\nFound graph with', L, 'sites and', numbonds, 'bonds:\n', bonds.tolist())

block2paramfile = 'block2_AFM_dmrgparams.json'
with open(block2paramfile, 'r') as f:
    dmrgparams = json.load(f)[block2paramID]
n_sweeps = dmrgparams['n_sweeps']
bond_dims = dmrgparams['bond_dims']
noises = dmrgparams['noises']
dav_max_iter = dmrgparams['dav_max_iter']
thrds = dmrgparams['thrds']

driver = DMRGDriver(scratch="./tmp", symm_type=SymmetryTypes.SU2)

final_energies = []
final_bonddims = []
dmrg_times = []
energy_evol = []
energy_min = numbonds      # upper bound on the energy, so the first sector always updates the minimum

for twoS in range(L, -1, -2):      # loop over total-spin sectors 2S = L, L-2, ...
    driver.initialize_system(n_sites=L, heis_twos=1, spin=twoS)
    mpo = hamil_Heisenberg_Spin(driver, bonds)
    mps = driver.get_random_mps(tag="KET", bond_dim=bond_dims[0], nroots=1)
    dmrg_begin = time.time()
    energy = driver.dmrg(
        mpo, 
        mps, 
        n_sweeps = n_sweeps, 
        bond_dims = bond_dims, 
        noises = noises, 
        dav_max_iter = dav_max_iter, 
        thrds = thrds, 
        iprint = printsweeps
    )
    dmrg_end = time.time()
    print('Final energy for 2S = %02d' % twoS, 'is %10.8f' % energy)
    final_energies.append(energy)
    final_bonddims.append(mps.info.get_max_bond_dimension())
    dmrg_times.append(dmrg_end - dmrg_begin)
    energy_evol.append(driver.get_dmrg_results()[2][:,0])
    if energy < energy_min:
        twoS_min = twoS
        energy_min = energy
        mps_min = driver.copy_mps(mps, tag="KET0")

print('\nCalculating spin correlations in the lowest-energy state')

driver.initialize_system(n_sites=L, heis_twos=1, spin=twoS_min)
corrvals = []
for i in range(L):
    for j in range(i+1,L):
        os = driver.expr_builder().add_term("(T+T)0", [i, j], - 3**0.5 / 2)
        op = driver.get_mpo(os.finalize(adjust_order=False))
        opexp = driver.expectation(mps_min, op, mps_min)
        corrvals.append(opexp)

.... Calculate correlations and save output ....
Shovan-Physics commented 3 weeks ago

A similar issue occurs on a different cluster with 2 x Intel Xeon Platinum 8268 processors (24 cores, 2.9 GHz) per node and 192 GB RAM, running Linux (CentOS 7.9). The wall time is more than twice the CPU time for completed jobs submitted via SLURM with --ntasks-per-node=1.

hczhai commented 3 weeks ago
  1. Make sure the iprint parameter in driver.dmrg is exactly 2, and then provide a sample of the DMRG sweep output for: (a) the case when the job is running slowly; (b) the case when the job is running normally (using 100% CPU). You can attach files.
  2. Please provide the following: (a) the full filesystem path to the job script; (b) the full filesystem path to the python script; (c) the full filesystem path to the scratch directory for job 1; (d) the full filesystem path to the scratch directory for job 2; (e) a screenshot for ls -l <the scratch directory for job 1>; (f) a screenshot for ls -l <the scratch directory for job 2>.
Shovan-Physics commented 3 weeks ago
  1. Below are sample outputs for the two cases; the full output files are also attached. (a) Job running slowly: this is when I submit 1000 jobs with #PBS -l nodes=1:ppn=1 and #PBS -J 1-1000:1. The total wall time for this job was five times longer than that of the normal job.

    Sweep =   10 | Direction =  forward | Bond dimension =  128 | Noise =  0.00e+00 | Dav threshold =  1.00e-04
    --> Site =    0-   1 .. Mmps =    1 Ndav =   1 E =    -10.3119640413 Error = 0.00e+00 FLOPS = 1.27e+07 Tdav = 0.00 T = 0.02
    --> Site =    1-   2 .. Mmps =    2 Ndav =   1 E =    -10.3119640413 Error = 0.00e+00 FLOPS = 3.33e+07 Tdav = 0.00 T = 0.03
    --> Site =    2-   3 .. Mmps =    3 Ndav =   1 E =    -10.3119640413 Error = 0.00e+00 FLOPS = 1.88e+08 Tdav = 0.00 T = 0.02
    --> Site =    3-   4 .. Mmps =    6 Ndav =   1 E =    -10.3119640413 Error = 0.00e+00 FLOPS = 5.58e+08 Tdav = 0.00 T = 0.03
    --> Site =    4-   5 .. Mmps =   10 Ndav =   1 E =    -10.3119640413 Error = 0.00e+00 FLOPS = 2.17e+09 Tdav = 0.00 T = 0.02
    --> Site =    5-   6 .. Mmps =   20 Ndav =   1 E =    -10.3119640413 Error = 0.00e+00 FLOPS = 3.28e+09 Tdav = 0.00 T = 0.67
    --> Site =    6-   7 .. Mmps =   35 Ndav =   1 E =    -10.3119640413 Error = 0.00e+00 FLOPS = 3.78e+09 Tdav = 0.00 T = 0.54
    --> Site =    7-   8 .. Mmps =   70 Ndav =   2 E =    -10.3119756621 Error = 0.00e+00 FLOPS = 6.98e+09 Tdav = 0.00 T = 0.03
    --> Site =    8-   9 .. Mmps =  115 Ndav =   2 E =    -10.3120232032 Error = 1.73e-21 FLOPS = 9.16e+09 Tdav = 0.01 T = 0.04
    --> Site =    9-  10 .. Mmps =  128 Ndav =   2 E =    -10.3120982585 Error = 8.06e-10 FLOPS = 1.09e+10 Tdav = 0.01 T = 0.06
    --> Site =   10-  11 .. Mmps =  128 Ndav =   3 E =    -10.3123109084 Error = 3.41e-08 FLOPS = 1.08e+10 Tdav = 0.02 T = 0.05
    --> Site =   11-  12 .. Mmps =  128 Ndav =   2 E =    -10.3123742117 Error = 6.36e-08 FLOPS = 1.11e+10 Tdav = 0.01 T = 0.05
    --> Site =   12-  13 .. Mmps =  128 Ndav =   3 E =    -10.3125930047 Error = 9.20e-08 FLOPS = 1.10e+10 Tdav = 0.02 T = 0.06
    --> Site =   13-  14 .. Mmps =  128 Ndav =   3 E =    -10.3128444200 Error = 7.94e-08 FLOPS = 1.08e+10 Tdav = 0.02 T = 0.06
    --> Site =   14-  15 .. Mmps =  128 Ndav =   3 E =    -10.3134016406 Error = 8.59e-08 FLOPS = 1.09e+10 Tdav = 0.02 T = 0.07
    --> Site =   15-  16 .. Mmps =  128 Ndav =   3 E =    -10.3136685023 Error = 1.23e-07 FLOPS = 1.08e+10 Tdav = 0.02 T = 0.06
    --> Site =   16-  17 .. Mmps =  128 Ndav =   3 E =    -10.3141761214 Error = 2.95e-08 FLOPS = 1.04e+10 Tdav = 0.02 T = 0.06
    --> Site =   17-  18 .. Mmps =  128 Ndav =   2 E =    -10.3142479911 Error = 4.64e-08 FLOPS = 9.21e+09 Tdav = 0.02 T = 0.05
    --> Site =   18-  19 .. Mmps =  128 Ndav =   3 E =    -10.3144300449 Error = 1.13e-08 FLOPS = 8.53e+09 Tdav = 0.03 T = 0.07
    --> Site =   19-  20 .. Mmps =  128 Ndav =   3 E =    -10.3145818553 Error = 5.08e-09 FLOPS = 7.03e+09 Tdav = 0.03 T = 0.09
    --> Site =   20-  21 .. Mmps =  128 Ndav =   1 E =    -10.3145818211 Error = 2.92e-10 FLOPS = 9.28e+09 Tdav = 0.01 T = 0.04
    --> Site =   21-  22 .. Mmps =  128 Ndav =   1 E =    -10.3145818195 Error = 1.38e-09 FLOPS = 8.34e+09 Tdav = 0.00 T = 0.04
    --> Site =   22-  23 .. Mmps =  128 Ndav =   1 E =    -10.3145818182 Error = 4.14e-20 FLOPS = 6.80e+09 Tdav = 0.00 T = 0.03
    --> Site =   23-  24 .. Mmps =   88 Ndav =   1 E =    -10.3145818182 Error = 2.84e-20 FLOPS = 4.88e+09 Tdav = 0.00 T = 0.03
    --> Site =   24-  25 .. Mmps =   48 Ndav =   1 E =    -10.3145818182 Error = 1.55e-20 FLOPS = 3.02e+09 Tdav = 0.00 T = 0.02
    --> Site =   25-  26 .. Mmps =   25 Ndav =   1 E =    -10.3145818182 Error = 7.79e-21 FLOPS = 9.79e+08 Tdav = 0.00 T = 0.02
    --> Site =   26-  27 .. Mmps =   17 Ndav =   1 E =    -10.3145818182 Error = 6.33e-21 FLOPS = 1.94e+08 Tdav = 0.00 T = 0.08
    --> Site =   27-  28 .. Mmps =    8 Ndav =   1 E =    -10.3145818182 Error = 3.12e-34 FLOPS = 3.95e+07 Tdav = 0.00 T = 0.02
    --> Site =   28-  29 .. Mmps =    4 Ndav =   1 E =    -10.3145818182 Error = 3.46e-33 FLOPS = 5.24e+06 Tdav = 0.00 T = 0.35
    Time elapsed =     15.178 | E =     -10.3145818553 | DE = -2.49e-03 | DW = 1.22842e-07
    Time sweep =        2.692 | 2.36 GFLOP/SWP
    | Dmem = 1.17 MB (20%) | Imem = 14.0 KB (96%) | Hmem = 4.19 MB | Wmem = 122 KB | Pmem = 0 B
    | Tread = 0.041 | Twrite = 0.622 | Tfpread = 0.020 | Tfpwrite = 0.018 | Tmporead = 0.000 | Tasync = 0.000
    | data = 12.8 MB | cpsd = 9.99 MB
    | Trot = 0.034 | Tctr = 0.002 | Tint = 0.000 | Tmid = 0.000 | Tdctr = 0.032 | Tdiag = 0.008 | Tinfo = 0.018
    | Teff = 0.106 | Tprt = 0.000 | Teig = 0.263 | Tblk = 1.997 | Tmve = 0.692 | Tdm = 0.002 | Tsplt = 0.025 | Tsvd = 0.000 | Torth = 0.000

    (b) Job running normally (~100% CPU): this is when I submit only 5 jobs with #PBS -l nodes=cn018:ppn=1 and #PBS -J 1-5:1.

    Sweep =   10 | Direction =  forward | Bond dimension =  128 | Noise =  0.00e+00 | Dav threshold =  1.00e-04
    --> Site =    0-   1 .. Mmps =    1 Ndav =   1 E =    -10.3119640413 Error = 0.00e+00 FLOPS = 2.11e+07 Tdav = 0.00 T = 0.00
    --> Site =    1-   2 .. Mmps =    2 Ndav =   1 E =    -10.3119640413 Error = 0.00e+00 FLOPS = 8.78e+07 Tdav = 0.00 T = 0.01
    --> Site =    2-   3 .. Mmps =    3 Ndav =   1 E =    -10.3119640413 Error = 0.00e+00 FLOPS = 3.50e+08 Tdav = 0.00 T = 0.01
    --> Site =    3-   4 .. Mmps =    6 Ndav =   1 E =    -10.3119640413 Error = 0.00e+00 FLOPS = 1.18e+09 Tdav = 0.00 T = 0.01
    --> Site =    4-   5 .. Mmps =   10 Ndav =   1 E =    -10.3119640413 Error = 0.00e+00 FLOPS = 3.03e+09 Tdav = 0.00 T = 0.01
    --> Site =    5-   6 .. Mmps =   20 Ndav =   1 E =    -10.3119640413 Error = 0.00e+00 FLOPS = 4.19e+09 Tdav = 0.00 T = 0.01
    --> Site =    6-   7 .. Mmps =   35 Ndav =   1 E =    -10.3119640413 Error = 0.00e+00 FLOPS = 5.13e+09 Tdav = 0.00 T = 0.01
    --> Site =    7-   8 .. Mmps =   70 Ndav =   2 E =    -10.3119756621 Error = 0.00e+00 FLOPS = 7.66e+09 Tdav = 0.00 T = 0.01
    --> Site =    8-   9 .. Mmps =  115 Ndav =   2 E =    -10.3120232032 Error = 1.73e-21 FLOPS = 9.86e+09 Tdav = 0.01 T = 0.02
    --> Site =    9-  10 .. Mmps =  128 Ndav =   2 E =    -10.3120982585 Error = 8.06e-10 FLOPS = 1.16e+10 Tdav = 0.01 T = 0.02
    --> Site =   10-  11 .. Mmps =  128 Ndav =   3 E =    -10.3123109084 Error = 3.41e-08 FLOPS = 1.28e+10 Tdav = 0.02 T = 0.03
    --> Site =   11-  12 .. Mmps =  128 Ndav =   2 E =    -10.3123742117 Error = 6.36e-08 FLOPS = 1.27e+10 Tdav = 0.01 T = 0.03
    --> Site =   12-  13 .. Mmps =  128 Ndav =   3 E =    -10.3125930047 Error = 9.20e-08 FLOPS = 1.26e+10 Tdav = 0.02 T = 0.04
    --> Site =   13-  14 .. Mmps =  128 Ndav =   3 E =    -10.3128444200 Error = 7.94e-08 FLOPS = 1.30e+10 Tdav = 0.02 T = 0.04
    --> Site =   14-  15 .. Mmps =  128 Ndav =   3 E =    -10.3134016406 Error = 8.59e-08 FLOPS = 1.28e+10 Tdav = 0.02 T = 0.04
    --> Site =   15-  16 .. Mmps =  128 Ndav =   3 E =    -10.3136685023 Error = 1.23e-07 FLOPS = 1.31e+10 Tdav = 0.02 T = 0.03
    --> Site =   16-  17 .. Mmps =  128 Ndav =   3 E =    -10.3141761214 Error = 2.95e-08 FLOPS = 1.27e+10 Tdav = 0.02 T = 0.03
    --> Site =   17-  18 .. Mmps =  128 Ndav =   2 E =    -10.3142479911 Error = 4.64e-08 FLOPS = 1.23e+10 Tdav = 0.01 T = 0.03
    --> Site =   18-  19 .. Mmps =  128 Ndav =   3 E =    -10.3144300449 Error = 1.13e-08 FLOPS = 1.23e+10 Tdav = 0.02 T = 0.04
    --> Site =   19-  20 .. Mmps =  128 Ndav =   3 E =    -10.3145818553 Error = 5.08e-09 FLOPS = 1.33e+10 Tdav = 0.02 T = 0.03
    --> Site =   20-  21 .. Mmps =  128 Ndav =   1 E =    -10.3145818211 Error = 2.92e-10 FLOPS = 1.13e+10 Tdav = 0.01 T = 0.02
    --> Site =   21-  22 .. Mmps =  128 Ndav =   1 E =    -10.3145818195 Error = 1.38e-09 FLOPS = 1.07e+10 Tdav = 0.00 T = 0.02
    --> Site =   22-  23 .. Mmps =  128 Ndav =   1 E =    -10.3145818182 Error = 4.14e-20 FLOPS = 8.76e+09 Tdav = 0.00 T = 0.01
    --> Site =   23-  24 .. Mmps =   88 Ndav =   1 E =    -10.3145818182 Error = 2.84e-20 FLOPS = 6.85e+09 Tdav = 0.00 T = 0.01
    --> Site =   24-  25 .. Mmps =   48 Ndav =   1 E =    -10.3145818182 Error = 1.55e-20 FLOPS = 4.33e+09 Tdav = 0.00 T = 0.01
    --> Site =   25-  26 .. Mmps =   25 Ndav =   1 E =    -10.3145818182 Error = 7.79e-21 FLOPS = 1.36e+09 Tdav = 0.00 T = 0.01
    --> Site =   26-  27 .. Mmps =   17 Ndav =   1 E =    -10.3145818182 Error = 6.33e-21 FLOPS = 3.75e+08 Tdav = 0.00 T = 0.00
    --> Site =   27-  28 .. Mmps =    8 Ndav =   1 E =    -10.3145818182 Error = 3.12e-34 FLOPS = 1.11e+08 Tdav = 0.00 T = 0.01
    --> Site =   28-  29 .. Mmps =    4 Ndav =   1 E =    -10.3145818182 Error = 3.46e-33 FLOPS = 7.16e+06 Tdav = 0.00 T = 0.01
    Time elapsed =      3.948 | E =     -10.3145818553 | DE = -2.49e-03 | DW = 1.22842e-07
    Time sweep =        0.551 | 2.36 GFLOP/SWP
    | Dmem = 1.17 MB (20%) | Imem = 14.0 KB (96%) | Hmem = 4.19 MB | Wmem = 122 KB | Pmem = 0 B
    | Tread = 0.030 | Twrite = 0.058 | Tfpread = 0.017 | Tfpwrite = 0.018 | Tmporead = 0.000 | Tasync = 0.000
    | data = 12.8 MB | cpsd = 9.99 MB
    | Trot = 0.021 | Tctr = 0.001 | Tint = 0.000 | Tmid = 0.000 | Tdctr = 0.020 | Tdiag = 0.005 | Tinfo = 0.012
    | Teff = 0.081 | Tprt = 0.000 | Teig = 0.206 | Tblk = 0.450 | Tmve = 0.099 | Tdm = 0.001 | Tsplt = 0.019 | Tsvd = 0.000 | Torth = 0.000

    Below are the full outputs: (Note: I had to change the extension from .out to .txt in order to attach.) slow_job_output.txt normal_job_output.txt

  2. (a) full filesystem path to the job scripts:

    /data/home/shovan.dutta/AFM_Networks/block2/scripts/block2_cpu_usage_numjobs_1000_pbs.sh
    /data/home/shovan.dutta/AFM_Networks/block2/scripts/block2_cpu_usage_numjobs_5_pbs.sh

    (b) full filesystem path to the python script -- it is first copied into the (parent) scratch directory before running:

    /data/home/shovan.dutta/AFM_Networks/block2/block2_AFM_Networks.py

    (c) full filesystem path to the scratch directory for the slow job:

    /data/home/shovan.dutta/AFM_Networks/block2/scratch/8753[1].hpc2020/tmp

    (d) full filesystem path to the scratch directory for the normal job:

    /data/home/shovan.dutta/AFM_Networks/block2/scratch/8751[1].hpc2020/tmp

    (e1) screenshot of the parent scratch directory for the slow job: (I call python3.11 from this directory.)

    [Screenshot: parent scratch directory for the 1000-job submission]

    (f1) screenshot of the parent scratch directory for the normal job:

    [Screenshot: parent scratch directory for the 5-job submission]

    (e2), (f2) The output of ls -l for the scratch ("tmp") directories is attached below: slow_job_scratch_contents.txt normal_job_scratch_contents.txt

hczhai commented 3 weeks ago

As explained in the documentation, the scratch directory should be set to a location on the high-I/O-speed scratch filesystem of the cluster, rather than somewhere under /data/home. When thousands of jobs write to the same low-I/O-speed filesystem simultaneously, they are constantly blocked by the slow I/O operations.

To solve this problem:

  1. If you have a high-speed scratch filesystem in your computer, set the scratch directory under the real scratch filesystem.
  2. If you do not have a high-speed scratch filesystem but you have a large amount of free memory on each node, set the scratch directory under /dev/shm so that memory is used for the scratch files (which gives high I/O speed); a minimal sketch follows this list. At the end of the DMRG script, use import shutil; shutil.rmtree(driver.scratch) to clean up the scratch and avoid exhausting memory for future jobs.
  3. If you do not have a high-speed scratch filesystem and memory is limited, you should allow each job to use more threads and reduce the number of jobs running simultaneously until the pressure on the filesystem is reasonable.
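For option 2, a minimal sketch of how this could look (the exact /dev/shm path and the PBS_ARRAY_INDEX environment variable are assumptions; adapt them to your scheduler and job array):

    import os
    import shutil
    from pyblock2.driver.core import DMRGDriver, SymmetryTypes

    # Hypothetical per-job scratch directory in memory-backed /dev/shm;
    # PBS_ARRAY_INDEX is assumed to be provided by the PBS job array.
    scratch = "/dev/shm/block2_%s" % os.environ.get("PBS_ARRAY_INDEX", "0")

    driver = DMRGDriver(scratch=scratch, symm_type=SymmetryTypes.SU2)
    try:
        ...  # run the DMRG calculations as in the script above
    finally:
        # Always remove the in-memory scratch so /dev/shm is not exhausted
        # for subsequent jobs on the same node.
        shutil.rmtree(driver.scratch, ignore_errors=True)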
Shovan-Physics commented 2 weeks ago

Thanks a lot. I had underestimated the importance of I/O operations during the execution of the code. The /dev/shm approach seems to work perfectly. Using a dedicated scratch filesystem also gave a significant improvement; however, the wall time was still about 40% higher than the CPU time.

For scaling up the calculations it would be useful to know the following:

  1. How often are we writing to scratch? I presume you store the MPO(s) and also the kets with different tags. Do you also update the stored MPS after each local optimization during the DMRG sweeps?

  2. The link you quoted says that the stack_mem parameter in DMRGDriver should be large enough to store renormalized operators. If I understand it correctly, this memory resides in the RAM and not in the scratch directory. But I'm not sure what is meant by renormalized operators and how I can set stack_mem in practice. How would I know if it is insufficient? If I set it to some ad-hoc large value like 30 GB, will it just block that much RAM?

  3. When not using /dev/shm, I found that the CPU usage dropped when calculating the spin correlations as follows:

    corrvals = []
    for i in range(L):
        for j in range(i+1,L):
            opsum = driver.expr_builder().add_term("(T+T)0", [i, j], - 3**0.5 / 2)
            op = driver.get_mpo(opsum.finalize(adjust_order=False))
            opexp = driver.expectation(mps_min, op, mps_min)
            corrvals.append(opexp)

    Is this because we are writing to scratch every time op is redefined?

    An alternative is to use the t-J model (as suggested here: https://github.com/block-hczhai/block2-preview/issues/114#issuecomment-2266661595) and evaluate

    driver.get_2pdm(mps_min, npdm_expr='((C+D)2+(C+D)2)0', mask=(0, 0, 1, 1)) * (- 3**0.5 / 4)

    However, I found this method is not always accurate, e.g., the total spin extracted from the correlations differs from the total-spin quantum number. One has to adjust the max_bond_dim parameter of get_npdm to improve accuracy. But what is this a bond dimension of? What is the default value? Is it possible to get the correlations to a desired accuracy?

hczhai commented 2 weeks ago

Thanks for the feedback.

But I'm not sure what is meant by renormalized operators

If you are interested in the DMRG techniques and implementation details, please read the ab initio DMRG papers and the block2 source code.

how I can set stack_mem in practice. How would I know if it is insufficient? If I set it to some ad-hoc large value like 30 GB, will it just block that much RAM?

When it is insufficient, you will get an error message like "exceeding allowed memory". If you set it to a reasonably large value, it will not block that memory unless it actually needs to use that amount.
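A rough sketch of the 30 GB example mentioned above (assuming the DMRGDriver constructor accepts stack_mem in bytes):

    from pyblock2.driver.core import DMRGDriver, SymmetryTypes

    # Sketch only: assuming stack_mem is given in bytes, this allows up to
    # ~30 GB for the renormalized operators; it is only consumed as needed.
    driver = DMRGDriver(
        scratch="./tmp",
        symm_type=SymmetryTypes.SU2,
        stack_mem=30 << 30,  # 30 * 2**30 bytes ~ 30 GB
    )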

Is this because we are writing to scratch every time op is redefined?

The amount of disk I/O is proportional to the number of sweeps. In this case you call driver.expectation many times, so there are lots of disk I/O operations. If you add a line mps_min = driver.adjust_mps(mps_min, dot=1)[0] outside the for loop, then driver.expectation will not write anything to disk and you will get better CPU usage, as sketched below.
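For concreteness, the correlation loop from the earlier comment would then become (same calls as in the original snippet, with the suggested line added in front):

    # Adjust the MPS once, before the loop (as suggested above), so that the
    # repeated driver.expectation calls below do not write the MPS back to disk.
    mps_min = driver.adjust_mps(mps_min, dot=1)[0]

    corrvals = []
    for i in range(L):
        for j in range(i + 1, L):
            opsum = driver.expr_builder().add_term("(T+T)0", [i, j], - 3 ** 0.5 / 2)
            op = driver.get_mpo(opsum.finalize(adjust_order=False))
            opexp = driver.expectation(mps_min, op, mps_min)
            corrvals.append(opexp)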

One has to adjust the max_bond_dim parameter of get_npdm to improve accuracy. But what is this a bond dimension of? What is the default value? Is it possible to get the correlations to a desired accuracy?

Setting max_bond_dim to one to two times the largest sweep MPS bond dimension should give you reasonable accuracy. If you set the argument iprint=2 when calling get_npdm, the error during the npdm sweep will be printed. With a suitable max_bond_dim, the printed error should be zero.
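A hedged sketch based on the t-J expression quoted earlier in this thread (assuming get_2pdm forwards max_bond_dim and iprint to the underlying npdm sweep; the value 256 is only an illustration of one to two times the largest sweep bond dimension of 128 used above):

    # Sketch: npdm_expr and mask are taken from the earlier comment; 256 is an
    # example of 1-2x the largest sweep MPS bond dimension (128 in the outputs above).
    corr = driver.get_2pdm(
        mps_min,
        npdm_expr='((C+D)2+(C+D)2)0',
        mask=(0, 0, 1, 1),
        max_bond_dim=256,
        iprint=2,  # print the error during the npdm sweep
    ) * (- 3 ** 0.5 / 4)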

Shovan-Physics commented 2 weeks ago

Great, thanks. Two quick questions:

  1. Is there a way to get the actual (maximum) stack memory and scratch space used by the program?

  2. For the get_npdm question: when you say the largest sweep MPS bond dimension, do you mean what we get from mps.info.get_max_bond_dimension()? Is max_bond_dim set to this value by default?

hczhai commented 2 weeks ago
  1. From the DMRG output you can find
 | Dmem = 1.17 MB (20%) | Imem = 14.0 KB (96%) | Hmem = 4.19 MB | Wmem = 122 KB | Pmem = 0 B
...
 | data = 12.8 MB | cpsd = 9.99 MB

Approximately, the actual used "stack memory" is Dmem + Wmem = 1.17 MB + 122 KB, and the actual used scratch space during driver.dmrg is cpsd + Wmem = 9.99 MB + 122 KB.

  2. The largest sweep MPS bond dimension is max(bond_dims) in your script. The default can be equal to or larger than mps.info.get_max_bond_dimension().
Shovan-Physics commented 2 weeks ago

OK thanks.