ITensor / ITensors.jl

A Julia library for efficient tensor computations and tensor network calculations
https://itensor.org
Apache License 2.0

[ITensors] [BUG] Bad performance of DMRG in AMD CPU #1298

Open ZipWin opened 8 months ago

ZipWin commented 8 months ago

I was running the same DMRG code on an M3 MacBook Pro and on a server with an AMD EPYC 7763 64-core processor. The run on the EPYC is much slower than on the M3. I also tested the same code on my AMD R7 4800H laptop; it is faster than the EPYC but still slower than the M3. I'm not sure whether this is a problem with the AMD CPUs or not. Is there any way to improve the performance?

This is the output on the M3:

[screenshot of the DMRG sweep output on the M3]

And this one is on the EPYC:

[screenshot of the DMRG sweep output on the EPYC]

Minimal code

My code is nothing but a simple DMRG:

using ITensors

N = 3 * 4 * 8
sites = siteinds("S=1/2", N)
os = OpSum()   # filled with the terms of a 2D quantum spin model (elided in the report)
H = MPO(os, sites)
psi0 = randomMPS(sites, 10)
energy, psi = dmrg(H, psi0; nsweeps=30, maxdim=60, cutoff=1E-5)

Version information

mtfishman commented 8 months ago

Thanks for the report. Very likely this is due to differences in the performance of BLAS and LAPACK on those two systems. I would recommend benchmarking BLAS/LAPACK functionality like matrix multiplication, SVD, etc. independently of ITensor and seeing whether you observe similar discrepancies.
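A minimal standalone check along those lines could look like the following sketch (the matrix size of 2000 and the GFLOPS formula are illustrative choices, not numbers from this issue):

using LinearAlgebra

println(BLAS.get_config())                      # which BLAS is actually loaded
println("BLAS threads: ", BLAS.get_num_threads())

n = 2000
A, B, C = randn(n, n), randn(n, n), zeros(n, n)

mul!(C, A, B); svd(A)                           # warm-up / compilation

t_mul = @elapsed mul!(C, A, B)
t_svd = @elapsed svd(A)

println("matmul: ", t_mul, " s  (~", round(2n^3 / t_mul / 1e9; digits=1), " GFLOPS)")
println("svd:    ", t_svd, " s")

Running this on both machines should show whether the DMRG slowdown is simply a BLAS/LAPACK difference.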

kmp5VT commented 8 months ago

@mtfishman I see that I have access to some Rome AMD EPYC™ 7002 CPUs, which have 128 cores, so I can also run some performance tests.

kmp5VT commented 8 months ago

Okay, I have run a quick test with two different processors. The test looks like this:

using ITensors, LinearAlgebra

N = 3 * 4 * 8
Nx = 3 * 4
Ny = 8
sites = siteinds("S=1/2", N)
lattice = square_lattice(Nx, Ny; yperiodic=false)

os = OpSum()
for b in lattice
  os .+= 0.5, "S+", b.s1, "S-", b.s2
  os .+= 0.5, "S-", b.s1, "S+", b.s2
  os .+= "Sz", b.s1, "Sz", b.s2
end
H = MPO(os, sites)

state = [isodd(n) ? "Up" : "Dn" for n in 1:N]
psi0 = randomMPS(sites, state, 10)
energy, psi = dmrg(H, psi0; nsweeps=30, maxdim=60, cutoff=1E-5)

I am using the AMD Rome chip and a Cascade Lake Intel chip. The versioninfo on the AMD is

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 7742 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
  Threads: 1 on 128 virtual cores
Environment:
  LD_LIBRARY_PATH = /cm/shared/apps/slurm/current/lib64:/mnt/sw/nix/store/pmwk60bp5k4qr8vsg411p7vzhr502d83-openblas-0.3.23/lib

And for the Cascade Lake:

  julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
  Threads: 1 on 32 virtual cores
Environment:
  LD_LIBRARY_PATH = /mnt/sw/nix/store/hayjz1l94cb2ky37bhcv71aygjzq7fci-openblas-0.3.21/lib:/cm/shared/apps/slurm/current/lib64

The AMD has a clock speed of 2.25 GHz and 64 cores (128 threads), and the Intel has a clock speed of 3.6 GHz and 32 cores. That puts my estimate at roughly 2.3 TFLOPS for the AMD and ~1.8 TFLOPS for the Intel; I am not accounting for boost clocks in these estimates, though maybe I should. I made sure both were using the OpenBLAS linear algebra library, and I do not have MKL loaded on the Intel chip.
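A rough sketch of that back-of-the-envelope arithmetic, peak ≈ cores × clock × FLOPs per cycle per core (the factor of 16 double-precision FLOPs per cycle per core is an assumed value, chosen because it reproduces the 2.3 and ~1.8 TFLOPS figures above):

# peak_tflops is a hypothetical helper, not part of any library
peak_tflops(cores, ghz; flops_per_cycle=16) = cores * ghz * flops_per_cycle / 1000

peak_tflops(64, 2.25)   # ≈ 2.3 TFLOPS for the EPYC 7742
peak_tflops(32, 3.6)    # ≈ 1.8 TFLOPS for the Xeon Gold 6244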
** When I look up the AMD chip it lists 64 cores and 128 threads, so I am not sure whether I should use 64 or 128, and my FLOP count could be off by a factor of two. This reference implies that I am off by a factor of 2, which only makes my AMD results look worse.

Here is the output for the first few sweeps on AMD:

After sweep 1 energy=-57.54141295684079  maxlinkdim=38 maxerr=9.95E-06 time=5.758
After sweep 2 energy=-58.904767252032805  maxlinkdim=60 maxerr=3.38E-04 time=7.347
After sweep 3 energy=-59.11224893888357  maxlinkdim=60 maxerr=5.26E-04 time=8.478
After sweep 4 energy=-59.150691419028654  maxlinkdim=60 maxerr=6.03E-04 time=8.192
After sweep 5 energy=-59.164719100449744  maxlinkdim=60 maxerr=6.33E-04 time=6.386
After sweep 6 energy=-59.17082567081569  maxlinkdim=60 maxerr=6.44E-04 time=6.386
After sweep 7 energy=-59.174711234075126  maxlinkdim=60 maxerr=6.46E-04 time=7.625

And here is the output for the Intel:

After sweep 1 energy=-57.59077526962814  maxlinkdim=39 maxerr=9.98E-06 time=0.694
After sweep 2 energy=-58.90697215419522  maxlinkdim=60 maxerr=3.26E-04 time=2.616
After sweep 3 energy=-59.118974149770764  maxlinkdim=60 maxerr=5.14E-04 time=3.790
After sweep 4 energy=-59.14627690267304  maxlinkdim=60 maxerr=5.79E-04 time=3.603
After sweep 5 energy=-59.15889103852813  maxlinkdim=60 maxerr=6.06E-04 time=3.108
After sweep 6 energy=-59.1689043895468  maxlinkdim=60 maxerr=6.35E-04 time=3.512
After sweep 7 energy=-59.1743327295998  maxlinkdim=60 maxerr=6.56E-04 time=3.536

So it does look like the AMD is running significantly slower. This could potentially be related to Slurm; I have talked to Miles about something weird I have found with Slurm, and I do not use Slurm to run on the Intel, just on the AMD. To be sure, I am trying to run on an Intel Ice Lake node that I have access to. I will update when I have those results.
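A quick sanity check of what the Slurm allocation actually exposes to the job could look like this sketch (assuming the cluster sets the usual Slurm environment variables; the /proc check is Linux-only):

using LinearAlgebra

println("Sys.CPU_THREADS:      ", Sys.CPU_THREADS)
println("Julia threads:        ", Threads.nthreads())
println("BLAS threads:         ", BLAS.get_num_threads())
println("SLURM_CPUS_ON_NODE:   ", get(ENV, "SLURM_CPUS_ON_NODE", "<unset>"))
println("OPENBLAS_NUM_THREADS: ", get(ENV, "OPENBLAS_NUM_THREADS", "<unset>"))

# The cpuset the process is actually pinned to (Linux only):
for line in eachline("/proc/self/status")
    startswith(line, "Cpus_allowed_list") && println(line)
end

If the cgroup exposes fewer cores than OpenBLAS spawns threads for, that oversubscription could account for part of a slowdown.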

kmp5VT commented 8 months ago

Update on the Ice Lake node. Here are the results for the first few sweeps:

After sweep 1 energy=-57.63326237341011  maxlinkdim=38 maxerr=9.95E-06 time=9.316
After sweep 2 energy=-58.920829223723956  maxlinkdim=60 maxerr=3.22E-04 time=2.890
After sweep 3 energy=-59.12751248183843  maxlinkdim=60 maxerr=5.89E-04 time=3.399
After sweep 4 energy=-59.15597966767008  maxlinkdim=60 maxerr=6.52E-04 time=2.623
After sweep 5 energy=-59.1670197697073  maxlinkdim=60 maxerr=6.64E-04 time=2.420
After sweep 6 energy=-59.172930867376905  maxlinkdim=60 maxerr=6.74E-04 time=2.552
After sweep 7 energy=-59.176015668923334  maxlinkdim=60 maxerr=6.79E-04 time=2.629

And here is the versioninfo

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Platinum 8362 CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, icelake-server)
  Threads: 1 on 64 virtual cores
Environment:
  LD_LIBRARY_PATH = /cm/shared/apps/slurm/current/lib64:/mnt/sw/nix/store/hayjz1l94cb2ky37bhcv71aygjzq7fci-openblas-0.3.21/lib

And the node has a peak performance of 2.86 TFLOPS, which shows that the performance issue does not seem to be related to Slurm.