ZipWin opened this issue 10 months ago (status: Open)
Thanks for the report. Very likely this is due to differences in the performance of BLAS and LAPACK on those two systems. I would recommend comparing BLAS/LAPACK functionality like matrix multiplication, SVD, etc. independently of ITensor and seeing whether you see similar discrepancies.
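A minimal sketch of such a comparison, assuming BenchmarkTools is installed (the matrix size and thread count below are my own choices, not anything from this thread):

using LinearAlgebra, BenchmarkTools

# Optionally pin the BLAS thread count so both machines run with the same number
# of threads (the default differs with core count); 8 is an arbitrary choice here.
BLAS.set_num_threads(8)

n = 2000            # matrix size is an arbitrary choice for the comparison
A = randn(n, n)
B = randn(n, n)

@btime $A * $B;               # dense matrix multiplication (BLAS gemm)
@btime svd($A);               # singular value decomposition (LAPACK)
@btime eigen(Symmetric($A));  # symmetric eigendecomposition (LAPACK)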
@mtfishman I see that I have access to some Rome AMD EPYC™ 7002 CPUs, which have 128 cores, so I can also run some performance tests.
Okay, I have run a quick test with two different processors. The test looks like this:
using ITensors, LinearAlgebra

# Spin-1/2 Heisenberg model on a (3*4) x 8 square lattice with open boundaries
N = 3 * 4 * 8
Nx = 3 * 4
Ny = 8

sites = siteinds("S=1/2", N)
lattice = square_lattice(Nx, Ny; yperiodic=false)

os = OpSum()
for b in lattice
  os .+= 0.5, "S+", b.s1, "S-", b.s2
  os .+= 0.5, "S-", b.s1, "S+", b.s2
  os .+= "Sz", b.s1, "Sz", b.s2
end
H = MPO(os, sites)

# Random starting MPS with link dimension 10, built from an alternating Up/Dn product state
state = [isodd(n) ? "Up" : "Dn" for n in 1:N]
psi0 = randomMPS(sites, state, 10)

energy, psi = dmrg(H, psi0; nsweeps=30, maxdim=60, cutoff=1E-5)
I am using the AMD Rome chip and a Cascade Lake Intel chip. The versioninfo() output on the AMD is:
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 128 × AMD EPYC 7742 64-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 1 on 128 virtual cores
Environment:
LD_LIBRARY_PATH = /cm/shared/apps/slurm/current/lib64:/mnt/sw/nix/store/pmwk60bp5k4qr8vsg411p7vzhr502d83-openblas-0.3.23/lib
And on the Cascade Lake it is:
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 32 × Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 1 on 32 virtual cores
Environment:
LD_LIBRARY_PATH = /mnt/sw/nix/store/hayjz1l94cb2ky37bhcv71aygjzq7fci-openblas-0.3.21/lib:/cm/shared/apps/slurm/current/lib64
The AMD has a clock speed of 2.25 GHz and 64 cores (128 threads), and the Intel has a clock speed of 3.6 GHz and 32 cores. That puts my estimate at 2.3 TFLOPS for the AMD and ~1.8 TFLOPS for the Intel. Maybe I should, but I am not considering boost clock speeds in these estimates. I made sure both were using the OpenBLAS linear algebra backend, and I do not have MKL loaded on the Intel chip.
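For reference, a quick way to confirm which BLAS/LAPACK backend a Julia session is actually using (my own sketch, not output from the runs above):

using LinearAlgebra

# Shows the libraries loaded through libblastrampoline, e.g. libopenblas64_ by default;
# if MKL.jl were loaded it would appear here instead.
BLAS.get_config()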
** When I look up the AMD chip it says there are 64 cores and 128 threads, so I am not sure whether I should use 64 or 128, and my FLOP count could therefore be off by a factor of two. This reference implies that I am off by a factor of 2, which only makes my AMD results look worse.
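For what it is worth, the arithmetic behind those peak estimates looks like this. The 16 double-precision FLOPs per cycle per core is an assumption that reproduces the figures quoted above; the true per-cycle throughput depends on the chip's vector units:

# Rough peak: clock (GHz) * cores * double-precision FLOPs per cycle per core, in TFLOPS.
peak_tflops(clock_ghz, cores; flops_per_cycle=16) = clock_ghz * cores * flops_per_cycle / 1000

peak_tflops(2.25, 64)   # ≈ 2.30 TFLOPS for the 64-core EPYC 7742
peak_tflops(3.60, 32)   # ≈ 1.84 TFLOPS for the 32-core Xeon Gold 6244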
Here is the output for the first few sweeps on the AMD:
After sweep 1 energy=-57.54141295684079 maxlinkdim=38 maxerr=9.95E-06 time=5.758
After sweep 2 energy=-58.904767252032805 maxlinkdim=60 maxerr=3.38E-04 time=7.347
After sweep 3 energy=-59.11224893888357 maxlinkdim=60 maxerr=5.26E-04 time=8.478
After sweep 4 energy=-59.150691419028654 maxlinkdim=60 maxerr=6.03E-04 time=8.192
After sweep 5 energy=-59.164719100449744 maxlinkdim=60 maxerr=6.33E-04 time=6.386
After sweep 6 energy=-59.17082567081569 maxlinkdim=60 maxerr=6.44E-04 time=6.386
After sweep 7 energy=-59.174711234075126 maxlinkdim=60 maxerr=6.46E-04 time=7.625
And here is the output for the Intel:
After sweep 1 energy=-57.59077526962814 maxlinkdim=39 maxerr=9.98E-06 time=0.694
After sweep 2 energy=-58.90697215419522 maxlinkdim=60 maxerr=3.26E-04 time=2.616
After sweep 3 energy=-59.118974149770764 maxlinkdim=60 maxerr=5.14E-04 time=3.790
After sweep 4 energy=-59.14627690267304 maxlinkdim=60 maxerr=5.79E-04 time=3.603
After sweep 5 energy=-59.15889103852813 maxlinkdim=60 maxerr=6.06E-04 time=3.108
After sweep 6 energy=-59.1689043895468 maxlinkdim=60 maxerr=6.35E-04 time=3.512
After sweep 7 energy=-59.1743327295998 maxlinkdim=60 maxerr=6.56E-04 time=3.536
So it does look like the AMD is running significantly slower. This could potentially be related to Slurm; I have talked to Miles about something odd I have found with Slurm. I do not use Slurm to run on the Intel nodes, just the AMD ones. To be sure, I am trying a run on an Intel Ice Lake node that I have access to, and I will update when I have those results.
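One quick way to see whether Slurm is changing the environment a job runs in, compared to an interactive session, is to print the scheduler and threading variables from inside Julia. This is my own diagnostic sketch, not something from this thread, and the variables listed are just the common ones:

# Print environment variables that can differ between a Slurm job and a login shell.
using LinearAlgebra
for var in ("SLURM_CPUS_PER_TASK", "OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "JULIA_NUM_THREADS")
    println(var, " = ", get(ENV, var, "<unset>"))
end
println("BLAS threads: ", BLAS.get_num_threads())
println("CPU threads visible to Julia: ", Sys.CPU_THREADS)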
Update on the Ice Lake node. Here are the results for the first few sweeps:
After sweep 1 energy=-57.63326237341011 maxlinkdim=38 maxerr=9.95E-06 time=9.316
After sweep 2 energy=-58.920829223723956 maxlinkdim=60 maxerr=3.22E-04 time=2.890
After sweep 3 energy=-59.12751248183843 maxlinkdim=60 maxerr=5.89E-04 time=3.399
After sweep 4 energy=-59.15597966767008 maxlinkdim=60 maxerr=6.52E-04 time=2.623
After sweep 5 energy=-59.1670197697073 maxlinkdim=60 maxerr=6.64E-04 time=2.420
After sweep 6 energy=-59.172930867376905 maxlinkdim=60 maxerr=6.74E-04 time=2.552
After sweep 7 energy=-59.176015668923334 maxlinkdim=60 maxerr=6.79E-04 time=2.629
And here is the versioninfo() output:
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 64 × Intel(R) Xeon(R) Platinum 8362 CPU @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, icelake-server)
Threads: 1 on 64 virtual cores
Environment:
LD_LIBRARY_PATH = /cm/shared/apps/slurm/current/lib64:/mnt/sw/nix/store/hayjz1l94cb2ky37bhcv71aygjzq7fci-openblas-0.3.21/lib
And the node has a peak performance of 2.86 TFLOPS. This run also shows that the performance issue does not seem to be related to Slurm.
I was running the same DMRG code on an M3 MacBook Pro and on an AMD EPYC 7763 64-Core Processor server. The EPYC is much slower than the M3. I also tested the same code on my AMD Ryzen 7 4800H laptop; it is faster than the EPYC but slower than the M3. I'm not sure whether this is a problem with AMD CPUs or not. Is there any way to improve the performance?
This is the output on the M3: [output not reproduced here]
And this one is on the EPYC: [output not reproduced here]

Minimal code: my code is nothing but a simple DMRG.

Version information:
Output from versioninfo(): M3 and EPYC [not reproduced here]
Output from using Pkg; Pkg.status("ITensors"): M3 and EPYC [not reproduced here]
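In the spirit of the earlier suggestion, one thing worth checking on the EPYC server is the BLAS thread count. This is a hedged sketch of my own, not something reported in this thread: OpenBLAS typically starts one thread per core, and on a 64-core machine that can be more threads than the tensor contractions in a run like this benefit from, so experimenting with smaller counts is a cheap test.

using LinearAlgebra

@show BLAS.get_num_threads()   # OpenBLAS usually defaults to one thread per core

# Try smaller counts and compare sweep times; the value 8 below is just an
# example to experiment with, not a recommendation from this thread.
BLAS.set_num_threads(8)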