PrincetonUniversity / tristan-mp-v2

Tristan-MP v2 [public]
https://princetonuniversity.github.io/tristan-v2/
BSD 3-Clause "New" or "Revised" License

Got stuck at first time steps when running on multiple CPUs/cores #13

Closed JiamingZhuge closed 7 months ago

JiamingZhuge commented 10 months ago

Hello Hayk, I recently found that the latest version of tristan-mp-v2 somehow gets stuck at the first output when I run it on multiple CPUs (mpirun -np [>1]). It got stuck after this output:

== Full simulation parameters ========================================== ...
== Fiducial physical parameters ========================================
skin depth [dx]: 8.000          plasma oscillation period [dt]: 100.531
gyroradius [dx]: 8.000          gyration period [dt]: 100.531
........................................................................

in the terminal. The diag.log output shows that something goes wrong right after writing the parameters:

...initializeSimulation()
...distributeMeshblocks()
...initializeLB()
...initializeFields()
...initializeParticles()
...initializePrtlExchange()
...initializeRandomSeed()
......exchangeFields()
...userInitialize()
...checkEverything()
InitializeAll()
Starting mainloop()
............................................................
Starting timestep # 0
......advanceBHalfstep()
......exchangeFields()
......moveParticles()
......advanceBHalfstep()
......exchangeFields()
......advanceEFullstep()
......depositCurrents()
......exchangeCurrents() ......exchangeCurrents() ......exchangeCurrents() ......exchangeCurrents() ......exchangeCurrents() ......exchangeCurrents() ......exchangeCurrents() ......exchangeCurrents() ......exchangeCurrents() ......exchangeCurrents() ......exchangeCurrents() ......exchangeCurrents() ......exchangeCurrents() ......exchangeCurrents()
......filterCurrents()
......addCurrents()
......exchangeFields()
......exchangeParticles()
......clearGhostParticles()
......clearGhostParticles()
.........writeParams()

(stuck here) This problem only occurs when I run on multiple cores on my computer or on multiple CPUs on the cluster; everything works when I run on 1 CPU/core (mpirun -np 1). I also find that an older version (cloned maybe a month earlier) works on multiple cores. I run the code with these commands (OpenMPI):

make clean
python3 configure.py -mpi08 -hdf5 --user=user_rad_shock -2d
make all -j
mkdir output_test
mpirun -np 4 bin/tristan-mp2d -i inputs/input.rad_shock -o output_test/

I tried to add the debug flag, python3 configure.py -mpi08 -hdf5 --user=user_rad_shock -2d --debug=2, which shows the following (after modifying configure.py line 270 from "debug" to "d0ebug"):

Traceback (most recent call last):
  File "/home/zhuge/Desktop/tristan-mp-v2/configure.py", line 280, in <module>
    if int(args["debug"]) >= 2 and (args["compiler"] == "intel"):
KeyError: 'compiler'

Since I don't use the Intel compiler, maybe the "compiler" flag also needs to be set? And which documentation should I check for the debug information, if it works?
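For context, the traceback above is an ordinary missing-key lookup: the "compiler" entry was never placed into the args dictionary on this code path, so args["compiler"] raises KeyError before the debug check can run. Below is a minimal sketch of the failing pattern and a defensive variant, with purely illustrative names rather than the actual configure.py logic:

```python
# Illustrative sketch only -- not the real configure.py argument handling.
# The "compiler" key is absent, mimicking the KeyError in the traceback.
args = {"debug": "2"}

# Fails with KeyError: 'compiler'
#   if int(args["debug"]) >= 2 and (args["compiler"] == "intel"):

# Defensive variant: dict.get() supplies a default instead of raising.
if int(args.get("debug", 0)) >= 2 and args.get("compiler", "") == "intel":
    print("would enable Intel-specific debug options")
else:
    print("would skip Intel-specific debug options")
```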

haykh commented 10 months ago

@JiamingZhuge could you provide a bit more info on what compiler you're using, and whether you are using parallel HDF5? One way to check is to ensure that h5pfc exists and is being used by the compiler. Another issue might be that you're using the wrong version of MPI (not the one that parallel HDF5 was compiled with).

Is this on a cluster?
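A quick way to gather that information is to query the wrappers directly. The sketch below uses only the Python standard library and assumes the conventional wrapper options (h5pfc -show, mpif90 -show, mpirun --version), which hold for common OpenMPI/HDF5 installs but may differ on other stacks:

```python
# Sketch: report which HDF5/MPI wrappers are on PATH and what they invoke.
# Assumes the usual wrapper flags (-show, --version); adjust for your install.
import shutil
import subprocess

def report(cmd, flag):
    path = shutil.which(cmd)
    print(f"{cmd}: {path or 'NOT FOUND'}")
    if path:
        out = subprocess.run([cmd, flag], capture_output=True, text=True)
        lines = (out.stdout or out.stderr).strip().splitlines()
        print("   ", lines[0] if lines else "(no output)")

report("h5pfc", "-show")       # parallel HDF5 Fortran wrapper -- should exist and wrap mpif90
report("mpif90", "-show")      # MPI Fortran wrapper used when compiling
report("mpirun", "--version")  # MPI runtime that launches the job
```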

JiamingZhuge commented 10 months ago

Of course! Let me provide that information for the cluster. I run on a cluster and load these modules:

module list

Currently Loaded Modules:
  1) shared
  2) slurm/expanse/21.08.8
  3) cpu/0.17.3b (c)
  4) gcc/10.2.0/npcyll4
  5) ucx/1.10.1/dnpjjuc
  6) openmpi/4.1.3/oq3qvsv
  7) openjdk/11.0.12_7/27cv2ps
  8) hdf5/1.10.7/5o4oibc
  9) anaconda3/2021.05/q4munrg

Where: c: built natively for AMD Rome

Here the HDF5 module is the OpenMPI-based build:

module spider hdf5/1.10.7/5o4oibc

hdf5/1.10.7: hdf5/1.10.7/5o4oibc

You will need to load all module(s) on any one of the lines below before the "hdf5/1.10.7/5o4oibc" module is available to load.

  cpu/0.17.3b  gcc/10.2.0/npcyll4  openmpi/4.1.3/oq3qvsv

Help:
  HDF5 is a data model, library, and file format for storing and managing
  data. It supports an unlimited variety of datatypes, and is designed for
  flexible and efficient I/O and for high volume and complex data.

As for the gcc:

gcc -v Reading specs from /cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib/gcc/x86_64-pc-linux-gnu/10.2.0/specs COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/libexec/gcc/x86_64-pc-linux-gnu/10.2.0/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: /scratch/spack_cpu/job_21694812/spack-stage/spack-stage-gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/spack-src/configure --prefix=/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd --with-pkgversion='Spack GCC' --with-bugurl=https://github.com/spack/spack/issues --disable-multilib --enable-languages=c,c++,fortran --disable-nls --with-system-zlib --with-zstd-include=/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/zstd-1.5.0-ixhjq2kjkwwiubjqtzompy3ovx3xskjy/include --with-zstd-lib=/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/zstd-1.5.0-ixhjq2kjkwwiubjqtzompy3ovx3xskjy/lib --disable-bootstrap --with-mpfr-include=/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/mpfr-4.1.0-2gn43ksz5mn4l2ydhukvmf2hc5n6lsu2/include --with-mpfr-lib=/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/mpfr-4.1.0-2gn43ksz5mn4l2ydhukvmf2hc5n6lsu2/lib --with-gmp-include=/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gmp-6.2.1-6d5recuzoijnpzdmyuyatwr32y6e756r/include --with-gmp-lib=/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gmp-6.2.1-6d5recuzoijnpzdmyuyatwr32y6e756r/lib --with-mpc-include=/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/mpc-1.1.0-7brtlqfdvz2iwdzeyd23igqlwz3fq4d5/include --with-mpc-lib=/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/mpc-1.1.0-7brtlqfdvz2iwdzeyd23igqlwz3fq4d5/lib --without-isl --with-stage1-ldflags='-Wl,-rpath,/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib -Wl,-rpath,/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib64 -Wl,-rpath,/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gmp-6.2.1-6d5recuzoijnpzdmyuyatwr32y6e756r/lib -Wl,-rpath,/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/mpc-1.1.0-7brtlqfdvz2iwdzeyd23igqlwz3fq4d5/lib -Wl,-rpath,/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/mpfr-4.1.0-2gn43ksz5mn4l2ydhukvmf2hc5n6lsu2/lib -Wl,-rpath,/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/zlib-1.2.11-bmchsimapzrndjqxvin7wptdiiwoxdqq/lib -Wl,-rpath,/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/zstd-1.5.0-ixhjq2kjkwwiubjqtzompy3ovx3xskjy/lib' --with-boot-ldflags='-Wl,-rpath,/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib -Wl,-rpath,/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib64 -Wl,-rpath,/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gmp-6.2.1-6d5recuzoijnpzdmyuyatwr32y6e756r/lib -Wl,-rpath,/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/mpc-1.1.0-7brtlqfdvz2iwdzeyd23igqlwz3fq4d5/lib 
-Wl,-rpath,/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/mpfr-4.1.0-2gn43ksz5mn4l2ydhukvmf2hc5n6lsu2/lib -Wl,-rpath,/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/zlib-1.2.11-bmchsimapzrndjqxvin7wptdiiwoxdqq/lib -Wl,-rpath,/cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/zstd-1.5.0-ixhjq2kjkwwiubjqtzompy3ovx3xskjy/lib -static-libstdc++ -static-libgcc' Thread model: posix Supported LTO compression algorithms: zlib zstd gcc version 10.2.0 (Spack GCC)
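One way to exercise exactly this stack (gcc + openmpi/4.1.3 + hdf5/1.10.7) outside of Tristan is a tiny collective-write test. The sketch below assumes mpi4py and an MPI-enabled h5py built against these same modules, which may not be installed here; it is only meant to show the kind of check that separates an MPI/parallel-HDF5 mismatch from a problem in the simulation code itself:

```python
# Sketch: parallel HDF5 smoke test; run as e.g. `mpirun -np 4 python3 h5_smoke.py`.
# Assumes mpi4py and an MPI-enabled h5py built against the same OpenMPI/HDF5.
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Collective open of a file with the MPI-IO driver; each rank writes one slot.
with h5py.File("smoke.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("ranks", (comm.Get_size(),), dtype="i")
    dset[rank] = rank

if rank == 0:
    print("collective HDF5 write completed")
```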

JiamingZhuge commented 8 months ago

Hello Hayk, I tried to use intel-mpi, but things still do not go well. Something goes wrong when I compile the code.

ifort: command line warning #10006: ignoring unknown option '-ffree-line-length-512'
ifort: command line warning #10006: ignoring unknown option '-J'
ifort: warning #10145: no action performed for file 'build/'

I load

module load cpu/0.15.4 intel/19.1.1.217 intel-mpi/2019.8.254 hdf5/1.10.6

and used the default setting.

python3 configure.py -mpi08 -hdf5 --user=user_rad_shock -2d

Is there anything I need to reset when compiling? Thanks!
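As a note on the warnings above: -ffree-line-length-512 and -J are gfortran-specific options, so ifort ignores them, and the stray build/ argument left behind by the ignored -J is likely what triggers "no action performed for file 'build/'" (ifort's counterpart of -J is -module). A build/configure step normally has to branch on the compiler flavor when assembling these flags; the helper below is a hypothetical illustration of that pattern, not the actual configure.py code:

```python
# Hypothetical helper: pick Fortran module/format flags per compiler flavor.
# The flag names are the standard gfortran/ifort options; the function itself
# is illustrative and not part of the real configure.py.
def fortran_module_flags(fc, moddir):
    if "ifort" in fc or "ifx" in fc:
        # Intel: module files go to -module <dir>; no line-length flag needed.
        return ["-module", moddir]
    # GNU: -J <dir> sets the module directory; allow long free-form lines.
    return ["-J", moddir, "-ffree-line-length-512"]

print(fortran_module_flags("ifort", "build/"))     # ['-module', 'build/']
print(fortran_module_flags("gfortran", "build/"))  # ['-J', 'build/', '-ffree-line-length-512']
```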

JiamingZhuge commented 7 months ago

I solved the problem and can run the code now.

akbwyfc commented 2 months ago

Hi @JiamingZhuge, how did you solve the problem? I think it is not related to multiple cores. Sometimes the first run is OK with multiple cores, but the second run gets stuck.

Thanks in advance.

JiamingZhuge commented 2 months ago

Hi @akbwyfc, I changed the MPI from OpenMPI to a different MPI, and it reported more details. You are right, the problem I met was not related to multiple cores. At which step does it get stuck? Did you change anything for the second run?
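For anyone hitting a similar hang, a bare MPI sanity check is a useful first step before suspecting the output stage: if it hangs, the MPI installation or module mix is at fault; if it passes, the parallel HDF5 output path is the more likely culprit. A minimal sketch, assuming mpi4py is installed against the same MPI module used to launch the simulation:

```python
# Sketch: bare MPI sanity check; run as `mpirun -np 4 python3 mpi_check.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()} on {MPI.Get_processor_name()}")
comm.Barrier()  # hangs here if the ranks cannot talk to each other
if comm.Get_rank() == 0:
    print("all ranks reached the barrier")
```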