ammarhakim / gkyl

This is the main source repo for the Gkeyll 2.0 code. Please see gkeyll.rtfd.io for details.
https://gkeyll.readthedocs.io/en/latest/

pre-g0: mismatched send-recv warnings at end of execution #179

Open cwsmith opened 2 months ago

cwsmith commented 2 months ago

Hello,

On the Purdue Anvil system (CPUs only), I'm hitting OpenMPI UCX (InfiniBand) warnings at the end of execution of the vm-tsw-2x2v.lua example (https://gkyl.readthedocs.io/en/latest/quickstart/inputFiles/vm-tsw-2x2v.html).

The change to run vm-tsw-2x2v.lua on multiple ranks, the output from execution including the warnings, and the job submission scripts are pasted below.

A quick search led me to this GitHub issue:

https://github.com/openucx/ucx/issues/6331#issuecomment-778428537

which indicates that some messages that were sent were not received before MPI_Finalize was called.

Note, I also hit these warnings on SDSC Expanse. In both cases the system install of OpenMPI was used.

change to run with multiple ranks

x-cwsmith@login03.anvil:[quickstart] $ diff vm-tsw-2x2v.lua vm-tsw-2x2v_orig.lua
101c101
<    decompCuts = {5,2},                      -- Cuts in each configuration direction
---
>    decompCuts = {1,1},                      -- Cuts in each configuration direction
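
The new cuts multiply to 5 x 2 = 10, which matches the `-n 10` passed to `sbatch` in the submission script below. A minimal sketch for checking this before submitting, assuming the product of `decompCuts` is meant to equal the number of MPI ranks; `decomp_ranks` is a hypothetical helper name and is not part of gkyl:

```bash
#!/bin/bash
# Hypothetical helper: multiply the entries of decompCuts in a Gkeyll Lua input file.
decomp_ranks() {
  # Take the first uncommented decompCuts line, keep the text inside {...},
  # split it on commas, and multiply the resulting integers.
  grep -m1 -E '^[[:space:]]*decompCuts[[:space:]]*=' "$1" \
    | sed -E 's/.*\{([^}]*)\}.*/\1/' \
    | tr ',' '\n' \
    | awk 'BEGIN { p = 1 } NF { p *= $1 } END { print p }'
}
```

For example, `[ "$(decomp_ranks vm-tsw-2x2v.lua)" -eq "${SLURM_NPROCS}" ]` could guard the `srun` line in the run script. If the file has no `decompCuts` line, the helper prints 1.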

output

Wed Apr 10 2024 20:21:49.000000000
Gkyl built with 9663ea594e80+
Gkyl built on Apr 10 2024 16:08:26
Initializing Vlasov-Maxwell simulation ...
Initialization completed in 1.01092 sec

Starting main loop of Vlasov-Maxwell simulation ...

 Step 0 at time 0. Time step 0.0360652. Completed 0%
0123456789 Step    139 at time    5.0130661.  Time step  3.606522e-02.  Completed 10%
0123456789 Step    278 at time    10.026132.  Time step  3.606522e-02.  Completed 20%
0123456789 Step    416 at time    15.003133.  Time step  3.606522e-02.  Completed 30%
0123456789 Step    555 at time    20.016199.  Time step  3.606522e-02.  Completed 40%
0123456789 Step    694 at time    25.029265.  Time step  3.606522e-02.  Completed 50%
0123456789 Step    832 at time    30.006266.  Time step  3.606522e-02.  Completed 60%
0123456789 Step    971 at time    35.019332.  Time step  3.606522e-02.  Completed 70%
0123456789 Step   1110 at time    40.032398.  Time step  3.606522e-02.  Completed 80%
0123456789 Step   1248 at time    45.009399.  Time step  3.606522e-02.  Completed 90%
0123456789 Step   1387 at time    50.000000.  Time step  3.606522e-02.  Completed 100%
0

Total number of time-steps 1388
   Number of forward-Euler calls 5548
   Number of RK stage-2 failures 0
   Number of RK stage-3 failures 0
Solver took                                  102.58454 s   ( 0.073908 s/step)   (75.273%)
Solver BCs took                                8.60794 s   ( 0.006202 s/step)   ( 6.316%)
Field solver took                              1.14156 s   ( 0.000822 s/step)   ( 0.838%)
Field solver BCs                               0.27180 s   ( 0.000196 s/step)   ( 0.199%)
Function field solver took                     0.00000 s   ( 0.000000 s/step)   ( 0.000%)
Moment calculations took                       8.34645 s   ( 0.006013 s/step)   ( 6.124%)
Integrated moment calculations took            5.64078 s   ( 0.004064 s/step)   ( 4.139%)
Field energy calculations took                 0.04717 s   ( 0.000034 s/step)   ( 0.035%)
Collision solver(s) took                       0.00000 s   ( 0.000000 s/step)   ( 0.000%)
Collision (other) took                         0.00000 s   ( 0.000000 s/step)   ( 0.000%)
Source updaters took                           0.00000 s   ( 0.000000 s/step)   ( 0.000%)
Stepper combine/copy took                      3.44240 s   ( 0.002480 s/step)   ( 2.526%)
Forward Euler combine took                     0.00000 s   ( 0.000000 s/step)   ( 0.000%)
Time spent in barrier function                 0.21111 s   ( 0.000152 s/step)   ( 0.155%)
Data write took                                5.74470 s   ( 0.004139 s/step)   ( 4.215%)
Write restart took                             0.02960 s   ( 0.000021 s/step)   ( 0.022%)
[Unaccounted for]                              6.19993 s   ( 0.004467 s/step)   ( 4.549%)

Main loop completed in                       136.28258 s   ( 0.098186 s/step)   (   100%)

Wed Apr 10 2024 20:24:07.000000000
[1712795047.136074] [a877:220516:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x1aaadc0 was not matched
[1712795047.136109] [a877:220516:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x1aab0c0 was not matched
<.... snip ....  UCX  WARN  unexpected tag-receive appears 21 times>
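
To tally these warnings instead of counting by eye, something like the following could be run on the job's output file. This is a rough sketch: the function names are hypothetical, and the pattern is based only on the two warning lines shown above.

```bash
#!/bin/bash
# Hypothetical helpers for summarizing UCX unmatched-receive warnings in a job log.

# Print the total number of matching warning lines in the file "$1".
count_ucx_unmatched() {
  grep -cE 'UCX +WARN +unexpected tag-receive descriptor' "$1"
}

# Print "<second bracketed field> <count>" for each distinct source,
# e.g. "[a877:220516:0] 2" for the two lines shown above.
count_ucx_unmatched_by_source() {
  awk '/UCX +WARN +unexpected tag-receive descriptor/ { n[$2]++ }
       END { for (k in n) print k, n[k] }' "$1"
}
```

Usage would be `count_ucx_unmatched <job output file>`. Grouping by the second bracketed field shows whether the warnings come from one process or are spread across several.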

job scripts

slurm submission script

x-cwsmith@login03.anvil:[quickstart] $ cat submit.sh 
#!/bin/bash -ex 
opts="--mail-user=smithc11@rpi.edu --mail-type=ALL"
sbatch $opts -p wholenode -A ##### -n 10 -N 1 -t 10 ./vmTsw.sh

run script

x-cwsmith@login03.anvil:[quickstart] $ cat vmTsw.sh
#!/bin/bash
gkyl=/anvil/projects/########/cws/gkyllPreG0Dev/gkeyllSoftCpu/bin/gkyl
srun -n ${SLURM_NPROCS} $gkyl vm-tsw-2x2v.lua

two small changes to gkyl pre-g0

One was for the ADIOS URL (this checkout predates #178 being merged), and the other was to pick up the correct Python version.

x-cwsmith@login07: /anvil/projects/x-phy220105/cws/gkyllPreG0Dev/gkyl (pre-g0)$ git diff
diff --git a/install-deps/build-adios.sh b/install-deps/build-adios.sh
index 280082b8..49f5129d 100755
--- a/install-deps/build-adios.sh
+++ b/install-deps/build-adios.sh
@@ -8,7 +8,7 @@ PREFIX=$GKYLSOFT/adios-1.13.1
 # delete old checkout and builds
 rm -rf adios-1.13.1.tar* adios-1.13.1

-curl -L http://users.nccs.gov/~pnorbert/adios-1.13.1.tar.gz > adios-1.13.1.tar.gz
+curl -L https://users.nccs.gov/~pnorbert/adios-1.13.1.tar.gz > adios-1.13.1.tar.gz
 gunzip adios-1.13.1.tar.gz
 tar -xvf adios-1.13.1.tar
 cd adios-1.13.1
diff --git a/waf b/waf
index 7ceee167..df825656 100755
--- a/waf
+++ b/waf
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3.12
 # encoding: latin-1
 # Thomas Nagy, 2005-2018
 #