firemodels / fds

Fire Dynamics Simulator
https://pages.nist.gov/fds-smv/

Jobs that fail on firebot every few days #13433

Open mcgratta opened 2 months ago

mcgratta commented 2 months ago

When I run 64 instances of this case in the firebot queue, all are successful. However, if I switch to the uneven meshes, some will fail. These cases do not have to be run with other cases, like the verification cases. They can be run alone.

&HEAD CHID='simple_caseNNN' /

&TIME T_END=60. /

&MESH IJK=50,50,50, XB=0.0,1.0,0.0,1.0,0.0,1.0 /
&MESH IJK=50,50,50, XB=1.0,2.0,0.0,1.0,0.0,1.0 /
&MESH IJK=50,50,50, XB=0.0,1.0,0.0,1.0,1.0,2.0 /
&MESH IJK=50,50,50, XB=1.0,2.0,0.0,1.0,1.0,2.0 /

 MESH IJK=50,50,50, XB=0.0,1.0,0.0,1.0,0.0,1.0 /
 MESH IJK=25,25,25, XB=1.0,2.0,0.0,1.0,0.0,1.0 /
 MESH IJK=25,25,25, XB=0.0,1.0,0.0,1.0,1.0,2.0 /
 MESH IJK=25,25,25, XB=1.0,2.0,0.0,1.0,1.0,2.0 /

&TAIL /
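
For reference, the load imbalance in the uneven variant can be estimated by counting cells per active mesh. The sketch below assumes that "switching to the uneven meshes" means moving the leading & from the first MESH block to the second, and that lines without the & are ignored. mesh_cells is a hypothetical helper, not part of FDS.

```shell
#!/bin/bash
# Hypothetical helper: print the cell count (I*J*K) of every active &MESH
# line in an FDS input file. Lines without the leading '&' are skipped,
# which is assumed to match how the commented-out mesh block is treated.
mesh_cells() {
    # $1: path to an FDS input file
    grep -E '^[[:space:]]*&MESH' "$1" |
        sed -E 's/.*IJK=([0-9]+),([0-9]+),([0-9]+).*/\1 \2 \3/' |
        while read -r i j k; do
            echo $(( i * j * k ))
        done
}
```

With the uneven block enabled, this should report 125000 cells for the first mesh and 15625 for each of the other three, so one MPI process carries 8 times the cells of each of the others.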

Run 64 cases on firebot, designating one mesh per node.

#!/bin/bash

qfds.sh -p 4 -n 1 -q firebot simple_case_01.fds
qfds.sh -p 4 -n 1 -q firebot simple_case_02.fds
qfds.sh -p 4 -n 1 -q firebot simple_case_03.fds
qfds.sh -p 4 -n 1 -q firebot simple_case_04.fds
qfds.sh -p 4 -n 1 -q firebot simple_case_05.fds
...

mcgratta commented 2 months ago

I compiled the code in dv mode and ran 64 unbalanced cases on nodes 31-34. A few jobs hung. I killed one of them and got line numbers 1833 and 1834 in main, called from line 689. These are ALLREDUCE calls, one after the other. It is suspicious that one process was hung on the first call and another on the second. I added MPI_BARRIERs to this routine to see if I can still get the hang.

mcgratta commented 2 months ago

I added some MPI_BARRIERs and I see that process 1 was stuck at the barrier while 2 and 3 were stuck at the first allreduce.

IF (N_MPI_PROCESSES>1) THEN
   CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
   CALL MPI_ALLREDUCE(MPI_IN_PLACE,DSUM_ALL(1),N_ZONE,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,IERR)

Check if your cases are hanging in this vicinity.

marcosvanella commented 2 months ago

Ok, I'm testing the cases with impi first on firebot. The balanced cases went through fine. I'm testing the uneven cases now. The MPI_ALLREDUCE call looks fine. I'm starting from your template input file. I build the input files and submit them with this script:

#!/bin/bash

myqfds=/home/mnv/FireModels_fork_home/fds/Utilities/Scripts/qfds.sh

for n in $(seq 1 64); do
    if [ "$n" -lt 10 ]; then
        echo "0$n"
        cp simple_caseNNN.fds "simple_case_0$n.fds"
        sed -i "s/NNN/_0$n/g" "simple_case_0$n.fds"
        "$myqfds" -p 4 -n 1 -T dv -q firebot "simple_case_0$n.fds"
    else
        echo "$n"
        cp simple_caseNNN.fds "simple_case_$n.fds"
        sed -i "s/NNN/_$n/g" "simple_case_$n.fds"
        "$myqfds" -p 4 -n 1 -T dv -q firebot "simple_case_$n.fds"
    fi
done
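
The two branches in the script above differ only in the zero padding, so the same input files could also be produced with printf. This is a minimal sketch, assuming the template simple_caseNNN.fds sits in the current directory. make_cases is a hypothetical name, and the submission step is left as a comment so the files can be inspected first.

```shell
#!/bin/bash
# Hypothetical helper: create simple_case_01.fds ... simple_case_NN.fds
# from a template, replacing the NNN placeholder with a zero-padded suffix.
make_cases() {
    local template=$1 count=$2 n suffix
    for n in $(seq 1 "$count"); do
        suffix=$(printf '_%02d' "$n")   # 1 -> _01, 12 -> _12
        sed "s/NNN/${suffix}/g" "$template" > "simple_case${suffix}.fds"
    done
}

# Possible usage, reusing the qfds.sh options from the script above:
#   make_cases simple_caseNNN.fds 64
#   for f in simple_case_??.fds; do
#       "$myqfds" -p 4 -n 1 -T dv -q firebot "$f"
#   done
```

Writing through a redirect instead of cp followed by sed -i leaves the template untouched and avoids the separate copy step.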

marcosvanella commented 2 months ago

I had three hangs for the uneven cases with the impi -T dv options, but could not retrieve backtrace information pointing into the FDS source. The ompi_gnu_linux cases did not hang. We can try changing the compilation flags for the impi dv target to see if we get more information.
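
One possible way to get a backtrace without killing the job is to attach a debugger to the hung processes on the compute node. The sketch below is only an illustration, assuming gdb is installed there, the user is allowed to attach to their own processes, and the executable was built with debug symbols. bt_commands is a hypothetical helper that prints the gdb commands rather than running them, so they can be reviewed first.

```shell
#!/bin/bash
# Hypothetical helper: for each PID given, print a gdb command that attaches
# to the process, dumps the backtrace of every thread, and then detaches.
bt_commands() {
    local pid
    for pid in "$@"; do
        printf 'gdb -p %s -batch -ex "thread apply all bt"\n' "$pid"
    done
}

# Possible usage on the node running a hung job. The pattern "fds" is an
# assumption about the executable name:
#   bt_commands $(pgrep -u "$USER" fds) | sh
```

Comparing the frames reported for each rank would show which ranks are waiting in the barrier and which are already in the first MPI_ALLREDUCE.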