Open mcgratta opened 2 months ago
I compiled the code in dv mode and ran 64 unbalanced cases on nodes 31-34. A few jobs hung. I killed one of them and got line numbers 1833 and 1834 in main, called from line 689. These are ALLREDUCE calls, one after the other. The fact that one process was hung on the first call and another on the second is suspicious. I added MPI_BARRIERs to this routine to see if I can still get the hang.
I added some MPI_BARRIERs and I see that process 1 was stuck at the barrier while 2 and 3 were stuck at the first allreduce.
IF (N_MPI_PROCESSES>1) THEN
CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
CALL MPI_ALLREDUCE(MPI_IN_PLACE,DSUM_ALL(1),N_ZONE,MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,IERR)
Check if your cases are hanging in this vicinity.
Ok, I'm testing the cases with impi first in the firebot queue. The balanced cases went through fine. I'm testing the uneven cases now. The MPI_ALLREDUCE call looks fine. I'm working from your template input file. I build the input files and submit them with this script:
#!/bin/bash
myqfds=/home/mnv/FireModels_fork_home/fds/Utilities/Scripts/qfds.sh
for n in $(seq 1 64); do
    if [ "$n" -lt 10 ]; then
        # Single-digit case numbers are zero-padded: simple_case_01 ... simple_case_09
        echo "0$n"
        cp simple_caseNNN.fds "simple_case_0$n.fds"
        sed -i "s/NNN/_0$n/g" "simple_case_0$n.fds"
        "$myqfds" -p 4 -n 1 -T dv -q firebot "simple_case_0$n.fds"
    else
        echo "$n"
        cp simple_caseNNN.fds "simple_case_$n.fds"
        sed -i "s/NNN/_$n/g" "simple_case_$n.fds"
        "$myqfds" -p 4 -n 1 -T dv -q firebot "simple_case_$n.fds"
    fi
done
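As a side note, the two branches of that loop differ only in the zero-padding of the case number, so it could probably be collapsed with printf. The following is a minimal sketch, not the script that was actually run: `case_suffix` and the `DRY_RUN` switch are hypothetical names added for illustration, and by default it only prints the qfds commands instead of submitting anything.

```shell
#!/bin/bash
# Sketch only: same case naming as the submit script, with printf doing the padding.
# case_suffix and DRY_RUN are illustrative names, not part of the original script.
template=simple_caseNNN.fds
myqfds=/home/mnv/FireModels_fork_home/fds/Utilities/Scripts/qfds.sh

# Print the zero-padded suffix that replaces NNN, e.g. 7 gives _07.
case_suffix() {
    printf '_%02d' "$1"
}

for n in $(seq 1 64); do
    suffix=$(case_suffix "$n")
    case_file="simple_case${suffix}.fds"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: show what would be submitted, touch nothing.
        echo "$myqfds -p 4 -n 1 -T dv -q firebot $case_file"
    else
        # Replace every NNN in the template and write the per-case input file.
        sed "s/NNN/${suffix}/g" "$template" > "$case_file"
        "$myqfds" -p 4 -n 1 -T dv -q firebot "$case_file"
    fi
done
```

Running it with `DRY_RUN=0` in the directory that holds simple_caseNNN.fds should generate and submit the same 64 cases as the original loop.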
I had three hangs for the uneven cases with the impi -T dv options, but could not retrieve backtrace information pointing to the FDS source. The ompi_gnu_linux cases did not hang. We can try changing the compilation flags for the impi dv target to see if we get more information.
When I run 64 instances of this case in the firebot queue, all are successful. However, if I switch to the uneven meshes, some will fail. These cases do not have to be run with other cases, like the verification cases. They can be run alone.
Run 64 cases on firebot, designating one mesh per node.