Open kostrzewa opened 11 years ago
Oh, I guess all that's necessary is to finalize MPI in global_barrier() in DirectPut.c (or use the fatal_error() function)
Fixing it right now, will do pull-request on InterleavedNDTwistedClover branch
Is it correct that upon failure of msg_InjFifoInit, the code does not exit?
Oh, I see, the calling code handles the error here.
Darn, I tried out my change introducing fatal_error and it seems that the program just locks up and doesn't even get to "exit" or "fatal_error"... I think I'll have to investigate the whole send counter polling...
Okay, I got a bit confused as to where this timeout was happening so the send polling is probably irrelevant. I don't know what's happening though, it seems like the code is just simply unable to exit cleanly...
Do you think it would be possible to use the job script to periodically (say every 2 minutes or so) poll for the error in the output file and cancel the job in case it occurs?
I seem to have discovered a pattern in the failure. All of the failures occurred when the mapping was EABCDT and the loadl_shape was 2x1x1x1... Weird, no?
Okay, so I've come up with this. I just don't know whether SPI_ERROR will be set correctly, since the function "catch_SPI_error" is arguably launched in its own shell.
# @ job_name = iwa_b2.10-L48T96-csw1.57551-k0.137290-mu0.0009
# @ error = $(job_name).$(jobid).out
# @ output = $(job_name).$(jobid).out
# @ environment = COPY_ALL;
# @ wall_clock_limit = 24:00:00
# @ notification = always
# @ notify_user = bartosz.kostrzewa@desy.de
# @ job_type = bluegene
# @ bg_connectivity = TORUS
# @ bg_size = 1024
# @ queue
export NP=1024
export NPN=1
export RUNDIR=runs/nf2
export OMP_NUM_THREADS=64
export NAME="iwa_b2.10-L48T96-csw1.57551-k0.137290-mu0.0009"
export WDIR=${WORK}/${RUNDIR}/${NAME}
#export EXEC=${HOME}/code/tmLQCD.kost/build_4D_hybrid_hs_lemon_unstable/hmc_tm
export EXEC=${HOME}/code/tmLQCD.kost/build_4D_hybrid_hs/hmc_tm
export SPI_ERROR=0
catch_SPI_error() {
while [ 1 -eq 1 ]; do
if [[ ! -e ${LOADL_STEP_OUT} ]]; then
sleep 20
else
grep -q 'MUSPI' ${LOADL_STEP_OUT}
GREPRETVAL=$?
if test $GREPRETVAL -eq 0; then
llcancel ${LOADL_JOB_NAME}
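# NOTE: the function is started with '&' and so runs in a background subshell;
# this export is not visible to the main script
export SPI_ERROR=1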
break
else
sleep 60
fi
fi
done
}
if [[ ! -d ${WDIR} ]]
then
mkdir -p ${WDIR}
fi
cp hmc.input ${WDIR}
cd ${WDIR}
date
echo loadl shape is $LOADL_BG_SHAPE
export MP=EABCDT
case ${LOADL_BG_SHAPE} in
2x1x1x1 )
MP=EABCDT
;;
1x2x1x1 )
MP=EBACDT
;;
1x1x2x1 )
MP=ECABDT
;;
1x1x1x2 )
MP=EDABCT
;;
esac
echo mapping is ${MP}
# run background function to catch the SPI error
catch_SPI_error &
runjob --mapping ${MP} --envs "MUSPI_NUMINJFIFOS=8" --envs "MUSPI_NUMRECFIFOS=8" --envs "MUSPI_NUMBATIDS=2" --ranks-per-node ${NPN} --np ${NP} --cwd ${WDIR} --exe ${EXEC}
RETVAL=$?
if test $SPI_ERROR -eq 1; then
RETVAL=666
fi
date
exit $RETVAL
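On the question of whether a variable set inside a backgrounded function reaches the main script: in bash, a function started with `&` runs in a subshell, so its `export` stays there. The following is a minimal sketch of that behaviour, together with one possible alternative (a marker file). `FLAG_DIR`, `FLAG_FILE`, `FLAG_SEEN` and `watcher` are hypothetical names used only for illustration, and `wait` stands in for the place where `runjob` would run.

```shell
#!/bin/bash
# Sketch only: FLAG_DIR, FLAG_FILE, FLAG_SEEN and watcher() are hypothetical
# names for illustration; they are not part of the job script above.
SPI_ERROR=0
FLAG_DIR=$(mktemp -d)
FLAG_FILE=${FLAG_DIR}/spi_error

watcher() {
  # This assignment only changes the background subshell's copy.
  export SPI_ERROR=1
  # A file, in contrast, is visible to the parent script.
  touch "${FLAG_FILE}"
}

watcher &
wait $!   # in the real script, runjob would run here instead

echo "SPI_ERROR in parent: ${SPI_ERROR}"

if [[ -e ${FLAG_FILE} ]]; then
  FLAG_SEEN=1
else
  FLAG_SEEN=0
fi
echo "flag file seen: ${FLAG_SEEN}"

rm -rf "${FLAG_DIR}"
```

In this sketch the parent still sees `SPI_ERROR=0` after the watcher has finished, while the marker file is found.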
Okay, I tested the abort script with some modifications and it works fine in a test case. It doesn't even matter whether SPI_ERROR is set, because the code after runjob will never be reached; I will add some output in the function instead.
This provides a working way to save compute time while we try to fix the bug.
export OFILE=${LOADL_STEP_INITDIR}/${LOADL_STEP_OUT}
catch_SPI_error() {
while [ 1 -eq 1 ]; do
if [[ ! -e ${OFILE} ]]; then
sleep 20
else
grep -q 'MUSPI' ${OFILE}
GREPRETVAL=$?
if test $GREPRETVAL -eq 0; then
echo "----- SPI error detected in output file, calling llcancel! -----"
llcancel ${LOADL_STEP_ID}
break
else
sleep 60
fi
fi
done
}
catch_SPI_error &
runjob ...
Of course, ${OFILE} may differ depending on the job settings (that is, on whether the output is written to the initial working directory from which the job was submitted).
Full job script in:
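One case the watcher above does not cover is a run that finishes without any SPI error, in which the loop would keep polling. A possible variation, sketched here under stated assumptions, records the watcher's PID and stops it once the main command returns. In this sketch `OFILE` is a temporary file, `cancel_job` is a hypothetical stand-in for `llcancel ${LOADL_STEP_ID}`, a short `sleep` stands in for `runjob`, and the 20/60 second waits are shortened to one second so the example finishes quickly.

```shell
#!/bin/bash
# Sketch only: cancel_job(), CANCEL_LOG, WATCHER_PID and CANCELLED are
# hypothetical names; cancel_job stands in for "llcancel ${LOADL_STEP_ID}".
OFILE=$(mktemp)
CANCEL_LOG=$(mktemp)

cancel_job() {
  echo "cancel requested" >> "${CANCEL_LOG}"
}

catch_SPI_error() {
  while true; do
    if [[ -e ${OFILE} ]] && grep -q 'MUSPI' "${OFILE}"; then
      echo "----- SPI error detected in output file -----"
      cancel_job
      break
    fi
    sleep 1
  done
}

catch_SPI_error &
WATCHER_PID=$!

# Stand-in for runjob: writes normal output and finishes without 'MUSPI'.
echo "normal output" >> "${OFILE}"
sleep 2
RETVAL=$?

# Stop the watcher so it does not outlive the main command.
kill "${WATCHER_PID}" 2>/dev/null || true
wait "${WATCHER_PID}" 2>/dev/null || true

# Record whether the watcher asked for a cancel (it should not have here).
if [[ -s ${CANCEL_LOG} ]]; then
  CANCELLED=1
else
  CANCELLED=0
fi
rm -f "${OFILE}" "${CANCEL_LOG}"
```

Whether this cleanup is needed on the actual system depends on how the batch system treats leftover background processes when the job script exits, which the thread does not state.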
/homeb/pra073/pra07309/runs/nf2/iwa_b2.10-L48T96-csw1.57551-k0.137290-mu0.0009
I would classify this as a critical problem which needs to be resolved, because it has wasted at least one rack-day of computing time. Sometimes the code will just time out during SPI exchange, resulting in a lockup instead of exiting.
I'm pretty sure the condition can be caught and reacted to accordingly. I don't really know how to fix this though. @urbach