etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

Critical lock-up bug on BG/Q #264

Open kostrzewa opened 11 years ago

kostrzewa commented 11 years ago

I would classify this as a critical problem which needs to be resolved, because it has wasted at least one rack-day of computing time. Sometimes the code will simply time out during the SPI exchange, resulting in a lock-up instead of a clean exit.

# CG: flopcount (for e/o tmWilson only): t/s: 7.5805e-01 mflops_local: 13630.1 mflops: 13957201.4
# Time for cloverdet monomial derivative: 9.834480e-01 s
MUSPI_GIBarrierPollWithTimeout failed returned rc = -1
MUSPI_GIBarrierPollWithTimeout failed returned rc = -1
MUSPI_GIBarrierPollWithTimeout failed returned rc = -1
MUSPI_GIBarrierPollWithTimeout failed returned rc = -1
MUSPI_GIBarrierPollWithTimeout failed returned rc = -1
MUSPI_GIBarrierPollWithTimeout failed returned rc = -1

I'm pretty sure the condition can be caught and reacted to accordingly. I don't really know how to fix this though. @urbach

kostrzewa commented 11 years ago

Oh, I guess all that's necessary is to finalize MPI in global_barrier() in DirectPut.c (or use the fatal_error() function)

kostrzewa commented 11 years ago

Fixing it right now; will do a pull request on the InterleavedNDTwistedClover branch.

kostrzewa commented 11 years ago

Is it correct that upon failure of msg_InjFifoInit, the code does not exit?

Oh, I see, the calling code handles the error here.

kostrzewa commented 11 years ago

Darn, I tried my fatal_error change and it seems the program just locks up and doesn't even get to "exit" or "fatal_error"... I think I'll have to investigate the whole send-counter polling.

kostrzewa commented 11 years ago

Okay, I got a bit confused about where this timeout was happening, so the send polling is probably irrelevant. I don't know what's happening though; it seems the code is simply unable to exit cleanly...

kostrzewa commented 11 years ago

Do you think it would be possible to use the job script to periodically (say every 2 minutes or so) poll for the error in the output file and cancel the job in case it occurs?

kostrzewa commented 11 years ago

I seem to have discovered a pattern in the failure. All of the failures occurred when the mapping was EABCDT and the loadl_shape was 2x1x1x1... Weird, no?

kostrzewa commented 11 years ago

Okay, so I've come up with this; I just don't know if SPI_ERROR will be set correctly, since the function "catch_SPI_error" is presumably launched in its own subshell.

# @ job_name         = iwa_b2.10-L48T96-csw1.57551-k0.137290-mu0.0009
# @ error            = $(job_name).$(jobid).out
# @ output           = $(job_name).$(jobid).out
# @ environment      = COPY_ALL;
# @ wall_clock_limit = 24:00:00
# @ notification     = always
# @ notify_user      = bartosz.kostrzewa@desy.de
# @ job_type         = bluegene
# @ bg_connectivity  = TORUS
# @ bg_size          = 1024
# @ queue

export NP=1024
export NPN=1
export RUNDIR=runs/nf2
export OMP_NUM_THREADS=64
export NAME="iwa_b2.10-L48T96-csw1.57551-k0.137290-mu0.0009"
export WDIR=${WORK}/${RUNDIR}/${NAME}
#export EXEC=${HOME}/code/tmLQCD.kost/build_4D_hybrid_hs_lemon_unstable/hmc_tm
export EXEC=${HOME}/code/tmLQCD.kost/build_4D_hybrid_hs/hmc_tm
export SPI_ERROR=0

# poll the job output file for the MUSPI error message and cancel the job if it appears
catch_SPI_error() {
  while [ 1 -eq 1 ]; do
    if [[ ! -e ${LOADL_STEP_OUT} ]]; then
      sleep 20
    else
      grep -q 'MUSPI' ${LOADL_STEP_OUT}
      GREPRETVAL=$?
      if test $GREPRETVAL -eq 0; then
        llcancel ${LOADL_JOB_NAME}
        export SPI_ERROR=1
        break 
      else
        sleep 60
      fi
    fi
  done
}

if [[ ! -d ${WDIR} ]]
then
  mkdir -p ${WDIR}
fi

cp hmc.input ${WDIR}

cd ${WDIR}

date

echo loadl shape is $LOADL_BG_SHAPE
# default mapping; the case statement below overrides it based on the partition shape
export MP=EABCDT

case ${LOADL_BG_SHAPE} in
  2x1x1x1 )
    MP=EABCDT 
  ;;

  1x2x1x1 )
    MP=EBACDT
  ;;

  1x1x2x1 )
    MP=ECABDT
  ;;

  1x1x1x2 )
    MP=EDABCT
  ;;
esac

echo mapping is ${MP}

# run background function to catch the SPI error
catch_SPI_error &

runjob --mapping ${MP} --envs "MUSPI_NUMINJFIFOS=8" --envs "MUSPI_NUMRECFIFOS=8" --envs "MUSPI_NUMBATIDS=2" --ranks-per-node ${NPN} --np ${NP} --cwd ${WDIR} --exe ${EXEC}
RETVAL=$?

if test $SPI_ERROR -eq 1; then
  RETVAL=666
fi

date

exit $RETVAL
kostrzewa commented 11 years ago

Okay, I tested the abort script with some modifications and it works fine in a test case. It doesn't even matter whether SPI_ERROR is set or not, because that output will never be reached; I will add some output in the function instead.
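
For reference, the subshell concern above is real: a function started with & runs in a child process, so exporting SPI_ERROR inside catch_SPI_error cannot change the variable in the parent script. A minimal illustration (not part of the job script):

SPI_ERROR=0
set_flag() { export SPI_ERROR=1; }
# the backgrounded call runs in a subshell, so its assignment is lost
set_flag &
wait
echo ${SPI_ERROR}   # still prints 0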

kostrzewa commented 11 years ago

This provides a working way to save compute time while we try to fix the bug:

export OFILE=${LOADL_STEP_INITDIR}/${LOADL_STEP_OUT}
catch_SPI_error() {
  while [ 1 -eq 1 ]; do
    if [[ ! -e ${OFILE} ]]; then
      sleep 20
    else
      grep -q 'MUSPI' ${OFILE}
      GREPRETVAL=$?
      if test $GREPRETVAL -eq 0; then
        echo "----- SPI error detected in output file, calling llcancel! -----"
        llcancel ${LOADL_STEP_ID}
        break
      else
        sleep 60
      fi
    fi
  done
}

catch_SPI_error &

runjob ...

Of course, ${OFILE} may be different depending on what has been set (i.e. whether the output is written to the initial working directory from which the job was submitted).
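
An untested sketch of how ${OFILE} could be made to cover both cases, treating an absolute LOADL_STEP_OUT as-is and otherwise prefixing the submit directory:

# sketch: handle absolute and relative LOADL_STEP_OUT (untested)
case ${LOADL_STEP_OUT} in
  /*) export OFILE=${LOADL_STEP_OUT} ;;
  *)  export OFILE=${LOADL_STEP_INITDIR}/${LOADL_STEP_OUT} ;;
esac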

The full job script is in:

/homeb/pra073/pra07309/runs/nf2/iwa_b2.10-L48T96-csw1.57551-k0.137290-mu0.0009