Leeds-MONC / monc

MONC (Leeds fork)
BSD 3-Clause "New" or "Revised" License

Continuation script for Arc4 needed #50

Open cemac-ccs opened 3 years ago

cemac-ccs commented 3 years ago

For PBS systems (and, in the forthcoming ARCHER2 changes, for Slurm systems too), the monc/misc folder contains a continuation script for running continuation jobs. It uses dependency chains so that each job starts only after the previous job has completed, restarting from the previous job's checkpoint.

A similar script can be written for the ARC4 SGE system using the qsub -hold_jid flag, which does the same job as the Slurm --dependency flag.
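A minimal sketch of such a chain on SGE is given here. The names JOB_SCRIPT, N_JOBS, QSUB, hold_args and submit_chain are hypothetical and do not come from the repository; the only scheduler features assumed are qsub's -terse option (print just the job ID) and -hold_jid.

```shell
#!/bin/bash
# Hypothetical sketch: submit a chain of MONC continuation jobs on SGE.
# JOB_SCRIPT, N_JOBS and QSUB are illustrative names, not repository ones.
JOB_SCRIPT="${JOB_SCRIPT:-submonc.sge}"
N_JOBS="${N_JOBS:-3}"
QSUB="${QSUB:-qsub}"

# Print the qsub options for one link in the chain.
# $1: ID of the job to wait for (empty for the first job).
hold_args() {
  if [ -n "${1}" ] ; then
    echo "-terse -hold_jid ${1}"
  else
    echo "-terse"
  fi
}

# Submit N_JOBS copies of JOB_SCRIPT, each held until the previous one ends.
submit_chain() {
  local prev_id=""
  local i
  for (( i = 1; i <= N_JOBS; i++ )) ; do
    # -terse makes qsub print only the job ID, which feeds the next -hold_jid
    prev_id=$(${QSUB} $(hold_args "${prev_id}") "${JOB_SCRIPT}") || return 1
    echo "Submitted job ${i}: ${prev_id}"
  done
}
```

Calling submit_chain at the end of the script would submit the whole chain in one go. If every job in the chain runs the completion check quoted below, any links left over after the model has reached its termination time should exit almost immediately.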

The job submission script used by @craigpoku includes a function that checks whether the run has already completed, which provides part of this functionality. That script and monc/misc/continuation.sh would be a good place to start. The relevant code is below:

# --- Checks:

# Check for run completion message in monc output file:
function check_complete() {
  if [ -r "${MONC_OUT}" ] ; then
    # grep -q prints nothing and returns 0 when the message is found
    if grep -q 'Model run complete due to model time' "${MONC_OUT}" 2> /dev/null ; then
      echo 'MONC run appears to have completed (exceeded termination time)'
      # Display end time:
      echo "END TIME: $(date)"
      exit 0
    fi
  fi
}
check_complete

# Check for previous checkpoint file:
if [ -r "${MONC_OUT}" ] ; then
  PREV_CKPT_FILE=$(basename $(grep \
                     'Restarted configuration from checkpoint file' \
                     "${MONC_OUT}" | grep -Eo '[0-9a-zA-Z_/-]+\.nc'))
fi
# Check for most recent existing checkpoint file:
CKPT_FILE=$(basename $(\ls -1v "${CKPT_DIR}" | tail -n 1) 2> /dev/null)
# If current checkpoint file is same as previous, give up:
if [ ! -z "${PREV_CKPT_FILE}" ] && [ ! -z "${CKPT_FILE}" ] ; then
  if [ "${PREV_CKPT_FILE}" = "${CKPT_FILE}" ] ; then
    echo "Previous checkpoint file is same as current (${CKPT_FILE})"
    # Display end time:
    echo "END TIME: $(date)"
    exit 1
  fi
fi

# If we have a checkpoint file, restart MONC, else, start from config:
if [ ! -z "${CKPT_FILE}" ] ; then
  MONC_ARGS="--checkpoint=${CKPT_DIR}/${CKPT_FILE}"
else
  MONC_ARGS="--config=${MONC_CONFIG}"
fi

cemac-ccs commented 3 years ago

@gyoung410 has modified a script in her fork of this repo. The qsub command in the continuation script is unchanged and still uses PBS syntax, so it does not fully work, but the script shows some of the necessary changes.
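To illustrate the syntax difference, here is a rough sketch of a helper that prints the job-dependency option for each scheduler discussed in this issue. The function name dependency_flag is hypothetical, and the use of afterok for PBS and Slurm is an assumption rather than something taken from monc/misc/continuation.sh.

```shell
#!/bin/bash
# Hypothetical helper (not part of monc/misc/continuation.sh): print the
# option that makes a new job wait for job $2 on scheduler $1.
# The afterok dependency type for PBS and Slurm is an assumption.
dependency_flag() {
  local scheduler="${1}"
  local job_id="${2}"
  case "${scheduler}" in
    pbs)   echo "-W depend=afterok:${job_id}" ;;
    slurm) echo "--dependency=afterok:${job_id}" ;;
    sge)   echo "-hold_jid ${job_id}" ;;
    *)     echo "Unknown scheduler: ${scheduler}" >&2 ; return 1 ;;
  esac
}
```

For example, dependency_flag sge 12345 prints the option to pass to qsub on ARC4 in place of the PBS form.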