Pallavi-Banerjee21 / votca

Automatically exported from code.google.com/p/votca

functions_dlpoly.sh has to properly check for dlpoly checkpoints and simulation finish #156

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. run csg_inverse (with dlpoly as engine) on a cluster, within a queue
2. if the wall-time limit is reached while dlpoly is running, dlpoly will 
either stop abnormally or exit due to the expiration time set in CONTROL (<= 
job wall-time)
3. at this point the HISTORY, REVIVE, REVOLD and STATIS files may all be 
present, although the simulation has not run to completion (it may even have 
stopped very close to the beginning!)

What is the expected behaviour? Something like the following.

As checkpoint files one can use REVIVE and REVOLD.
As a test for the simulation having been actually done in full:

STOP="NO"

  if [[ -s OUTPUT ]]; then

    ERROR="$(grep "terminated due to error" OUTPUT)"
    NPASSED="$(grep "run terminated after" OUTPUT | awk -F ' ' '{print $'4'}')"
    NSTEPS="$(grep 'selected number of timesteps' OUTPUT | awk -F ' ' '{print $'5'}')"

    if [[ -n "${ERROR}" ]]; then
    #The job has STOPPED due to ERROR

      echo " "
      echo "$(basename $(readlink -f $0)):: exit on recognised error - check the log file(s)"
      echo " "
      STOP="YES"

    elif [[ -n "${NPASSED}" ]]; then
    #The job has gone well

      if (( "${NSTEPS}" > "${NPASSED}" )); then
      #The job has not been finished and can be restarted
        echo " "
        echo "$(basename $(readlink -f $0)):: n_steps = ${NSTEPS} > n_passed = ${NPASSED} -> restarting simulation"
        echo " "
      else
      #The job has been finished normally (no restart)
        echo " "
        echo "$(basename $(readlink -f $0)):: n_steps = ${NSTEPS} <= n_passed = ${NPASSED} -> simulation done"
        echo " "
        STOP="YES"
      fi
    fi

    cat OUTPUT >> OUTPUT-tot
  fi

Original issue reported on code.google.com by abruk...@googlemail.com on 1 Mar 2014 at 11:51

GoogleCodeExporter commented 8 years ago
Next thing, I will have a look at it myself!

Sorry, did not see where to set the owner, obviously it's me :)

Original comment by abruk...@googlemail.com on 1 Mar 2014 at 11:56

GoogleCodeExporter commented 8 years ago
This looks very brittle to me. Could you just check whether the simulation is 
unfinished (REVCON doesn't exist) and a checkpoint exists, and then run dl_poly 
again? If dlpoly runs, great; if not, the user has to do some work by hand anyway.

And the check should be split into checkpoint_exist() and simulation_finish() 
(see inverse.sh line 250 for the logic).

Technically:
- readlink, why?
- basename $0 -> ${0##*/}
- grep XXX file | awk '{..}' -> awk '/XXX/{..}' file

Original comment by christop...@gmail.com on 2 Mar 2014 at 1:39
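
The two shell simplifications suggested in the list above can be sketched as 
follows. This is only an illustration: the helper names script_name and 
field_of_match are hypothetical and do not exist in VOTCA.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only (not part of VOTCA).

# ${path##*/} strips the longest prefix matching "*/", which gives the
# same result as basename for ordinary paths without a trailing slash.
script_name() {
  local path="$1"
  echo "${path##*/}"
}

# A single awk call replaces "grep PATTERN file | awk '{print $N}'".
# Pattern and field number are passed in with -v, so the awk program
# itself can stay inside single quotes.
field_of_match() {
  local pattern="$1" field="$2" file="$3"
  awk -v pat="$pattern" -v n="$field" '$0 ~ pat {print $n}' "$file"
}
```

Inside a script, script_name "$0" then plays the role of basename $0 without 
spawning an extra process.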

GoogleCodeExporter commented 8 years ago
REVCON and REVIVE are periodically created during the simulation run, and are 
needed for restarting an abnormally stopped simulation job. So, these two files 
are the "checkpoint" for DL_POLY, and do not tell if the job went all the way 
to the end.

The only way to be sure that the simulation has actually finished is to 
check the contents of OUTPUT, as I have shown in that extract (from a working 
self-resubmission script). 

I agree certain things there can be simplified, or done in a more compact way.

Original comment by abruk...@googlemail.com on 3 Mar 2014 at 1:25

GoogleCodeExporter commented 8 years ago
Correction: REVCON and REVIVE are periodically updated...

Original comment by abruk...@googlemail.com on 3 Mar 2014 at 1:26
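
The distinction drawn in the two comments above (REVIVE and REVCON only say 
that a restart is possible, while OUTPUT says whether the run completed) can be 
sketched as a small classifier. This is a minimal sketch under assumptions: the 
function name dlpoly_run_state and its three result strings are hypothetical, 
and the layout of the OUTPUT lines is inferred from the awk field numbers used 
in this thread rather than taken from the DL_POLY documentation.

```shell
#!/bin/bash
# Hypothetical helper (not part of VOTCA): classify a DL_POLY run directory.
# Assumed OUTPUT line layout, inferred from the field numbers in this thread:
#   "selected number of timesteps <N>"   -> step count is field 5
#   "run terminated after <N> steps"     -> step count is field 4
dlpoly_run_state() {
  local dir="${1:-.}" nneeded npassed
  if [[ -s "$dir/OUTPUT" ]]; then
    nneeded=$(awk '/selected number of timesteps/{print $5; exit}' "$dir/OUTPUT")
    npassed=$(awk '/run terminated after/{print $4; exit}' "$dir/OUTPUT")
    # Only report "finished" when both numbers were found and enough steps ran.
    if [[ -n $nneeded && -n $npassed ]] && (( npassed >= nneeded )); then
      echo finished
      return 0
    fi
  fi
  # Not finished: a restart needs both checkpoint files.
  if [[ -f "$dir/REVIVE" && -f "$dir/REVCON" ]]; then
    echo restartable
  else
    echo fresh
  fi
}
```

Requiring both numbers to be non-empty avoids treating an OUTPUT file that was 
cut off before the termination line as a completed run.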

GoogleCodeExporter commented 8 years ago
Christoph, what purpose is the code below supposed to serve?
My understanding is that finding HISTORY is considered as a "simulation 
finished" flag. But why "touch .dlph/.dlpf", do you check them anywhere later?

simulation_finish() { #checks if simulation is finished
  local traj topol
  if [[ -f "HISTORY" ]]; then
    #hacky workaround as topol/traj is called '.dlph/.dlpf'
    traj=$(csg_get_property cg.inverse.dlpoly.traj)
    critical touch $traj
    topol=$(csg_get_property cg.inverse.dlpoly.topol)
    critical touch $topol
    return 0
  fi
  return 1
}

Original comment by abruk...@googlemail.com on 3 Mar 2014 at 1:46

GoogleCodeExporter commented 8 years ago
The following functions should be sufficient (?).
===
simulation_finish() { #checks if simulation is finished
  local nneeded npassed
  [[ ! -f "HISTORY" ]] && return 1
  [[ ! -s "OUTPUT"  ]] && return 1
  nneeded=$(awk '/selected number of timesteps/{print $'5'}' OUTPUT)
  npassed=$(awk '/run terminated after/{print $'4'}' OUTPUT)
  [[ $npassed -lt $nneeded ]] && return 1
  return 0
}
export -f simulation_finish

checkpoint_exist() { #check if a checkpoint exists (REVIVE _and_ REVCON - both are needed!)
  #support for checkpoint
  local checkpoint check
  checkpoint="$(csg_get_property --allow-empty cg.inverse.dlpoly.checkpoint)"
  [[ -z $checkpoint ]] && checkpoint="REVIVE REVCON"
  checkpoint=($checkpoint)
  for check in "${checkpoint[@]}"; do
    [[ ! -f ${check} ]] && return 1
  done
  return 0
}
export -f checkpoint_exist
===

Original comment by abruk...@googlemail.com on 3 Mar 2014 at 3:04

GoogleCodeExporter commented 8 years ago
Looks not too bad, but I have the following comments:
1.) touching .dlph/.dlpf is needed, as many update scripts check for the 
existence of the topology and trajectory files.
2.) simulation_finish(): only if cg.inverse.dlpoly.traj == .dlph, HISTORY 
should be used
3.) simulation_finish(): what is the point of quoting 4 outside of the string? 
$variable isn't expanded in single quotes. (echo '$HOME' vs echo "$HOME")
4.) simulation_finish(): return code is automatically the result of the last 
command, so "[[ $npassed -ge $nneeded ]]" would do it.
5.) checkpoint_exist(): defaults should go into csg_defaults.xml; also, putting 
it into an array does not make much sense (it would help with spaces in the 
filename, but csg_get_property doesn't handle spaces in filenames).

We could improve the .dlp* hackery by moving HISTORY to HISTORY.dlph etc. in 
run_dlpoly.sh, but then we would lose dlpoly's naming scheme.

Original comment by christop...@gmail.com on 4 Mar 2014 at 1:08
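
Points 3 and 4 above are general shell behaviour and can be checked in 
isolation. A minimal sketch; the function names fourth_field and enough_steps 
are hypothetical and used only for illustration.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only (not part of VOTCA).

# Point 3: inside single quotes the shell expands nothing, so awk receives
# the text $4 literally and interprets it as "fourth field" itself.
fourth_field() {
  awk '{print $4}' <<< "$1"
}

# Point 4: a function's exit status is the status of its last command,
# so the [[ ]] test can act as the return value without "return".
enough_steps() {
  local npassed="$1" nneeded="$2"
  [[ $npassed -ge $nneeded ]]
}
```

With these definitions, "enough_steps 1000 1000" succeeds and 
"enough_steps 999 1000" fails, so the function can be used directly in an if 
statement.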

GoogleCodeExporter commented 8 years ago
1/2) OK, I see now how "touch .dlph/.dlpf" works (actually encountered an error 
when not doing it). So I simply reinsert those commands after the initial 
checks on HISTORY and OUTPUT. I think having zero-size "hidden" files as 
"finished" flags for checking by votca scripts is alright.
3) OK, removed the quotes.
4) I did not know that the return code is taken automatically from the last 
command, but I would still use "return" for clarity (not that any user is going 
to look into it, though).
5) Yep, I put "REVIVE REVCON" in csg_defaults.xml, so the emptiness test on 
$checkpoint is no longer needed. Did you mean we can also remove 
checkpoint=($checkpoint)?

Original comment by abruk...@googlemail.com on 4 Mar 2014 at 1:31
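
On the question in point 5: when the file list is a plain space-separated 
string without spaces inside the names, ordinary word splitting of the unquoted 
variable is enough and no array is needed. A minimal sketch, assuming the list 
is passed as an argument instead of being read with csg_get_property; the name 
files_all_exist is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of VOTCA): succeed only if every file named
# in a space-separated list exists in the given directory.
files_all_exist() {
  local list="$1" dir="${2:-.}" f
  # $list is intentionally unquoted so that the shell splits it on whitespace.
  for f in $list; do
    [[ -f "$dir/$f" ]] || return 1
  done
  return 0
}
```

Note that an empty list makes the loop body never run, so the function reports 
success; a caller that wants "no checkpoint configured" to count as failure has 
to test for that separately.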

GoogleCodeExporter commented 8 years ago

checkpoint_exist() { #check if REVIVE REVCON exist - list comes from csg_defaults.xml
  #support for checkpoints
  local checkpoint check
  checkpoint="$(csg_get_property cg.inverse.dlpoly.checkpoint)"
  for check in $checkpoint; do
    [[ ! -f ${check} ]] && return 1
  done
  echo "DL_POLY checkpoint present (REVIVE REVCON found)"
  return 0
}
export -f checkpoint_exist

Original comment by abruk...@googlemail.com on 4 Mar 2014 at 1:42

GoogleCodeExporter commented 8 years ago
Checked that it's working fine and committed/pushed it along with some earlier 
bug fixes (see my clone).

Original comment by abruk...@googlemail.com on 4 Mar 2014 at 4:29

GoogleCodeExporter commented 8 years ago
fixed

How do I close or discard it?

Original comment by abruk...@googlemail.com on 21 Mar 2014 at 12:51

GoogleCodeExporter commented 8 years ago
You were missing the "EditIssue" permission.

Original comment by christop...@gmail.com on 21 Mar 2014 at 2:55