abhilekhsingh / gc3pie

Automatically exported from code.google.com/p/gc3pie

Have separate error codes depending on the cause of job failure #163

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
1) Which parts of the model would need to change?

Basically, two places:

* `Run.Signal` (new signal/error codes to be introduced)
* the backends/LRMSes (detect error conditions and flag the appropriate code)

2) Why do we need this?

In order to implement meaningful resubmission/retry, we need to
distinguish the cause of job failure in a detailed way: it makes no
sense to resubmit a job that failed because of wrong requirements
(e.g., memory limit too tight), but it makes perfect sense to retry a
job that failed because of a problem at the remote site (and submit it
to a different site).
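The retry reasoning above can be sketched in a few lines of Python. This is only an illustration: `FailureCause`, `resubmission_plan`, and the site names are hypothetical and not part of GC3Pie, and not retrying failures of unknown cause is an assumption made here, not something decided in this issue.

```python
from enum import Enum


class FailureCause(Enum):
    # Hypothetical categories, named for illustration only.
    BAD_REQUIREMENTS = "bad requirements"   # e.g., memory limit too tight
    REMOTE_SITE = "remote site problem"
    UNKNOWN = "unknown"


def resubmission_plan(cause, failed_site, all_sites):
    """Return the sites a failed job may be resubmitted to.

    An empty list means "do not resubmit".
    """
    if cause is FailureCause.REMOTE_SITE:
        # The site is to blame: retry, but anywhere except the failed site.
        return [site for site in all_sites if site != failed_site]
    # Wrong requirements would fail again on any site; unknown causes
    # are conservatively not retried (an assumption of this sketch).
    return []


sites = ["siteA", "siteB", "siteC"]
print(resubmission_plan(FailureCause.REMOTE_SITE, "siteA", sites))
print(resubmission_plan(FailureCause.BAD_REQUIREMENTS, "siteA", sites))
```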

The current system does not discriminate enough: most jobs fail with
signal 124 (generic batch system failure).

The problem is that ARC and SGE do not provide any way of reliably
getting this information.  We currently have some heuristics in the
code, but they are used only to generate a record in the job
history/log.

3) What's the proposal?

Be as specific as we can in reporting errors.  Trust ARC and SGE, but
use heuristics if no authoritative information is available or if it's
too generic.

The following errors need to be assigned separate codes/signals:

* memory limit exceeded (signal 120); heuristics: if "used memory" >=
  "requested memory"; possible problem: SGE gives bogus usage
  statistics if the job has failed.

* wall-clock time limit exceeded (signal 119); heuristics: if "used
  wall-clock time" >= "requested wall-clock time"

* could not run command (signal 118); heuristics: if exit code == 127
  (code 127 is what the shell uses if you try to run a non-existent
  command; a file that exists but is not executable gives 126 instead)

* remote site didn't deliver (signal 117), e.g., there is not enough
  space to transfer all input files, or could not qsub/bsub/whatever the job.

  Note that signal 117 differs from the generic batch system failure
  (signal 124) in that you can assume that "it's the remote site's
  fault" and so it makes sense to resubmit to a different site.
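As a rough sketch of how a backend could apply the heuristics listed above, assuming it can obtain the exit code and the used/requested resource figures: the signal numbers are the ones proposed in this issue, while `FailureSignal`, `classify_failure`, its parameters, and the order of the checks are hypothetical and not existing GC3Pie code.

```python
from enum import IntEnum


class FailureSignal(IntEnum):
    # Values as proposed in this issue; the names are made up here.
    REMOTE_SITE_ERROR = 117
    COMMAND_NOT_RUN = 118
    WALLTIME_EXCEEDED = 119
    MEMORY_EXCEEDED = 120
    GENERIC_BATCH_FAILURE = 124


def classify_failure(exit_code,
                     used_memory=None, requested_memory=None,
                     used_walltime=None, requested_walltime=None,
                     site_error=False):
    """Map what is known about a failed job to one of the signals.

    `site_error` stands for "staging or qsub/bsub failed", which the
    backend has to detect itself.  Memory and time may use any unit,
    as long as "used" and "requested" agree.
    """
    if site_error:
        return FailureSignal.REMOTE_SITE_ERROR
    if exit_code == 127:
        return FailureSignal.COMMAND_NOT_RUN
    if (used_memory is not None and requested_memory is not None
            and used_memory >= requested_memory):
        return FailureSignal.MEMORY_EXCEEDED
    if (used_walltime is not None and requested_walltime is not None
            and used_walltime >= requested_walltime):
        return FailureSignal.WALLTIME_EXCEEDED
    # Nothing more specific could be determined.
    return FailureSignal.GENERIC_BATCH_FAILURE


print(classify_failure(1, used_memory=2048, requested_memory=2048).name)
print(classify_failure(127).name)
```

Missing usage figures are treated as "unknown" rather than zero, so a backend that reports no statistics falls through to the generic signal 124 instead of being misclassified.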

What other errors should be listed here?

Original issue reported on code.google.com by riccardo.murri@gmail.com on 22 Mar 2011 at 5:49

GoogleCodeExporter commented 9 years ago

Original comment by riccardo.murri@gmail.com on 22 Mar 2011 at 6:08

GoogleCodeExporter commented 9 years ago

Original comment by riccardo.murri@gmail.com on 1 Jul 2011 at 2:41

GoogleCodeExporter commented 9 years ago

Original comment by riccardo.murri@gmail.com on 17 Aug 2012 at 1:22