ewiger / gc3pie

Automatically exported from code.google.com/p/gc3pie

Batch backends should not ignore *all* errors from the "stat" commands #410

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
When a `squeue`/`qstat`/`bjobs` (a "stat" command in the following)
fails, we currently simply ignore the error and try again with the
`sacct`/`qacct`/`bacct` ("acct" command):

    gc3.gc3libs: DEBUG: SshTransport running `squeue --noheader -o %i|%T|%r -j 46743`...
    gc3.gc3libs: DEBUG: Executed command 'squeue --noheader -o %i|%T|%r -j 46743' on host 'baobab.unige.ch'; exit code: 127
    gc3.gc3libs: DEBUG: The `qstat`/`bjobs` command returned no job information; trying with 'sacct --noheader --parsable --format jobid,exitcode,ncpus,elapsed,totalcpu,submit,start,end,maxrss,maxvmsize -j 46743' instead ...

This is done on purpose, because the "stat" command typically fails
when the job is finished.  Still, this behavior can mask other errors
and prevent sensible reporting.

We would need to parse STDERR and/or check the exit code in order to
tell whether the "stat" command is failing because the job is done or
because some other error condition occurred.
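
A minimal sketch of such a check, assuming a hypothetical helper `classify_stat_failure` (not part of gc3libs) that receives the exit code and STDERR of the "stat" command. It relies only on the shell convention that exit code 127 means "command not found" (as in the `squeue` log above) and 126 means "found but not executable". Any other failure still falls through to the "acct" command, but STDERR is logged first.

```python
import logging

log = logging.getLogger("gc3.gc3libs")

# Exit codes set by POSIX shells themselves, not by the batch system.
SHELL_COMMAND_NOT_EXECUTABLE = 126
SHELL_COMMAND_NOT_FOUND = 127


def classify_stat_failure(exit_code, stderr):
    """Hypothetical helper: separate environment errors from "job is gone".

    Returns one of 'ok', 'environment-error' or 'maybe-finished'.
    """
    if exit_code == 0:
        return 'ok'
    if exit_code in (SHELL_COMMAND_NOT_EXECUTABLE, SHELL_COMMAND_NOT_FOUND):
        # The "stat" command never ran, so the "acct" command
        # is unlikely to tell us anything useful about the job.
        log.error("stat command could not be run (exit code %d): %s",
                  exit_code, stderr.strip())
        return 'environment-error'
    # Any other failure may simply mean the job has finished; keep the
    # STDERR text in the logs so that real errors are not silently lost.
    log.debug("stat command failed (exit code %d): %s",
              exit_code, stderr.strip())
    return 'maybe-finished'
```

Only the `'maybe-finished'` case would then trigger the fallback to the "acct" command.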

Original issue reported on code.google.com by riccardo.murri@gmail.com on 30 Aug 2013 at 3:59

GoogleCodeExporter commented 9 years ago
I would suggest to at least log the stderr when the command fails (also when
the job is not found).

S.

Original comment by sergio.m...@gmail.com on 30 Aug 2013 at 4:14

GoogleCodeExporter commented 9 years ago
| I would suggest to at least log the stderr when the command fails

Ok, I see you've already implemented this.

Handling the "job not found" case is trickier: we need to know what
each "acct" command does when a job is not found.  As usual, the exit
code does not seem to be used for discriminating error conditions; we
need to parse the output to see what happened:

* SGE 6.2u4 @ ocinh64:

        [rmurri@ocinh64 ~]$ qacct -j 999999; echo exitcode: $?
        error: job id 999999 not found
        exitcode: 1

* PBSPro 12 @ idsgi02:

        rmurri@idhydralogin01:~> qstat -x 9999999; echo exitcode: $?
        qstat: Unknown Job Id 9999999.idsgi02.uzh.ch
        exitcode: 153

* SLURM 2.5 @ login.gc3: (It seems that SLURM is the "bad" citizen here)

        rmurri@login1:~$ sacct --long -j 999999; echo exitcode: $?
               JobID    JobName  Partition  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks  AllocCPUS    Elapsed      State ExitCode AveCPUFreq ConsumedEnergy
        ------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- ---------- --------------
        exitcode: 0

* LSF @ brutus:

        (I can no longer connect to Brutus, so no LSF)

Then what happens in case of a wrong invocation or some other error?
Again, the exit code alone is not informative (except for PBSPro):

* SGE @ ocinh64:

        [rmurri@ocinh64 ~]$ qacct --foobar; echo exitcode: $?
        GE 6.2u4
        usage: qacct [options]
        [...]
        exitcode: 1

* PBSPro 12 @ idsgi02:

        rmurri@idhydralogin01:~> qstat --foobar; echo exitcode: $?
        qstat: invalid option -- '-'
        [...]
        exitcode: 2

* SLURM 2.5 @ login.gc3:

        $ sacct --foobar; echo exitcode: $?
        sacct: unrecognized option '--foobar'
        exitcode: 1
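
Based only on the transcripts above, a "job not found" check could be sketched as follows. The function name `acct_job_not_found` and the flavor keys are hypothetical, the patterns are copied from the outputs shown for these specific versions, and LSF is left out because no sample output is available. Since the transcripts do not show whether a message went to STDOUT or STDERR, both are searched. For SLURM the sketch assumes that every data row printed by `sacct` starts with a numeric job ID.

```python
import re

# Messages observed above (SGE 6.2u4, PBSPro 12); other versions may differ.
_NOT_FOUND_PATTERNS = {
    'sge': re.compile(r'error: job id \d+ not found'),
    'pbspro': re.compile(r'Unknown Job Id'),
}


def acct_job_not_found(flavor, exit_code, stdout, stderr):
    """Hypothetical helper: True if the output means "no such job"."""
    if flavor == 'slurm':
        # sacct exits with 0 and prints at most a header when the job is
        # unknown, so look for data rows instead of an error message.
        if exit_code != 0:
            return False
        # Assumption: data rows start with a numeric job ID.
        has_rows = any(line.strip()[:1].isdigit()
                       for line in stdout.splitlines())
        return not has_rows
    pattern = _NOT_FOUND_PATTERNS[flavor]
    if exit_code == 0:
        return False
    return pattern.search(stdout + stderr) is not None
```

With this, the wrong-invocation cases shown above (usage text from `qacct`, `invalid option` from `qstat`, `unrecognized option` from `sacct`) are not mistaken for "job not found" and can be reported as real errors.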

Original comment by riccardo.murri@gmail.com on 2 Sep 2013 at 11:12