ewiger / gc3pie

Automatically exported from code.google.com/p/gc3pie

LSF Backend is not parsing stat output correctly #454

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Submit a job to LSF head node
2. gstat to find out the status of the job

What version of the product are you using? On what operating system?
GC3Pie 2.2, on Linux Ubuntu 14.04

Raw stdout as passed to the _parse_stat_output(stdout) function in lsf.py:
--Start
Job <2073>, Job Name <GRunApplication.0>, User <markmon>, Project <default>, St
                          atus <EXIT>, Queue <normal>, Command <sh -c inputfile
                          .txt>
Mon Aug  4 12:28:51 2014: Submitted from host <pa64.dri.edu>, CWD <$HOME/.gc3pi
                          e_jobs/lrms_job.5gDDxlxcty>, Specified CWD <$HOME/.gc
                          3pie_jobs/lrms_job.5gDDxlxcty/.>, Output File (overwr
                          ite) <stdout.txt>, Error File (overwrite) <stderr.txt
                          >, Requested Resources <rusage[mem=2000]>, Login Shel
                          l </bin/sh>;

 RUNLIMIT                
 480.0 min of pa54.dri.edu
Mon Aug  4 12:28:51 2014: Started on <pa54.dri.edu>, Execution Home </home/mark
                          mon>, Execution CWD </home/markmon/.gc3pie_jobs/lrms_
                          job.5gDDxlxcty/.>;
Mon Aug  4 12:28:51 2014: Exited with exit code 127. The CPU time used is 0.1 s
                          econds.
Mon Aug  4 12:28:51 2014: Completed <exit>.

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[r15s:pg] rusage[mem=2000.00]
 Effective: select[type == local] order[r15s:pg] rusage[mem=2000.00] 
--End

The problem is in the _parse_stat_output code below. It is meant to rejoin the
wrapped lines of the output above, but it does not strip all of the
continuation indentation. The leftover spaces split words at each join, so
LsfLrms._status_re.search(stdout) fails to find the 'Status' field.

--Start
lines = [ ]
for line in stdout.split('\n'):
    if len(line) == 0:
        continue
    if line.startswith(LsfLrms._CONTINUATION_LINE_START):
        lines[-1] += line[len(LsfLrms._CONTINUATION_LINE_START):]
    else:
        lines.append(line)

# now rebuild stdout by joining the reconstructed lines
stdout = str.join('\n', lines)
--End

stdout after the above code executes. Note the break in the word 'Status'
('St     atus'), along with the other breaks in the text.

--Start
Job <2073>, Job Name <GRunApplication.0>, User <markmon>, Project <default>, St     atus <EXIT>, Queue <normal>, Command <sh -c inputfile     .txt>
Mon Aug  4 12:28:51 2014: Submitted from host <pa64.dri.edu>, CWD <$HOME/.gc3pi     e_jobs/lrms_job.5gDDxlxcty>, Specified CWD <$HOME/.gc     3pie_jobs/lrms_job.5gDDxlxcty/.>, Output File (overwr     ite) <stdout.txt>, Error File (overwrite) <stderr.txt     >, Requested Resources <rusage[mem=2000]>, Login Shel     l </bin/sh>;
 RUNLIMIT                
 480.0 min of pa54.dri.edu
Mon Aug  4 12:28:51 2014: Started on <pa54.dri.edu>, Execution Home </home/mark     mon>, Execution CWD </home/markmon/.gc3pie_jobs/lrms_     job.5gDDxlxcty/.>;
Mon Aug  4 12:28:51 2014: Exited with exit code 127. The CPU time used is 0.1 s     econds.
Mon Aug  4 12:28:51 2014: Completed <exit>.
 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  
 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[r15s:pg] rusage[mem=2000.00]
 Effective: select[type == local] order[r15s:pg] rusage[mem=2000.00]  
--End
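
The five leftover spaces are consistent with a 21-space prefix being stripped from continuation lines that are indented by 26 spaces, as they appear to be in the sample above. A minimal sketch that reproduces this in isolation follows. STATUS_RE is a hypothetical stand-in for LsfLrms._status_re (the real pattern may differ), and join_continuation_lines is an illustrative name wrapping the loop quoted above.

```python
import re

# Assumption: the old constant was 21 spaces, as the leftover gap suggests.
OLD_CONTINUATION_LINE_START = ' ' * 21
# Hypothetical stand-in for LsfLrms._status_re; the real pattern may differ.
STATUS_RE = re.compile(r'Status <(?P<state>[A-Z]+)>')


def join_continuation_lines(stdout, prefix):
    """Rejoin wrapped lines by stripping a fixed-width prefix, as in the report."""
    lines = []
    for line in stdout.split('\n'):
        if len(line) == 0:
            continue
        if line.startswith(prefix) and lines:
            lines[-1] += line[len(prefix):]
        else:
            lines.append(line)
    return '\n'.join(lines)


# First lines of the sample output; continuation lines are indented 26 spaces.
SAMPLE = (
    "Job <2073>, Job Name <GRunApplication.0>, User <markmon>, "
    "Project <default>, St\n"
    + " " * 26 + "atus <EXIT>, Queue <normal>, Command <sh -c inputfile\n"
    + " " * 26 + ".txt>\n"
)

broken = join_continuation_lines(SAMPLE, OLD_CONTINUATION_LINE_START)
fixed = join_continuation_lines(SAMPLE, ' ' * 26)

print(STATUS_RE.search(broken))                 # None: "St     atus" is still split
print(STATUS_RE.search(fixed).group('state'))   # EXIT
```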

Original issue reported on code.google.com by markjmon...@gmail.com on 4 Aug 2014 at 7:45

GoogleCodeExporter commented 9 years ago
Adding spaces to _CONTINUATION_LINE_START seems to have fixed the issue for
my LSF installation. Maybe a more robust way of parsing the LSF stat output
needs to be developed?

Old LsfLrms._CONTINUATION_LINE_START that errors:
    _CONTINUATION_LINE_START = '                     '
New LsfLrms._CONTINUATION_LINE_START that works:
    _CONTINUATION_LINE_START = '                          '
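
One possible shape for a more robust parser is sketched below. It treats any newline followed by a long run of spaces as a wrap point, so it does not depend on the exact indentation width. The function name unfold_stat_output and the threshold of 16 spaces are illustrative assumptions, not part of GC3Pie or of any documented LSF behaviour. The threshold only needs to sit between the table-row indentation seen in the sample (about 11 spaces) and the continuation indentation (26 spaces).

```python
import re

# Assumption: wrapped lines start with a long run of spaces (26 in the sample),
# while table rows such as the scheduling parameters use far fewer.
# The threshold of 16 is an illustrative choice, not a documented LSF value.
_CONTINUATION_RE = re.compile(r'\n {16,}')


def unfold_stat_output(stdout):
    """Join wrapped lines regardless of the exact indentation width."""
    unfolded = _CONTINUATION_RE.sub('', stdout)
    # Drop empty lines, as the original loop does.
    return '\n'.join(line for line in unfolded.split('\n') if line)


sample = (
    "Job <2073>, Project <default>, St\n"
    + " " * 26 + "atus <EXIT>, Queue <normal>\n"
    + "\n"
    + " RUNLIMIT\n"
    + " 480.0 min of pa54.dri.edu\n"
)
print(unfold_stat_output(sample))
```

Lines such as " RUNLIMIT" keep their single leading space and stay on their own line, because only indentation of 16 or more spaces is treated as a wrap. This assumes LSF breaks lines mid-word without inserting any extra characters, as the sample suggests.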

Original comment by markjmon...@gmail.com on 4 Aug 2014 at 7:58

GoogleCodeExporter commented 9 years ago
I believe this should be fixed in SVN r3985.

Could you please install or upgrade to the latest "trunk" version and
try?  If it solves the issue, please close the issue by setting the
status to "Fixed".

Many thanks for reporting, debugging the issue, and providing sample
`bjobs` output for a test case!

Original comment by riccardo.murri@gmail.com on 5 Aug 2014 at 11:39

GoogleCodeExporter commented 9 years ago
Update issue 454
Status: Fixed
I have tested the fix and it works.

Original comment by markjmon...@gmail.com on 11 Aug 2014 at 5:31

GoogleCodeExporter commented 9 years ago

Original comment by riccardo.murri@gmail.com on 11 Aug 2014 at 8:12