biouno / pbs-plugin

Jenkins PBS plug-in
http://biouno.org

Fatal when submitting jobs #4

Open mabahj opened 9 years ago

mabahj commented 9 years ago

I get a fatal error (console output below) when I try to submit jobs. The error message does not contain any error output. Environment: Jenkins 1.599, PBS Plug-in 0.2, master running on Windows 7, slaves on Linux, SGE grid. I can submit a job to SGE manually by copying and pasting the command line shown in the console output, and qsub accepts the command. However, the job then fails because the generated script (/temp/jenkins/pbs/jenkinsPBS_2918185274526175465/script) does not have write permission.

Error message:

Created working directory '/temp/jenkins/pbs/jenkinsPBS_2918185274526175465' with permissions 'rwx------'
PBS script: /temp/jenkins/pbs/jenkinsPBS_2918185274526175465/script
FATAL: Failed to submit job script with command line 'qsub -e /temp/jenkins/pbs/jenkinsPBS_2918185274526175465/err -o /temp/jenkins/pbs/jenkinsPBS_2918185274526175465/out /temp/jenkins/pbs/jenkinsPBS_2918185274526175465/script'. Error output: 
ERROR: Failed to submit job script with command line 'qsub -e /temp/jenkins/pbs/jenkinsPBS_2918185274526175465/err -o /temp/jenkins/pbs/jenkinsPBS_2918185274526175465/out /temp/jenkins/pbs/jenkinsPBS_2918185274526175465/script'. Error output: 
Finished: FAILURE

kinow commented 9 years ago

Hmmm, tricky part will be to reproduce this issue. All I have for testing is a VirtualBox/Vagrant PBS Torque box. Are you aware of some way to reproduce this issue with an environment similar to yours?

mabahj commented 9 years ago

Well, you could set up SGE, which is free, but I can't demand anything here. Another option would be to add some more logging output. I've enabled full logging in Jenkins, and the only PBS entry I see is this:

apr 17, 2015 8:45:42 AM FINE hudson.remoting.Channel
Received UserRequest:jenkins.plugins.pbs.tasks.Qsub@293c71

If you add some output to the log, then it should be easier to see what happens?
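In the meantime, the same information can be collected by hand on the slave. The following is only a rough sketch, not anything the plug-in does: `run_and_report` is a hypothetical helper that runs a command and prints its exit status, stdout, and stderr separately, which is the detail missing after "Error output:" in the console. The job directory is the one from the console output above and will differ for each build.

```shell
#!/bin/bash
# Hypothetical helper for manual diagnosis; not part of the PBS plug-in.
# Runs the given command and reports exit status, stdout and stderr separately.
run_and_report() {
  local out_file err_file status
  out_file=$(mktemp)
  err_file=$(mktemp)
  "$@" >"$out_file" 2>"$err_file"
  status=$?
  echo "exit status: $status"
  echo "stdout: $(cat "$out_file")"
  echo "stderr: $(cat "$err_file")"
  rm -f "$out_file" "$err_file"
  return "$status"
}

# Only attempt a submission where qsub is actually on the PATH.
if command -v qsub >/dev/null 2>&1; then
  # Directory copied from the console output; adjust for the failing build.
  JOB_DIR=/temp/jenkins/pbs/jenkinsPBS_2918185274526175465
  # Show the permission bits of the generated script before submitting it.
  ls -l "$JOB_DIR/script"
  run_and_report qsub -e "$JOB_DIR/err" -o "$JOB_DIR/out" "$JOB_DIR/script"
fi
```

Running this as the `jenkins` user over the same SSH connection the node uses should show whether qsub itself returns a non-zero status or writes anything to stderr in that environment.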

Job config:

<?xml version="1.0" encoding="UTF-8"?>
<project>
  <actions/>
  <description>https://groups.google.com/forum/#!topic/biouno-users/fWBUIOiWjUg

http://biouno.org/jenkins-plugins.html

https://github.com/biouno/pbs-plugin/releases</description>
  <keepDependencies>false</keepDependencies>
  <properties>
    <hudson.plugins.throttleconcurrents.ThrottleJobProperty plugin="throttle-concurrents@1.8.4">
      <maxConcurrentPerNode>0</maxConcurrentPerNode>
      <maxConcurrentTotal>0</maxConcurrentTotal>
      <categories>
        <string>slow_jobs</string>
      </categories>
      <throttleEnabled>false</throttleEnabled>
      <throttleOption>category</throttleOption>
    </hudson.plugins.throttleconcurrents.ThrottleJobProperty>
  </properties>
  <scm class="hudson.scm.NullSCM"/>
  <assignedNode>SGE</assignedNode>
  <canRoam>false</canRoam>
  <disabled>false</disabled>
  <blockBuildWhenDownstreamBuilding>false</blockBuildWhenDownstreamBuilding>
  <blockBuildWhenUpstreamBuilding>false</blockBuildWhenUpstreamBuilding>
  <triggers/>
  <concurrentBuild>false</concurrentBuild>
  <builders>
    <jenkins.plugins.pbs.PBSBuilder plugin="pbs@0.2">
      <script>#!/bin/bash
echo "=========================================="
echo "Sleeping on grid computer $(hostname)"
sleep 60
echo "Done"
echo "=========================================="</script>
    </jenkins.plugins.pbs.PBSBuilder>
  </builders>
  <publishers/>
  <buildWrappers/>
</project>

Node config:

<?xml version="1.0" encoding="UTF-8"?>
<jenkins.plugins.pbs.slaves.PBSSlave plugin="pbs@0.2">
  <name>SGE</name>
  <description>Son of Grid</description>
  <remoteFS>/work/jenkins/jenkins_test_grid_slave</remoteFS>
  <numExecutors>2</numExecutors>
  <mode>EXCLUSIVE</mode>
  <retentionStrategy class="hudson.slaves.RetentionStrategy$Always"/>
  <launcher class="hudson.plugins.sshslaves.SSHLauncher" plugin="ssh-slaves@1.9">
    <host>myhost</host>
    <port>22</port>
    <credentialsId>xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx</credentialsId>
    <maxNumRetries>0</maxNumRetries>
    <retryWaitTime>0</retryWaitTime>
  </launcher>
  <label/>
  <nodeProperties>
    <hudson.slaves.EnvironmentVariablesNodeProperty>
      <envVars serialization="custom">
        <unserializable-parents/>
        <tree-map>
          <default>
            <comparator class="hudson.util.CaseInsensitiveComparator"/>
          </default>
          <int>8</int>
          <string>GridEngRoot</string>
          <string>/cad/gnu/sge_test</string>
          <string>PATH</string>
          <string>/usr/bin:/usr/sbin:/bin:/usr/bin/X11:/usr/local/etc/jre/current/bin:/pri/jenkins/bin:/cad/gnu/sge_test/bin:/cad/gnu/sge_test/bin/lx-amd64</string>
          <string>SGE_ARCH</string>
          <string>lx-amd64</string>
          <string>SGE_CELL</string>
          <string>default</string>
          <string>SGE_CLUSTER_NAME</string>
          <string>sim1</string>
          <string>SGE_EXECD_PORT</string>
          <string>6445</string>
          <string>SGE_QMASTER_PORT</string>
          <string>6444</string>
          <string>SGE_ROOT</string>
          <string>/cad/gnu/sge_test</string>
        </tree-map>
      </envVars>
    </hudson.slaves.EnvironmentVariablesNodeProperty>
  </nodeProperties>
  <userId>jenkins</userId>
</jenkins.plugins.pbs.slaves.PBSSlave>

kinow commented 9 years ago

Note to self: test this Docker image when debugging this issue: https://registry.hub.docker.com/u/agaveapi/torque/

kinow commented 9 years ago

The Docker image worked. I tried it with a job configuration that comes with the container, and will try your job configuration next. I'll probably comment here while working on #9, either with what's wrong or with how you could get your setup working.