Closed zichner closed 8 years ago
Can't reproduce this:
(venv) jenkins@ip-172-31-29-42:~/toil$ TOIL_GRIDENGINE_PE=smp python HelloWorld.py --batchSystem gridEngine --disableCaching --logDebug ./jobStore
ip-172-31-29-42 2016-08-23 22:08:26,627 MainThread INFO toil.lib.bioio: Logging set at level: DEBUG
ip-172-31-29-42 2016-08-23 22:08:26,627 MainThread INFO toil.lib.bioio: Logging set at level: DEBUG
ip-172-31-29-42 2016-08-23 22:08:26,629 MainThread INFO toil.jobStores.fileJobStore: Path to job store directory is '/home/jenkins/toil/jobStore'.
ip-172-31-29-42 2016-08-23 22:08:26,629 MainThread INFO toil.jobStores.abstractJobStore: The workflow ID is: 'f77076fb-a621-4820-9689-5dfd23ce6d35'
ip-172-31-29-42 2016-08-23 22:08:26,631 MainThread INFO toil.common: Using the gridengine batch system
ip-172-31-29-42 2016-08-23 22:08:26,639 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([])
ip-172-31-29-42 2016-08-23 22:08:26,639 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s
ip-172-31-29-42 2016-08-23 22:08:26,639 MainThread INFO toil.common: Written the environment for the jobs to the environment file
ip-172-31-29-42 2016-08-23 22:08:26,640 MainThread INFO toil.common: Caching all jobs in job store
ip-172-31-29-42 2016-08-23 22:08:26,640 MainThread INFO toil.common: 0 jobs downloaded.
ip-172-31-29-42 2016-08-23 22:08:26,665 MainThread INFO toil.realtimeLogger: Real-time logging disabled
ip-172-31-29-42 2016-08-23 22:08:26,698 MainThread INFO toil.leader: (Re)building internal scheduler state
ip-172-31-29-42 2016-08-23 22:08:26,698 MainThread DEBUG toil.leader: Found job to run: 3/A/job_93tWo, with command: True, with checkpoint: False, with services: False, with stack: False
ip-172-31-29-42 2016-08-23 22:08:26,698 MainThread INFO toil.leader: Checked batch system has no running jobs and no updated jobs
ip-172-31-29-42 2016-08-23 22:08:26,698 MainThread INFO toil.leader: Found 1 jobs to start and 0 jobs with successors to run
ip-172-31-29-42 2016-08-23 22:08:26,699 MainThread INFO toil.leader: Starting the main loop
ip-172-31-29-42 2016-08-23 22:08:26,699 MainThread DEBUG toil.leader: Built the jobs list, currently have 1 jobs to update and 0 jobs issued
ip-172-31-29-42 2016-08-23 22:08:26,700 MainThread DEBUG toil.leader: Updating status of job: 3/A/job_93tWo with result status: 0
ip-172-31-29-42 2016-08-23 22:08:26,700 MainThread DEBUG toil.batchSystems.gridengine: Issued the job command: /home/jenkins/toil/venv/bin/_toil_worker /home/jenkins/toil/jobStore 3/A/job_93tWo with job id: 0
ip-172-31-29-42 2016-08-23 22:08:26,700 MainThread DEBUG toil.leader: Issued job with job store ID: 3/A/job_93tWo and job batch system ID: 0 and cores: 2.00, disk: 3221225472.00, and memory: 2147483648.00
ip-172-31-29-42 2016-08-23 22:08:27,641 Thread-1 DEBUG toil.batchSystems.gridengine: Running ['qsub', '-b', 'y', '-terse', '-j', 'y', '-cwd', '-o', '/dev/null', '-e', '/dev/null', '-N', 'toil_job_0', '-hard', '-l', 'vf=2097152K,h_vmem=2097152K', '-pe', 'smp', '2', '/home/jenkins/toil/venv/bin/_toil_worker /home/jenkins/toil/jobStore 3/A/job_93tWo']
ip-172-31-29-42 2016-08-23 22:08:27,649 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
ip-172-31-29-42 2016-08-23 22:08:27,703 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
ip-172-31-29-42 2016-08-23 22:08:27,708 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo: ---TOIL WORKER OUTPUT LOG---
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo: Next available file descriptor: 5
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo: DEBUG:toil.worker:Next available file descriptor: 5
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo: Parsed jobWrapper
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo: DEBUG:toil.worker:Parsed jobWrapper
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo: Got a command to run: _toil 3/A/job_93tWo/g/tmpXJtNXK.tmp /home/jenkins/toil HelloWorld
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo: DEBUG:toil.worker:Got a command to run: _toil 3/A/job_93tWo/g/tmpXJtNXK.tmp /home/jenkins/toil HelloWorld
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo: DEBUG:toil.job:Loading user module ModuleDescriptor(dirPath='/home/jenkins/toil', name='HelloWorld').
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo: WARNING:toil.resource:Can't find resource for leader path '/home/jenkins/toil'
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo: WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/home/jenkins/toil', name='HelloWorld')
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo: WARNING:toil.resource:Can't globalize module ModuleDescriptor(dirPath='/home/jenkins/toil', name='HelloWorld').
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo: DEBUG:toil.job:Getting FunctionWrappingJob from module toil.job.
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo: DEBUG:toil.job:Getting defaultdict from module collections.
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo: DEBUG:toil.job:Getting list from module __builtin__.
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo: DEBUG:toil.job:Getting ModuleDescriptor from module toil.resource.
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo: DEBUG:toil.job:Getting set from module __builtin__.
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo: DEBUG:toil.job:Loading user function helloWorld from module ModuleDescriptor(dirPath='/home/jenkins/toil', name='HelloWorld').
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo: WARNING:toil.resource:Can't find resource for leader path '/home/jenkins/toil'
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo: WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/home/jenkins/toil', name='HelloWorld')
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo: Stopping running chain of jobs: length of stack: 0, services: 0, checkpoint: False
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo: DEBUG:toil.worker:Stopping running chain of jobs: length of stack: 0, services: 0, checkpoint: False
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo: Worker log can be found at /tmp/2.1.all.q/toil-f77076fb-a621-4820-9689-5dfd23ce6d35/tmpoPNV6y. Set --cleanWorkDir to retain this log
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo: INFO:toil.worker:Worker log can be found at /tmp/2.1.all.q/toil-f77076fb-a621-4820-9689-5dfd23ce6d35/tmpoPNV6y. Set --cleanWorkDir to retain this log
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo: Finished running the chain of jobs on this node, we ran for a total of 0.002031 seconds
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo: INFO:toil.worker:Finished running the chain of jobs on this node, we ran for a total of 0.002031 seconds
ip-172-31-29-42 2016-08-23 22:08:28,709 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
ip-172-31-29-42 2016-08-23 22:08:28,715 Thread-1 DEBUG toil.batchSystems.gridengine: Exit Status: '0'
ip-172-31-29-42 2016-08-23 22:08:28,716 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([])
ip-172-31-29-42 2016-08-23 22:08:28,716 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s
ip-172-31-29-42 2016-08-23 22:08:28,716 MainThread DEBUG toil.batchSystems.gridengine: UpdatedJobsQueue Item: (0, 0)
ip-172-31-29-42 2016-08-23 22:08:28,716 MainThread DEBUG toil.leader: Batch system is reporting that the jobWrapper with batch system ID: 0 and jobWrapper store ID: 3/A/job_93tWo ended successfully
ip-172-31-29-42 2016-08-23 22:08:28,717 MainThread INFO toil.leader: No jobs left to run so exiting.
ip-172-31-29-42 2016-08-23 22:08:28,717 MainThread INFO toil.leader: Finished the main loop
ip-172-31-29-42 2016-08-23 22:08:28,717 MainThread INFO toil.leader: Waiting for stats and logging collator thread to finish ...
ip-172-31-29-42 2016-08-23 22:08:29,206 MainThread INFO toil.leader: ... finished collating stats and logs. Took 0.489470005035 seconds
ip-172-31-29-42 2016-08-23 22:08:29,207 MainThread INFO toil.leader: Waiting for service manager thread to finish ...
ip-172-31-29-42 2016-08-23 22:08:29,699 Thread-2 DEBUG toil.leader: Received signal to quit starting services.
ip-172-31-29-42 2016-08-23 22:08:29,699 MainThread INFO toil.leader: ... finished shutting down the service manager. Took 0.492269039154 seconds
ip-172-31-29-42 2016-08-23 22:08:29,699 MainThread INFO toil.leader: Finished toil run successfully
ip-172-31-29-42 2016-08-23 22:08:29,699 MainThread DEBUG toil.common: Shutting down batch system ...
ip-172-31-29-42 2016-08-23 22:08:29,717 Thread-1 DEBUG toil.batchSystems.gridengine: Received queue sentinel.
ip-172-31-29-42 2016-08-23 22:08:29,718 MainThread DEBUG toil.common: ... finished shutting down the batch system in 0.0180327892303 seconds.
ip-172-31-29-42 2016-08-23 22:08:29,718 MainThread INFO toil.common: Attempting to delete the job store
ip-172-31-29-42 2016-08-23 22:08:29,718 MainThread INFO toil.common: Successfully deleted the job store
Hello, world!, here's a message: You did it!
Same commit. Only difference is that I need to set TOIL_GRIDENGINE_PE=smp which you might also want to try.
Looking at the difference between your log and mine, I think the problem is in getJobExitCode
.
Can you apply this patch (or make the change manually) and run again:
diff --git a/src/toil/batchSystems/gridengine.py b/src/toil/batchSystems/gridengine.py
index 02c2f39..6849c20 100644
--- a/src/toil/batchSystems/gridengine.py
+++ b/src/toil/batchSystems/gridengine.py
@@ -189,6 +189,7 @@ class Worker(Thread):
args = ["qacct", "-j", str(job)]
if task is not None:
args.extend(["-t", str(task)])
+ logger.debug("Running %r", args)
process = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
for line in process.stdout:
if line.startswith("failed") and int(line.split()[1]) == 1:
While HelloWorld.py
is hanging at the end, can you run the qacct
command line that the added debug statement prints out and post its output here?
Thank you very much for your input!
The problem was indeed related to qacct
. On our SGE, qacct
was not properly set up so that qacct -j 123
returned error: job id 123 not found
(even after the job finished).
With qacct
working correctly, Toil is running properly.
Thanks again!
Glad it works now.
I want to get that logging statement in permanently. Reopening for that reason.
Using the latest git version of Toil (last commit: dbbd4ed) with a gridengine batch system the HelloWorld pipeline runs, but does not terminate:
The last two lines get printed again and again. After ~15min, I killed the pipeline.
Do you have any idea what the problem might be?
Thank you very much for you help and the great work!