DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

HelloWorld pipeline doesn't terminate when using gridengine batch system #1114

Closed zichner closed 8 years ago

zichner commented 8 years ago

Using the latest git version of Toil (last commit: dbbd4ed) with a gridengine batch system, the HelloWorld pipeline runs but does not terminate:

> python HelloWorld.py --batchSystem gridEngine --disableCaching --logDebug ./jobStore
pc 2016-08-23 17:04:28,681 MainThread INFO toil.lib.bioio: Logging set at level: DEBUG
pc 2016-08-23 17:04:28,682 MainThread INFO toil.lib.bioio: Logging set at level: DEBUG
pc 2016-08-23 17:04:28,724 MainThread INFO toil.jobStores.fileJobStore: Path to job store directory is '/data/cwl-test/jobStore'.
pc 2016-08-23 17:04:28,728 MainThread INFO toil.jobStores.abstractJobStore: The workflow ID is: '49203713-60f2-4bb0-87b6-b88d0bd7d16a'
pc 2016-08-23 17:04:28,767 MainThread INFO toil.common: Using the gridengine batch system
pc 2016-08-23 17:04:28,853 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([])
pc 2016-08-23 17:04:28,855 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s
pc 2016-08-23 17:04:28,868 MainThread INFO toil.common: Written the environment for the jobs to the environment file
pc 2016-08-23 17:04:28,869 MainThread INFO toil.common: Caching all jobs in job store
pc 2016-08-23 17:04:28,873 MainThread INFO toil.common: 0 jobs downloaded.
pc 2016-08-23 17:04:28,932 MainThread INFO toil.realtimeLogger: Real-time logging disabled
pc 2016-08-23 17:04:29,601 MainThread INFO toil.leader: (Re)building internal scheduler state
pc 2016-08-23 17:04:29,601 MainThread DEBUG toil.leader: Found job to run: V/y/jobNNHKKt, with command: True, with checkpoint: False, with  services: False, with stack: False
pc 2016-08-23 17:04:29,603 MainThread INFO toil.leader: Checked batch system has no running jobs and no updated jobs
pc 2016-08-23 17:04:29,603 MainThread INFO toil.leader: Found 1 jobs to start and 0 jobs with successors to run
pc 2016-08-23 17:04:29,604 MainThread INFO toil.leader: Starting the main loop
pc 2016-08-23 17:04:29,604 MainThread DEBUG toil.leader: Built the jobs list, currently have 1 jobs to update and 0 jobs issued
pc 2016-08-23 17:04:29,605 MainThread DEBUG toil.leader: Updating status of job: V/y/jobNNHKKt with result status: 0
pc 2016-08-23 17:04:29,608 MainThread DEBUG toil.batchSystems.gridengine: Issued the job command: /data/cwl-test/toil_git/bin/_toil_worker /data/cwl-test/jobStore V/y/jobNNHKKt with job id: 0
pc 2016-08-23 17:04:29,609 MainThread DEBUG toil.leader: Issued job with job store ID: V/y/jobNNHKKt and job batch system ID: 0 and cores: 1.00, disk: 3221225472.00, and memory: 2147483648.00
pc 2016-08-23 17:04:29,856 Thread-1 DEBUG toil.batchSystems.gridengine: Running ['qsub', '-b', 'y', '-terse', '-j', 'y', '-cwd', '-o', '/dev/null', '-e', '/dev/null', '-N', 'toil_job_0', '-hard', '-l', 'vf=2097152K,h_vmem=2097152K', '/data/cwl-test/toil_git/bin/_toil_worker /data/cwl-test/jobStore V/y/jobNNHKKt']
pc 2016-08-23 17:04:29,922 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
pc 2016-08-23 17:04:30,853 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
pc 2016-08-23 17:04:31,476 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s
pc 2016-08-23 17:04:32,478 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
pc 2016-08-23 17:04:33,325 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s
pc 2016-08-23 17:04:34,326 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
pc 2016-08-23 17:04:35,200 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s
pc 2016-08-23 17:04:36,201 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
pc 2016-08-23 17:04:37,085 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s
pc 2016-08-23 17:04:38,087 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
pc 2016-08-23 17:04:38,976 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s
pc 2016-08-23 17:04:39,786 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    ---TOIL WORKER OUTPUT LOG---
pc 2016-08-23 17:04:39,787 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    Next available file descriptor: 5
pc 2016-08-23 17:04:39,787 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    DEBUG:toil.worker:Next available file descriptor: 5
pc 2016-08-23 17:04:39,787 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    Parsed jobWrapper
pc 2016-08-23 17:04:39,788 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    DEBUG:toil.worker:Parsed jobWrapper
pc 2016-08-23 17:04:39,788 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    Got a command to run: _toil V/y/jobNNHKKt/g/tmpqcMTm_.tmp /data/cwl-test HelloWorld
pc 2016-08-23 17:04:39,788 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    DEBUG:toil.worker:Got a command to run: _toil V/y/jobNNHKKt/g/tmpqcMTm_.tmp /data/cwl-test HelloWorld
pc 2016-08-23 17:04:39,788 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    DEBUG:toil.job:Loading user module ModuleDescriptor(dirPath='/data/cwl-test', name='HelloWorld').
pc 2016-08-23 17:04:39,789 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    WARNING:toil.resource:Can't find resource for leader path '/data/cwl-test'
pc 2016-08-23 17:04:39,789 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/data/cwl-test', name='HelloWorld')
pc 2016-08-23 17:04:39,789 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    WARNING:toil.resource:Can't globalize module ModuleDescriptor(dirPath='/data/cwl-test', name='HelloWorld').
pc 2016-08-23 17:04:39,789 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    DEBUG:toil.job:Getting FunctionWrappingJob from module toil.job.
pc 2016-08-23 17:04:39,790 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    DEBUG:toil.job:Getting defaultdict from module collections.
pc 2016-08-23 17:04:39,790 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    DEBUG:toil.job:Getting list from module __builtin__.
pc 2016-08-23 17:04:39,790 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    DEBUG:toil.job:Getting ModuleDescriptor from module toil.resource.
pc 2016-08-23 17:04:39,790 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    DEBUG:toil.job:Getting set from module __builtin__.
pc 2016-08-23 17:04:39,791 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    DEBUG:toil.job:Loading user function helloWorld from module ModuleDescriptor(dirPath='/data/cwl-test', name='HelloWorld').
pc 2016-08-23 17:04:39,791 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    WARNING:toil.resource:Can't find resource for leader path '/data/cwl-test'
pc 2016-08-23 17:04:39,791 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/data/cwl-test', name='HelloWorld')
pc 2016-08-23 17:04:39,791 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    Stopping running chain of jobs: length of stack: 0, services: 0, checkpoint: False
pc 2016-08-23 17:04:39,791 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    DEBUG:toil.worker:Stopping running chain of jobs: length of stack: 0, services: 0, checkpoint: False
pc 2016-08-23 17:04:39,792 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    Worker log can be found at /tmp/1126737.1.all.q/toil-49203713-60f2-4bb0-87b6-b88d0bd7d16a/tmpn7e5BJ. Set --cleanWorkDir to retain this log
pc 2016-08-23 17:04:39,792 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    INFO:toil.worker:Worker log can be found at /tmp/1126737.1.all.q/toil-49203713-60f2-4bb0-87b6-b88d0bd7d16a/tmpn7e5BJ. Set --cleanWorkDir to retain this log
pc 2016-08-23 17:04:39,792 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    Finished running the chain of jobs on this node, we ran for a total of 0.052960 seconds
pc 2016-08-23 17:04:39,792 Thread-3 INFO toil.leader: V/y/jobNNHKKt:    INFO:toil.worker:Finished running the chain of jobs on this node, we ran for a total of 0.052960 seconds
pc 2016-08-23 17:04:39,978 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
pc 2016-08-23 17:04:40,862 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s
pc 2016-08-23 17:04:41,864 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
pc 2016-08-23 17:04:42,771 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s
pc 2016-08-23 17:04:43,773 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
pc 2016-08-23 17:04:44,651 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s

The last two lines are printed over and over. After ~15 min, I killed the pipeline.

Do you have any idea what the problem might be?

Thank you very much for your help and the great work!

hannes-ucsc commented 8 years ago

Can't reproduce this:

(venv) jenkins@ip-172-31-29-42:~/toil$ TOIL_GRIDENGINE_PE=smp python HelloWorld.py --batchSystem gridEngine --disableCaching --logDebug ./jobStore
ip-172-31-29-42 2016-08-23 22:08:26,627 MainThread INFO toil.lib.bioio: Logging set at level: DEBUG
ip-172-31-29-42 2016-08-23 22:08:26,627 MainThread INFO toil.lib.bioio: Logging set at level: DEBUG
ip-172-31-29-42 2016-08-23 22:08:26,629 MainThread INFO toil.jobStores.fileJobStore: Path to job store directory is '/home/jenkins/toil/jobStore'.
ip-172-31-29-42 2016-08-23 22:08:26,629 MainThread INFO toil.jobStores.abstractJobStore: The workflow ID is: 'f77076fb-a621-4820-9689-5dfd23ce6d35'
ip-172-31-29-42 2016-08-23 22:08:26,631 MainThread INFO toil.common: Using the gridengine batch system
ip-172-31-29-42 2016-08-23 22:08:26,639 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([])
ip-172-31-29-42 2016-08-23 22:08:26,639 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s
ip-172-31-29-42 2016-08-23 22:08:26,639 MainThread INFO toil.common: Written the environment for the jobs to the environment file
ip-172-31-29-42 2016-08-23 22:08:26,640 MainThread INFO toil.common: Caching all jobs in job store
ip-172-31-29-42 2016-08-23 22:08:26,640 MainThread INFO toil.common: 0 jobs downloaded.
ip-172-31-29-42 2016-08-23 22:08:26,665 MainThread INFO toil.realtimeLogger: Real-time logging disabled
ip-172-31-29-42 2016-08-23 22:08:26,698 MainThread INFO toil.leader: (Re)building internal scheduler state
ip-172-31-29-42 2016-08-23 22:08:26,698 MainThread DEBUG toil.leader: Found job to run: 3/A/job_93tWo, with command: True, with checkpoint: False, with  services: False, with stack: False
ip-172-31-29-42 2016-08-23 22:08:26,698 MainThread INFO toil.leader: Checked batch system has no running jobs and no updated jobs
ip-172-31-29-42 2016-08-23 22:08:26,698 MainThread INFO toil.leader: Found 1 jobs to start and 0 jobs with successors to run
ip-172-31-29-42 2016-08-23 22:08:26,699 MainThread INFO toil.leader: Starting the main loop
ip-172-31-29-42 2016-08-23 22:08:26,699 MainThread DEBUG toil.leader: Built the jobs list, currently have 1 jobs to update and 0 jobs issued
ip-172-31-29-42 2016-08-23 22:08:26,700 MainThread DEBUG toil.leader: Updating status of job: 3/A/job_93tWo with result status: 0
ip-172-31-29-42 2016-08-23 22:08:26,700 MainThread DEBUG toil.batchSystems.gridengine: Issued the job command: /home/jenkins/toil/venv/bin/_toil_worker /home/jenkins/toil/jobStore 3/A/job_93tWo with job id: 0 
ip-172-31-29-42 2016-08-23 22:08:26,700 MainThread DEBUG toil.leader: Issued job with job store ID: 3/A/job_93tWo and job batch system ID: 0 and cores: 2.00, disk: 3221225472.00, and memory: 2147483648.00
ip-172-31-29-42 2016-08-23 22:08:27,641 Thread-1 DEBUG toil.batchSystems.gridengine: Running ['qsub', '-b', 'y', '-terse', '-j', 'y', '-cwd', '-o', '/dev/null', '-e', '/dev/null', '-N', 'toil_job_0', '-hard', '-l', 'vf=2097152K,h_vmem=2097152K', '-pe', 'smp', '2', '/home/jenkins/toil/venv/bin/_toil_worker /home/jenkins/toil/jobStore 3/A/job_93tWo']
ip-172-31-29-42 2016-08-23 22:08:27,649 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
ip-172-31-29-42 2016-08-23 22:08:27,703 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
ip-172-31-29-42 2016-08-23 22:08:27,708 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo:    ---TOIL WORKER OUTPUT LOG---
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo:    Next available file descriptor: 5
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo:    DEBUG:toil.worker:Next available file descriptor: 5
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo:    Parsed jobWrapper
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo:    DEBUG:toil.worker:Parsed jobWrapper
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo:    Got a command to run: _toil 3/A/job_93tWo/g/tmpXJtNXK.tmp /home/jenkins/toil HelloWorld
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo:    DEBUG:toil.worker:Got a command to run: _toil 3/A/job_93tWo/g/tmpXJtNXK.tmp /home/jenkins/toil HelloWorld
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo:    DEBUG:toil.job:Loading user module ModuleDescriptor(dirPath='/home/jenkins/toil', name='HelloWorld').
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo:    WARNING:toil.resource:Can't find resource for leader path '/home/jenkins/toil'
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo:    WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/home/jenkins/toil', name='HelloWorld')
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo:    WARNING:toil.resource:Can't globalize module ModuleDescriptor(dirPath='/home/jenkins/toil', name='HelloWorld').
ip-172-31-29-42 2016-08-23 22:08:28,203 Thread-3 INFO toil.leader: 3/A/job_93tWo:    DEBUG:toil.job:Getting FunctionWrappingJob from module toil.job.
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo:    DEBUG:toil.job:Getting defaultdict from module collections.
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo:    DEBUG:toil.job:Getting list from module __builtin__.
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo:    DEBUG:toil.job:Getting ModuleDescriptor from module toil.resource.
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo:    DEBUG:toil.job:Getting set from module __builtin__.
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo:    DEBUG:toil.job:Loading user function helloWorld from module ModuleDescriptor(dirPath='/home/jenkins/toil', name='HelloWorld').
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo:    WARNING:toil.resource:Can't find resource for leader path '/home/jenkins/toil'
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo:    WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/home/jenkins/toil', name='HelloWorld')
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo:    Stopping running chain of jobs: length of stack: 0, services: 0, checkpoint: False
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo:    DEBUG:toil.worker:Stopping running chain of jobs: length of stack: 0, services: 0, checkpoint: False
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo:    Worker log can be found at /tmp/2.1.all.q/toil-f77076fb-a621-4820-9689-5dfd23ce6d35/tmpoPNV6y. Set --cleanWorkDir to retain this log
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo:    INFO:toil.worker:Worker log can be found at /tmp/2.1.all.q/toil-f77076fb-a621-4820-9689-5dfd23ce6d35/tmpoPNV6y. Set --cleanWorkDir to retain this log
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo:    Finished running the chain of jobs on this node, we ran for a total of 0.002031 seconds
ip-172-31-29-42 2016-08-23 22:08:28,204 Thread-3 INFO toil.leader: 3/A/job_93tWo:    INFO:toil.worker:Finished running the chain of jobs on this node, we ran for a total of 0.002031 seconds
ip-172-31-29-42 2016-08-23 22:08:28,709 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([0])
ip-172-31-29-42 2016-08-23 22:08:28,715 Thread-1 DEBUG toil.batchSystems.gridengine: Exit Status: '0'
ip-172-31-29-42 2016-08-23 22:08:28,716 Thread-1 DEBUG toil.batchSystems.gridengine: List of running jobs: set([])
ip-172-31-29-42 2016-08-23 22:08:28,716 Thread-1 DEBUG toil.batchSystems.gridengine: No activity, sleeping for 1s
ip-172-31-29-42 2016-08-23 22:08:28,716 MainThread DEBUG toil.batchSystems.gridengine: UpdatedJobsQueue Item: (0, 0)
ip-172-31-29-42 2016-08-23 22:08:28,716 MainThread DEBUG toil.leader: Batch system is reporting that the jobWrapper with batch system ID: 0 and jobWrapper store ID: 3/A/job_93tWo ended successfully
ip-172-31-29-42 2016-08-23 22:08:28,717 MainThread INFO toil.leader: No jobs left to run so exiting.
ip-172-31-29-42 2016-08-23 22:08:28,717 MainThread INFO toil.leader: Finished the main loop
ip-172-31-29-42 2016-08-23 22:08:28,717 MainThread INFO toil.leader: Waiting for stats and logging collator thread to finish ...
ip-172-31-29-42 2016-08-23 22:08:29,206 MainThread INFO toil.leader: ... finished collating stats and logs. Took 0.489470005035 seconds
ip-172-31-29-42 2016-08-23 22:08:29,207 MainThread INFO toil.leader: Waiting for service manager thread to finish ...
ip-172-31-29-42 2016-08-23 22:08:29,699 Thread-2 DEBUG toil.leader: Received signal to quit starting services.
ip-172-31-29-42 2016-08-23 22:08:29,699 MainThread INFO toil.leader: ... finished shutting down the service manager. Took 0.492269039154 seconds
ip-172-31-29-42 2016-08-23 22:08:29,699 MainThread INFO toil.leader: Finished toil run successfully
ip-172-31-29-42 2016-08-23 22:08:29,699 MainThread DEBUG toil.common: Shutting down batch system ...
ip-172-31-29-42 2016-08-23 22:08:29,717 Thread-1 DEBUG toil.batchSystems.gridengine: Received queue sentinel.
ip-172-31-29-42 2016-08-23 22:08:29,718 MainThread DEBUG toil.common: ... finished shutting down the batch system in 0.0180327892303 seconds.
ip-172-31-29-42 2016-08-23 22:08:29,718 MainThread INFO toil.common: Attempting to delete the job store
ip-172-31-29-42 2016-08-23 22:08:29,718 MainThread INFO toil.common: Successfully deleted the job store
Hello, world!, here's a message: You did it!

Same commit. The only difference is that I need to set TOIL_GRIDENGINE_PE=smp, which you might also want to try.
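For readers comparing the two logs: the two qsub argument lists differ only in the trailing '-pe', 'smp', '2' and in the worker command. The sketch below rebuilds those logged lists. It is not Toil's actual code. The helper name build_qsub_args and its parameters are hypothetical, and the rule "append -pe only when a parallel environment name is given" is an assumption inferred from the two logs.

```python
import os


def build_qsub_args(job_id, command, memory_bytes, cores, pe=None):
    """Hypothetical helper that rebuilds the qsub argument list seen in the logs."""
    mem_k = memory_bytes // 1024  # the logs express memory in kilobytes
    args = ["qsub", "-b", "y", "-terse", "-j", "y", "-cwd",
            "-o", "/dev/null", "-e", "/dev/null",
            "-N", "toil_job_%d" % job_id,
            "-hard", "-l", "vf=%dK,h_vmem=%dK" % (mem_k, mem_k)]
    if pe:
        # Assumption: the parallel environment is requested only when a name is set.
        args.extend(["-pe", pe, str(cores)])
    args.append(command)
    return args


# The environment variable name is taken from the thread; it may be unset.
pe_name = os.environ.get("TOIL_GRIDENGINE_PE")
```

With pe="smp" and cores=2 this yields the list from the second log, and with pe=None it yields the list from the first one.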

hannes-ucsc commented 8 years ago

Looking at the difference between your log and mine, I think the problem is in getJobExitCode.

Can you apply this patch (or make the change manually) and run again:

diff --git a/src/toil/batchSystems/gridengine.py b/src/toil/batchSystems/gridengine.py
index 02c2f39..6849c20 100644
--- a/src/toil/batchSystems/gridengine.py
+++ b/src/toil/batchSystems/gridengine.py
@@ -189,6 +189,7 @@ class Worker(Thread):
         args = ["qacct", "-j", str(job)]
         if task is not None:
             args.extend(["-t", str(task)])
+        logger.debug("Running %r", args)
         process = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
         for line in process.stdout:
             if line.startswith("failed") and int(line.split()[1]) == 1:

While HelloWorld.py is hanging at the end, can you run the qacct command line that the added debug statement prints out and post its output here?

zichner commented 8 years ago

Thank you very much for your input! The problem was indeed related to qacct. On our SGE, qacct was not set up properly, so qacct -j 123 returned error: job id 123 not found, even after the job had finished.

With qacct working correctly, Toil is running properly.

Thanks again!
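The failure mode above can be illustrated with a small parser. This is a sketch, not the code in gridengine.py. The function name parse_qacct_output is hypothetical, and the "exit_status" field name is an assumption about qacct's output format; only the "failed" check appears in the patch. When qacct reports only an error, no exit status is found, so the caller presumably keeps treating the job as running, which would match the endless "List of running jobs: set([0])" lines.

```python
def parse_qacct_output(text):
    """Return the job's exit status, or None if qacct has no record of the job.

    Hypothetical helper: mirrors the 'failed' check from the patch and assumes
    an 'exit_status' line in the accounting record.
    """
    exit_status = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0] == "failed" and fields[1] == "1":
            return 1  # same condition as in the patched loop
        if fields[0] == "exit_status":
            exit_status = int(fields[1])
    return exit_status


# The error message reported in this thread yields no exit status at all.
print(parse_qacct_output("error: job id 123 not found"))
```

A caller that keeps polling while the result is None would never see the job end on a cluster where qacct is misconfigured, which is why logging the exact qacct command line makes this case easier to diagnose.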

hannes-ucsc commented 8 years ago

Glad it works now.

I want to get that logging statement in permanently. Reopening for that reason.