DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Toil with gridEngine error #2511

Closed phanikishore2 closed 2 years ago

phanikishore2 commented 5 years ago

I get the following error when running on Grid Engine. I am running the example CWL workflow from the documentation: https://toil.readthedocs.io/en/latest/gettingStarted/quickStart.html#cwlquickstart

Using toil version 3.18.0.

(venv) pkd7@aether-qsub1:~$ cwltoil --batchSystem=gridengine --jobStore /home/pkd7/trail6 --logDebug example.cwl example-job.yaml
DEBUG:toil.lib.bioio:Root logger is at level 'DEBUG', 'toil' logger at level 'DEBUG'.
DEBUG:toil.jobStores.fileJobStore:Path to job store directory is '/home/pkd7/trail6'.
DEBUG:toil.jobStores.abstractJobStore:The workflow ID is: 'e56aec97-47fe-4e2b-8319-0689a16b3727'
INFO:cwltool:Resolved 'example.cwl' to 'file:///home/pkd7/example.cwl'
DEBUG:toil.resource:Module dir is /home/pkd7/venv/lib/python2.7/site-packages
DEBUG:toil.common:Using the grid engine batch system
WARNING:toil.batchSystems.singleMachine:Limiting maxMemory to physically available memory (33736310784).
DEBUG:toil.common:Obtained node ID 5f570a62d818f48e77a178795980c34f from file /var/lib/dbus/machine-id
DEBUG:toil.common:Created the workflow directory at /tmp/toil-e56aec97-47fe-4e2b-8319-0689a16b3727-5f570a62d818f48e77a178795980c34f
WARNING:toil.batchSystems.singleMachine:Limiting maxDisk to physically available disk (4702879744).
DEBUG:toil.batchSystems.singleMachine:Setting up the thread pool with 80 workers, given a minimum CPU fraction of 0.100000 and a maximum CPU value of 8.
DEBUG:toil.common:User script ModuleDescriptor(dirPath='/home/pkd7/venv/lib/python2.7/site-packages', name='toil.cwl.cwltoil', fromVirtualEnv=True) belongs to Toil. No need to auto-deploy it.
DEBUG:toil.common:No user script to auto-deploy.
DEBUG:toil.common:Written the environment for the jobs to the environment file
DEBUG:toil.common:Caching all jobs in job store
DEBUG:toil.common:0 jobs downloaded.
INFO:toil:Running Toil version 3.18.0-84239d802248a5f4a220e762b3b8ce5cc92af0be.
DEBUG:toil:Configuration: {'maxLocalJobs': 8, 'rescueJobsFrequency': 3600, 'logLevel': 'DEBUG', 'minNodes': None, 'targetTime': 1800, 'jobStore': 'file:/home/pkd7/trail6', 'linkImports': True, 'manualMemArgs': False, 'forceDockerAppliance': False, 'nodeOptions': None, 'nodeTypes': [], 'servicePollingInterval': 60, 'workDir': None, 'stats': False, 'disableCaching': True, 'maxPreemptableServiceJobs': 9223372036854775807, 'environment': {}, 'parasolMaxBatches': 10000, 'cleanWorkDir': 'always', 'disableChaining': False, 'maxCores': 9223372036854775807, 'sseKey': None, 'maxMemory': 9223372036854775807, 'maxDisk': 9223372036854775807, 'cwl': False, 'scaleInterval': 60, 'deadlockWait': 60, 'defaultPreemptable': False, 'clusterStats': None, 'defaultCores': 1, 'cseKey': None, 'betaInertia': 0.1, 'metrics': False, 'maxNodes': [10], 'scale': 1, 'writeLogs': None, 'disableAutoDeployment': False, 'badWorker': 0.0, 'defaultDisk': 2147483648, 'mesosMasterAddress': '172.24.221.238:5050', 'restart': False, 'useAsync': True, 'preemptableCompensation': 0.0, 'parasolCommand': 'parasol', 'workflowID': 'e56aec97-47fe-4e2b-8319-0689a16b3727', 'maxServiceJobs': 9223372036854775807, 'readGlobalFileMutableByDefault': False, 'badWorkerFailInterval': 0.01, 'statePollingWait': 1, 'debugWorker': False, 'maxLogFileSize': 64000, 'defaultMemory': 2147483648, 'workflowAttemptNumber': 0, 'maxJobDuration': 9223372036854775807, 'clean': 'onSuccess', 'provisioner': None, 'batchSystem': 'gridengine', 'retryCount': 1, 'writeLogsGzip': None, 'nodeStorage': 50}
DEBUG:toil.realtimeLogger:Real-time logging disabled
DEBUG:toil.toilState:Found job to run: O/C/jobpBhm_a, with command: True, with checkpoint: False, with services: False, with stack: False
DEBUG:toil.leader:Found 1 jobs to start and 0 jobs with successors to run
DEBUG:toil.leader:Checked batch system has no running jobs and no updated jobs
DEBUG:toil.serviceManager:Initializing service manager
DEBUG:toil.leader:Built the jobs list, currently have 1 jobs to update and 0 jobs issued
DEBUG:toil.leader:Updating status of job 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a with ID O/C/jobpBhm_a: with result status: 0
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:Issued the job command: /home/pkd7/venv/bin/_toil_worker file:///home/pkd7/example.cwl file:/home/pkd7/trail6 O/C/jobpBhm_a with job id: 0
INFO:toil.leader:Issued job 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a with job batch system ID: 0 and cores: 1, disk: 3.0 G, and memory: 2.0 G
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:Running ['qsub', '-V', '-b', 'y', '-terse', '-j', 'y', '-cwd', '-N', 'toil_job_0', '-hard', '-l', u'vf=2097152K,h_vmem=2097152K', '/home/pkd7/venv/bin/_toil_worker file:///home/pkd7/example.cwl file:/home/pkd7/trail6 O/C/jobpBhm_a']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:Submitted job 171724
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171724']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171724']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171724']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171724']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171724']
DEBUG:toil.batchSystems.gridengine:Exit Status: '127'
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:UpdatedJobsQueue Item: (0, 127)
WARNING:toil.leader:Job failed with exit value 127: 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a
DEBUG:toil.leader:Job 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a continues to exist (i.e. has more to do)
WARNING:toil.leader:No log file is present, despite job failing: 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a with ID O/C/jobpBhm_a to 1
DEBUG:toil.leader:Added job: 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a to active jobs
DEBUG:toil.leader:Built the jobs list, currently have 1 jobs to update and 0 jobs issued
DEBUG:toil.leader:Updating status of job 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a with ID O/C/jobpBhm_a: with result status: 127
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:Issued the job command: /home/pkd7/venv/bin/_toil_worker file:///home/pkd7/example.cwl file:/home/pkd7/trail6 O/C/jobpBhm_a with job id: 1
INFO:toil.leader:Issued job 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a with job batch system ID: 1 and cores: 1, disk: 3.0 G, and memory: 2.0 G
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:Running ['qsub', '-V', '-b', 'y', '-terse', '-j', 'y', '-cwd', '-N', 'toil_job_1', '-hard', '-l', u'vf=2097152K,h_vmem=2097152K', '/home/pkd7/venv/bin/_toil_worker file:///home/pkd7/example.cwl file:/home/pkd7/trail6 O/C/jobpBhm_a']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:Submitted job 171725
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171725']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171725']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171725']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171725']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171725']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171725']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171725']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171725']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171725']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171725']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171725']
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.gridengine:Running ['qacct', '-j', '171725']
DEBUG:toil.batchSystems.gridengine:Exit Status: '127'
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:UpdatedJobsQueue Item: (1, 127)
WARNING:toil.leader:Job failed with exit value 127: 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a
DEBUG:toil.leader:Job 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a continues to exist (i.e. has more to do)
WARNING:toil.leader:No log file is present, despite job failing: 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a with ID O/C/jobpBhm_a to 0
DEBUG:toil.leader:Added job: 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a to active jobs
DEBUG:toil.leader:Built the jobs list, currently have 1 jobs to update and 0 jobs issued
DEBUG:toil.leader:Updating status of job 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a with ID O/C/jobpBhm_a: with result status: 127
DEBUG:toil.leader:Found new failed successors: of job: 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a
WARNING:toil.leader:Job 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a with ID O/C/jobpBhm_a is completely failed
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.leader:Finished the main loop: no jobs left to run.
DEBUG:toil.serviceManager:Waiting for service manager thread to finish ...
DEBUG:toil.serviceManager:Received signal to quit starting services.
DEBUG:toil.serviceManager:... finished shutting down the service manager. Took 0.499466180801 seconds
DEBUG:toil.statsAndLogging:Waiting for stats and logging collator thread to finish ...
DEBUG:toil.statsAndLogging:... finished collating stats and logs. Took 0.100440979004 seconds
INFO:toil.leader:Finished toil run with 1 failed jobs.
INFO:toil.leader:Failed jobs at end of the run: 'file:///home/pkd7/example.cwl' echo O/C/jobpBhm_a
DEBUG:toil.common:Shutting down batch system ...
DEBUG:toil.common:Obtained node ID 5f570a62d818f48e77a178795980c34f from file /var/lib/dbus/machine-id
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:No activity, sleeping for 1s
DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:Received queue sentinel.
DEBUG:toil.common:... finished shutting down the batch system in 0.344773054123 seconds.
Traceback (most recent call last):
  File "/home/pkd7/venv/bin/cwltoil", line 11, in <module>
    sys.exit(main())
  File "/home/pkd7/venv/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 1220, in main
    outobj = toil.start(wf1)
  File "/home/pkd7/venv/lib/python2.7/site-packages/toil/common.py", line 784, in start
    return self._runMainLoop(rootJobGraph)
  File "/home/pkd7/venv/lib/python2.7/site-packages/toil/common.py", line 1059, in _runMainLoop
    jobCache=self._jobCache).run()
  File "/home/pkd7/venv/lib/python2.7/site-packages/toil/leader.py", line 237, in run
    raise FailedJobsException(self.config.jobStore, self.toilState.totalFailedJobs, self.jobStore)
toil.leader.FailedJobsException

Issue is synchronized with this Jira Story (friendlyId: TOIL-56)

phanikishore2 commented 5 years ago

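The problem seems to be the spaces in the command. The whole command string is passed to subprocess.Popen as a single list element, so qsub receives it as one argument instead of the worker executable followed by its arguments.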

I made the following change in gridengine.py:

    def prepareSubmission(self, cpu, memory, jobID, command):
        return self.prepareQsub(cpu, memory, jobID) + command.split()

Originally it was:

    def prepareSubmission(self, cpu, memory, jobID, command):
        return self.prepareQsub(cpu, memory, jobID) + [command]

I am using Python 2.7.2 on Ubuntu 16.04.5 LTS.
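
For readers who want to reproduce the effect outside Toil, here is a minimal sketch of why a single list element fails. It is not Toil's or qsub's code: the command is a hypothetical stand-in that runs the current Python interpreter, and the `run` helper maps a "program not found" error to 127 only to mirror the shell convention seen in the log above.

```python
import shlex
import subprocess
import sys

# Hypothetical stand-in for the worker command: the current interpreter
# followed by arguments, joined into one string as in the issued job command.
command = "{} -c pass".format(shlex.quote(sys.executable))


def run(args):
    """Run args and return the exit code; return 127 if the program is not found.

    The 127 mapping is an assumption made for illustration; it copies the
    shell's "command not found" convention.
    """
    try:
        return subprocess.call(args)
    except OSError:  # FileNotFoundError is a subclass of OSError on Python 3
        return 127


# The whole string is one list element, so it is looked up as a program name.
single_element = run([command])

# The string is split into the program and its arguments first.
split_args = run(shlex.split(command))

print(single_element, split_args)
```

With the single-element list, the operating system looks for a program whose name contains the spaces and the arguments, which does not exist; splitting the string first lets the interpreter start normally.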

DailyDreaming commented 5 years ago

@phanikishore2 Glad it was solved. If you'd like to submit a PR, we can apply the fix. However, I'd recommend using shlex.split() (https://docs.python.org/3/library/shlex.html) instead of plain splitting on whitespace, as it interprets quotes intelligently.
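
A short sketch of the difference between the two approaches, using a made-up worker command whose quoted argument contains a space (the paths and job ID are hypothetical, not taken from the log):

```python
import shlex

# Hypothetical command string; the quoted argument contains a space.
command = "_toil_worker 'file:///home/user/my workflow.cwl' file:/home/user/jobstore jobID"

# Plain whitespace splitting breaks the quoted argument in two
# and leaves the quote characters in place.
naive_args = command.split()

# shlex.split follows POSIX shell quoting rules and keeps it as one argument.
shell_args = shlex.split(command)

print(naive_args)
print(shell_args)
```

For commands without quotes or spaces inside arguments, such as the one in this issue, both calls give the same list; the difference only shows up once an argument needs quoting.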

mr-c commented 2 years ago

Fixed in #4150