DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0
900 stars 240 forks source link

sort.py fails on Mesos too. #1718

Closed paul-sanghera closed 7 years ago

paul-sanghera commented 7 years ago

The command and output: mesosbox@ip-172-31-16-87:~$ python2.7 sort.py --fileToSort ./file_to_sort.txt --batchSystem=mesos --mesosMaster=mesos-master:5050 aws:us-west-2:my-jobstore ip-172-31-16-87: 2017-06-15 20:07:46,152 INFO: toil.lib.bioio: Logging set at level: INFO ip-172-31-16-87: 2017-06-15 20:07:46,152 INFO: toil.lib.bioio: Logging set at level: INFO ip-172-31-16-87: 2017-06-15 20:07:46,239 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'toil-registry'. ip-172-31-16-87: 2017-06-15 20:07:46,290 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'toil-registry'. ip-172-31-16-87: 2017-06-15 20:07:46,352 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'my-jobstore--jobs'. ip-172-31-16-87: 2017-06-15 20:07:46,364 ERROR: boto: 400 Bad Request ip-172-31-16-87: 2017-06-15 20:07:46,364 ERROR: boto: <?xml version="1.0"?>

NoSuchDomainThe specified domain does not exist.0.00001372003959b25c-b3e1-0fc5-008e-ab0b9a40d52c

ip-172-31-16-87: 2017-06-15 20:07:47,448 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'my-jobstore--files'. ip-172-31-16-87: 2017-06-15 20:07:47,471 ERROR: boto: 400 Bad Request ip-172-31-16-87: 2017-06-15 20:07:47,471 ERROR: boto: <?xml version="1.0"?>

NoSuchDomainThe specified domain does not exist.0.000013720060cee259-e503-594f-7bbf-4dd93157a5a3

ip-172-31-16-87: 2017-06-15 20:07:48,550 INFO: toil.jobStores.aws.jobStore: Binding to job store bucket 'my-jobstore--files'. ip-172-31-16-87: 2017-06-15 20:07:50,562 INFO: toil.jobStores.abstractJobStore: The workflow ID is: '497913ca-073b-4493-ac35-88a3af4073b1' ip-172-31-16-87: 2017-06-15 20:07:50,892 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'toil-registry'. ip-172-31-16-87: 2017-06-15 20:07:51,134 INFO: toil.common: Using the mesos batch system I0615 20:07:51.519343 2691 sched.cpp:164] Version: 0.25.0 I0615 20:07:51.521729 2701 sched.cpp:262] New master detected at master@172.31.16.87:5050 I0615 20:07:51.521889 2701 sched.cpp:272] No credentials provided. Attempting to register without authentication I0615 20:07:51.523216 2705 sched.cpp:641] Framework registered with e2c3a9b8-026f-4b6c-9c6b-1b0e60519d93-0004 ip-172-31-16-87: 2017-06-15 20:07:51,556 INFO: toil.common: Written the environment for the jobs to the environment file ip-172-31-16-87: 2017-06-15 20:07:51,556 INFO: toil.common: Caching all jobs in job store ip-172-31-16-87: 2017-06-15 20:07:51,850 INFO: toil.common: 0 jobs downloaded. ip-172-31-16-87: 2017-06-15 20:07:52,055 INFO: toil.realtimeLogger: Real-time logging disabled ip-172-31-16-87: 2017-06-15 20:07:52,056 INFO: toil.leader: (Re)building internal scheduler state ip-172-31-16-87: 2017-06-15 20:07:52,057 INFO: toil.leader: Checked batch system has no running jobs and no updated jobs ip-172-31-16-87: 2017-06-15 20:07:52,057 INFO: toil.leader: Found 1 jobs to start and 0 jobs with successors to run ip-172-31-16-87: 2017-06-15 20:07:52,060 INFO: toil.leader: Starting the main loop ip-172-31-16-87: 2017-06-15 20:07:57,315 INFO: toil.batchSystems.mesos.batchSystem: Preparing to launch Mesos task 0 using offer e2c3a9b8-026f-4b6c-9c6b-1b0e60519d93-O24 ... ip-172-31-16-87: 2017-06-15 20:07:57,315 INFO: toil.batchSystems.mesos.batchSystem: Launched Mesos task 0. ip-172-31-16-87: 2017-06-15 20:07:57,316 INFO: toil.batchSystems.mesos.batchSystem: Although there are queued jobs, none of them could be run with theoffers extended to the framework. Enable debug logging to see details about tasks queued and offers made. ip-172-31-16-87: 2017-06-15 20:08:00,848 WARNING: toil.leader: The jobWrapper seems to have left a log file, indicating failure: 6bbe5e4a-dac4-431c-b902-c1745c2aba98 ip-172-31-16-87: 2017-06-15 20:08:00,863 WARNING: toil.leader: Reporting file: 6bbe5e4a-dac4-431c-b902-c1745c2aba98 ip-172-31-16-87: 2017-06-15 20:08:00,863 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: ---TOIL WORKER OUTPUT LOG--- ip-172-31-16-87: 2017-06-15 20:08:00,863 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: WARNING:toil.resource:The localize() method should only be invoked on a worker. ip-172-31-16-87: 2017-06-15 20:08:00,863 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: WARNING:toil.resource:The localize() method should only be invoked on a worker. ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: Traceback (most recent call last): ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: File "/usr/local/lib/python2.7/dist-packages/toil/worker.py", line 330, in main ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: job._runner(jobWrapper=jobWrapper, jobStore=jobStore, fileStore=fileStore) ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: File "/usr/local/lib/python2.7/dist-packages/toil/job.py", line 1043, in _runner ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: returnValues = self._run(jobWrapper, fileStore) ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: File "/usr/local/lib/python2.7/dist-packages/toil/job.py", line 991, in _run ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: return self.run(fileStore) ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: File "/usr/local/lib/python2.7/dist-packages/toil/job.py", line 1136, in run ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs) ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: File "/home/mesosbox/sort.py", line 33, in setup ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: AssertionError ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: Exiting the worker because of a failed jobWrapper on host ip-172-31-24-108 ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: ERROR:toil.worker:Exiting the worker because of a failed jobWrapper on host ip-172-31-24-108 ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: WARNING:toil.jobWrapper:Due to failure we are reducing the remaining retry count of job 6bbe5e4a-dac4-431c-b902-c1745c2aba98 to 0 ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: WARNING:toil.jobWrapper:We have increased the default memory of the failed job to 2147483648 bytes ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: Job: 6bbe5e4a-dac4-431c-b902-c1745c2aba98 is completely failed ip-172-31-16-87: 2017-06-15 20:08:00,865 INFO: toil.leader: No jobs left to run so exiting. ip-172-31-16-87: 2017-06-15 20:08:00,865 INFO: toil.leader: Finished the main loop ip-172-31-16-87: 2017-06-15 20:08:00,865 INFO: toil.leader: Waiting for stats and logging collator process to finish ... ip-172-31-16-87: 2017-06-15 20:08:00,976 INFO: toil.leader: ... finished collating stats and logs. Took 0.110114812851 seconds ip-172-31-16-87: 2017-06-15 20:08:00,976 INFO: toil.leader: Waiting for service manager thread to finish ... ip-172-31-16-87: 2017-06-15 20:08:01,058 INFO: toil.leader: ... finished shutting down the service manager. Took 0.0823500156403 seconds ip-172-31-16-87: 2017-06-15 20:08:01,069 INFO: toil.leader: Finished toil run with 1 failed jobs ip-172-31-16-87: 2017-06-15 20:08:01,069 INFO: toil.leader: Failed jobs at end of the run: set(['6bbe5e4a-dac4-431c-b902-c1745c2aba98']) ip-172-31-16-87: 2017-06-15 20:08:01,069 INFO: toil.batchSystems.mesos.batchSystem: Stopping Mesos driver I0615 20:08:01.069613 2691 sched.cpp:1771] Asked to stop the driver I0615 20:08:01.069730 2700 sched.cpp:1040] Stopping framework 'e2c3a9b8-026f-4b6c-9c6b-1b0e60519d93-0004' ip-172-31-16-87: 2017-06-15 20:08:01,069 INFO: toil.batchSystems.mesos.batchSystem: Joining Mesos driver ip-172-31-16-87: 2017-06-15 20:08:01,070 INFO: toil.batchSystems.mesos.batchSystem: Joined Mesos driver Traceback (most recent call last): File "sort.py", line 189, in main() File "sort.py", line 186, in main memory=sortMemory), options) File "/usr/local/lib/python2.7/dist-packages/toil/job.py", line 500, in startToil return toil.start(job) File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 562, in start return self._runMainLoop(job) File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 791, in _runMainLoop jobCache=self._jobCache) File "/usr/local/lib/python2.7/dist-packages/toil/leader.py", line 712, in mainLoop raise FailedJobsException( config.jobStore, len(toilState.totalFailedJobs) ) toil.leader.FailedJobsException: The job store 'aws:us-west-2:my-jobstore' contains 1 failed jobs

arkal commented 7 years ago

Can you rerun this with --logDebug and post the output please?

Also i believe this test was not meant to be run with mesos.--fileToSort will exist on the leader and not on the slave. This will cause job.fileStore.writeGlobalFile to fail in a few lines of code.

paul-sanghera commented 7 years ago

I see your point, Arjun. But, It there 'real' example (greater that HelloWorld.py) of a python program with Toil running in AWS cloud? Thanks. Since you asked, here is the output with --logDebug option: mesosbox@ip-172-31-24-29:~$ python2.7 sort.py --fileToSort ./file_to_sort.txt --logDebug --batchSystem=mesos --mesosMaster=mesos-master:5050 aws:us-west-2:my-jobstore ip-172-31-24-29: 2017-06-16 04:00:26,738 INFO: toil.lib.bioio: Logging set at level: DEBUG ip-172-31-24-29: 2017-06-16 04:00:26,739 INFO: toil.lib.bioio: Logging set at level: DEBUG ip-172-31-24-29: 2017-06-16 04:00:27,214 DEBUG: toil.jobStores.aws.jobStore: Instantiating <class 'toil.jobStores.aws.jobStore.AWSJobStore'> for region us-west-2 and name prefix 'my-jobstore' ip-172-31-24-29: 2017-06-16 04:00:27,222 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'toil-registry'. ip-172-31-24-29: 2017-06-16 04:00:27,298 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'toil-registry'. ip-172-31-24-29: 2017-06-16 04:00:27,339 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'my-jobstore--jobs'. ip-172-31-24-29: 2017-06-16 04:00:28,465 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'my-jobstore--files'. ip-172-31-24-29: 2017-06-16 04:00:29,568 INFO: toil.jobStores.aws.jobStore: Binding to job store bucket 'my-jobstore--files'. ip-172-31-24-29: 2017-06-16 04:00:29,654 DEBUG: toil.jobStores.aws.jobStore: Bucket 'my-jobstore--files' does not exist. ip-172-31-24-29: 2017-06-16 04:00:29,654 DEBUG: toil.jobStores.aws.jobStore: Creating bucket 'my-jobstore--files'. ip-172-31-24-29: 2017-06-16 04:00:30,584 DEBUG: toil.jobStores.aws.jobStore: Created new job store bucket 'my-jobstore--files'. ip-172-31-24-29: 2017-06-16 04:00:30,585 INFO: toil.jobStores.abstractJobStore: The workflow ID is: '7bab4811-fb5a-48b0-9ce1-a6e1e84d94f4' ip-172-31-24-29: 2017-06-16 04:00:30,935 DEBUG: toil.jobStores.aws.jobStore: Wrote FileInfo(fileID='a77581f6-ad82-5142-acb7-5a6b01eab3fb', ownerID='891f7db6-e4d9-4221-a58e-ab6cc4395f94', encrypted=False, version='', previousVersion='', content='\x80\x02ctoil.co...ryCountq8K\x00ub.', _numContentChunks=2) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710> for shared file 'config.pickle'. ip-172-31-24-29: 2017-06-16 04:00:30,935 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'toil-registry'. ip-172-31-24-29: 2017-06-16 04:00:31,511 INFO: toil.common: Using the mesos batch system ip-172-31-24-29: 2017-06-16 04:00:31,580 DEBUG: toil.jobStores.aws.jobStore: Wrote FileInfo(fileID='9ac1b61e-9bab-5686-b46b-0c21bf8f9a25', ownerID='891f7db6-e4d9-4221-a58e-ab6cc4395f94', encrypted=False, version='', previousVersion='', content='PK\x03\x04\x14\x...9\x00\x00\x00\x00', _numContentChunks=4) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710> for shared file '1febc5e3fde8f3e607fc95e43a3711d5'. ip-172-31-24-29: 2017-06-16 04:00:32,921 DEBUG: toil.jobStores.aws.jobStore: Wrote FileInfo(fileID='6681340b-9a30-5117-9d8c-f805c4e9ebd8', ownerID='891f7db6-e4d9-4221-a58e-ab6cc4395f94', encrypted=False, version='', previousVersion='', content='toil.resource.Di...5243a983a6c8fbf"]', _numContentChunks=2) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710> for shared file 'userScript'. I0616 04:00:32.944926 2348 sched.cpp:164] Version: 0.25.0 I0616 04:00:32.951406 2357 sched.cpp:262] New master detected at master@172.31.24.29:5050 I0616 04:00:32.951586 2357 sched.cpp:272] No credentials provided. Attempting to register without authentication I0616 04:00:32.962105 2360 sched.cpp:641] Framework registered with 4e4a1aa5-9a38-40d8-86ec-bb07af96a469-0000 ip-172-31-24-29: 2017-06-16 04:00:32,962 DEBUG: toil.batchSystems.mesos.batchSystem: Registered with framework ID 4e4a1aa5-9a38-40d8-86ec-bb07af96a469-0000 ip-172-31-24-29: 2017-06-16 04:00:32,963 DEBUG: toil.batchSystems.mesos.batchSystem: There are no queued tasks. Declining Mesos offers. ip-172-31-24-29: 2017-06-16 04:00:32,963 DEBUG: toil.batchSystems.mesos.batchSystem: Declining offer 4e4a1aa5-9a38-40d8-86ec-bb07af96a469-O0. ip-172-31-24-29: 2017-06-16 04:00:32,963 DEBUG: toil.batchSystems.mesos.batchSystem: Declining offer 4e4a1aa5-9a38-40d8-86ec-bb07af96a469-O1. ip-172-31-24-29: 2017-06-16 04:00:32,989 DEBUG: toil.jobStores.aws.jobStore: Wrote FileInfo(fileID='01c37f76-e635-5f9c-ad18-6ed1ae1b2e2a', ownerID='891f7db6-e4d9-4221-a58e-ab6cc4395f94', encrypted=False, version='', previousVersion='', content='\x80\x02(cos\n_E...2.31.24.29 22usb.', _numContentChunks=2) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710> for shared file 'environment.pickle'. ip-172-31-24-29: 2017-06-16 04:00:32,989 INFO: toil.common: Written the environment for the jobs to the environment file ip-172-31-24-29: 2017-06-16 04:00:32,989 INFO: toil.common: Caching all jobs in job store ip-172-31-24-29: 2017-06-16 04:00:33,274 INFO: toil.common: 0 jobs downloaded. ip-172-31-24-29: 2017-06-16 04:00:33,304 DEBUG: toil.jobStores.aws.jobStore: Created FileInfo(fileID='e79689e1-6416-4aa8-8ce5-3355270c0684', ownerID=None, encrypted=False, version=None, previousVersion=None, content=None, _numContentChunks=0) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710>. ip-172-31-24-29: 2017-06-16 04:00:33,349 DEBUG: toil.jobStores.aws.jobStore: Wrote FileInfo(fileID='de90cdf4-4065-5763-8ae2-b71758c6f931', ownerID='891f7db6-e4d9-4221-a58e-ab6cc4395f94', encrypted=False, version='', previousVersion='', content='e79689e1-6416-4aa8-8ce5-3355270c0684', _numContentChunks=1) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710> for shared file 'rootJobReturnValue'. ip-172-31-24-29: 2017-06-16 04:00:33,349 DEBUG: toil.jobStores.aws.jobStore: Creating job 8a6bf41d-be7a-4de6-a3c5-89068d621930 for '' ip-172-31-24-29: 2017-06-16 04:00:33,422 DEBUG: toil.jobStores.aws.jobStore: Wrote FileInfo(fileID='ac24b074-ebda-47a3-996a-a0faab6f90e2', ownerID='8a6bf41d-be7a-4de6-a3c5-89068d621930', encrypted=False, version='', previousVersion='', content='\x80\x02ctoil.jo...\x07_kwargsq#}ub.', _numContentChunks=1) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710>. ip-172-31-24-29: 2017-06-16 04:00:33,423 DEBUG: toil.jobStores.aws.jobStore: Updating job 8a6bf41d-be7a-4de6-a3c5-89068d621930 ip-172-31-24-29: 2017-06-16 04:00:33,451 DEBUG: toil.jobStores.aws.jobStore: Updating job 8a6bf41d-be7a-4de6-a3c5-89068d621930 ip-172-31-24-29: 2017-06-16 04:00:33,506 DEBUG: toil.jobStores.aws.jobStore: Wrote FileInfo(fileID='a09ece78-f7f5-580e-bcf3-59b8e4b7c1fa', ownerID='891f7db6-e4d9-4221-a58e-ab6cc4395f94', encrypted=False, version='', previousVersion='', content='8a6bf41d-be7a-4de6-a3c5-89068d621930', _numContentChunks=1) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710> for shared file 'rootJobStoreID'. ip-172-31-24-29: 2017-06-16 04:00:33,506 INFO: toil.realtimeLogger: Real-time logging disabled ip-172-31-24-29: 2017-06-16 04:00:33,524 INFO: toil.leader: (Re)building internal scheduler state ip-172-31-24-29: 2017-06-16 04:00:33,524 DEBUG: toil.leader: Found job to run: 8a6bf41d-be7a-4de6-a3c5-89068d621930, with command: True, with checkpoint: False, with services: False, with stack: False ip-172-31-24-29: 2017-06-16 04:00:33,524 INFO: toil.leader: Checked batch system has no running jobs and no updated jobs ip-172-31-24-29: 2017-06-16 04:00:33,524 INFO: toil.leader: Found 1 jobs to start and 0 jobs with successors to run ip-172-31-24-29: 2017-06-16 04:00:33,551 INFO: toil.leader: Starting the main loop ip-172-31-24-29: 2017-06-16 04:00:33,551 DEBUG: toil.leader: Built the jobs list, currently have 1 jobs to update and 0 jobs issued ip-172-31-24-29: 2017-06-16 04:00:33,552 DEBUG: toil.leader: Updating status of job: 8a6bf41d-be7a-4de6-a3c5-89068d621930 with result status: 0 ip-172-31-24-29: 2017-06-16 04:00:33,552 DEBUG: toil.batchSystems.mesos.batchSystem: Queueing the job command: _toil_worker aws:us-west-2:my-jobstore 8a6bf41d-be7a-4de6-a3c5-89068d621930 with job id: 0 ... ip-172-31-24-29: 2017-06-16 04:00:33,552 DEBUG: toil.batchSystems.mesos.batchSystem: ... queued ip-172-31-24-29: 2017-06-16 04:00:33,552 DEBUG: toil.leader: Issued job with job store ID: 8a6bf41d-be7a-4de6-a3c5-89068d621930 and job batch system ID: 0 and cores: 1.00, disk: 2147483648.00, and memory: 1048576000.00 ip-172-31-24-29: 2017-06-16 04:00:38,733 DEBUG: toil.batchSystems.mesos.batchSystem: Received offer 4e4a1aa5-9a38-40d8-86ec-bb07af96a469-O2 with 6455.00 MiB memory, 2.00 core(s) and 23386.00 MiB of disk. ip-172-31-24-29: 2017-06-16 04:00:38,750 INFO: toil.batchSystems.mesos.batchSystem: Preparing to launch Mesos task 0 using offer 4e4a1aa5-9a38-40d8-86ec-bb07af96a469-O2 ... ip-172-31-24-29: 2017-06-16 04:00:38,750 INFO: toil.batchSystems.mesos.batchSystem: Launched Mesos task 0. ip-172-31-24-29: 2017-06-16 04:00:38,750 DEBUG: toil.batchSystems.mesos.batchSystem: Received offer 4e4a1aa5-9a38-40d8-86ec-bb07af96a469-O3 with 6455.00 MiB memory, 2.00 core(s) and 23386.00 MiB of disk. ip-172-31-24-29: 2017-06-16 04:00:38,751 INFO: toil.batchSystems.mesos.batchSystem: Although there are queued jobs, none of them could be run with theoffers extended to the framework. Enable debug logging to see details about tasks queued and offers made. ip-172-31-24-29: 2017-06-16 04:00:40,410 DEBUG: toil.batchSystems.mesos.batchSystem: Job 0 is in state 'TASK_RUNNING'. ip-172-31-24-29: 2017-06-16 04:00:44,256 DEBUG: toil.batchSystems.mesos.batchSystem: Job 0 is in state 'TASK_FINISHED'. ip-172-31-24-29: 2017-06-16 04:00:44,269 DEBUG: toil.batchSystems.mesos.batchSystem: Job 0 ended with status 0, took (3.698038101196289,) seconds. ip-172-31-24-29: 2017-06-16 04:00:44,269 DEBUG: toil.leader: Batch system is reporting that the jobWrapper with batch system ID: 0 and jobWrapper store ID: 8a6bf41d-be7a-4de6-a3c5-89068d621930 ended successfully ip-172-31-24-29: 2017-06-16 04:00:44,692 DEBUG: toil.leader: Job 8a6bf41d-be7a-4de6-a3c5-89068d621930 continues to exist (i.e. has more to do) ip-172-31-24-29: 2017-06-16 04:00:44,702 DEBUG: toil.jobStores.aws.jobStore: Loaded job 8a6bf41d-be7a-4de6-a3c5-89068d621930 ip-172-31-24-29: 2017-06-16 04:00:44,702 WARNING: toil.leader: The jobWrapper seems to have left a log file, indicating failure: 8a6bf41d-be7a-4de6-a3c5-89068d621930 ip-172-31-24-29: 2017-06-16 04:00:44,722 DEBUG: toil.jobStores.aws.jobStore: Reading FileInfo(fileID='9cf0aea6-f26e-4166-818a-2e76370c8662', ownerID='8a6bf41d-be7a-4de6-a3c5-89068d621930', encrypted=False, version='', previousVersion='', content='---TOIL WORKER O...147483648 bytes\n', _numContentChunks=2) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710> into stream. ip-172-31-24-29: 2017-06-16 04:00:44,723 WARNING: toil.leader: Reporting file: 8a6bf41d-be7a-4de6-a3c5-89068d621930 ip-172-31-24-29: 2017-06-16 04:00:44,723 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: ---TOIL WORKER OUTPUT LOG--- ip-172-31-24-29: 2017-06-16 04:00:44,723 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: Next available file descriptor: 7 ip-172-31-24-29: 2017-06-16 04:00:44,724 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.worker:Next available file descriptor: 7 ip-172-31-24-29: 2017-06-16 04:00:44,724 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.jobStores.aws.jobStore:Loaded job 8a6bf41d-be7a-4de6-a3c5-89068d621930 ip-172-31-24-29: 2017-06-16 04:00:44,724 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: Parsed jobWrapper ip-172-31-24-29: 2017-06-16 04:00:44,724 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.worker:Parsed jobWrapper ip-172-31-24-29: 2017-06-16 04:00:44,724 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: Got a command to run: _toil ac24b074-ebda-47a3-996a-a0faab6f90e2 /home/mesosbox sort ip-172-31-24-29: 2017-06-16 04:00:44,724 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.worker:Got a command to run: _toil ac24b074-ebda-47a3-996a-a0faab6f90e2 /home/mesosbox sort ip-172-31-24-29: 2017-06-16 04:00:44,724 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.job:Loading user module ModuleDescriptor(dirPath='/home/mesosbox', name='sort'). ip-172-31-24-29: 2017-06-16 04:00:44,724 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: WARNING:toil.resource:The localize() method should only be invoked on a worker. ip-172-31-24-29: 2017-06-16 04:00:44,724 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.jobStores.aws.jobStore:Reading FileInfo(fileID='ac24b074-ebda-47a3-996a-a0faab6f90e2', ownerID='8a6bf41d-be7a-4de6-a3c5-89068d621930', encrypted=False, version='', previousVersion='', content='\x80\x02ctoil.jo...\x07_kwargsq#}ub.', _numContentChunks=1) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7f95d503fc90> into stream. ip-172-31-24-29: 2017-06-16 04:00:44,725 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.job:Getting JobFunctionWrappingJob from module toil.job. ip-172-31-24-29: 2017-06-16 04:00:44,725 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.job:Getting defaultdict from module collections. ip-172-31-24-29: 2017-06-16 04:00:44,725 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.job:Getting list from module builtin. ip-172-31-24-29: 2017-06-16 04:00:44,725 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.job:Getting ModuleDescriptor from module toil.resource. ip-172-31-24-29: 2017-06-16 04:00:44,725 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.job:Getting set from module builtin. ip-172-31-24-29: 2017-06-16 04:00:44,725 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.fileStore:CACHE: Obtained lock on file /var/lib/toil/toil-7bab4811-fb5a-48b0-9ce1-a6e1e84d94f4/cache-7bab4811-fb5a-48b0-9ce1-a6e1e84d94f4/.cacheLock ip-172-31-24-29: 2017-06-16 04:00:44,725 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.fileStore:CACHE: Released lock ip-172-31-24-29: 2017-06-16 04:00:44,725 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.fileStore:CACHE: Obtained lock on file /var/lib/toil/toil-7bab4811-fb5a-48b0-9ce1-a6e1e84d94f4/cache-7bab4811-fb5a-48b0-9ce1-a6e1e84d94f4/.cacheLock ip-172-31-24-29: 2017-06-16 04:00:44,725 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.fileStore:CACHE: Released lock ip-172-31-24-29: 2017-06-16 04:00:44,725 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.fileStore:CACHE: Obtained lock on file /var/lib/toil/toil-7bab4811-fb5a-48b0-9ce1-a6e1e84d94f4/cache-7bab4811-fb5a-48b0-9ce1-a6e1e84d94f4/.cacheLock ip-172-31-24-29: 2017-06-16 04:00:44,725 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.fileStore:CACHE: Released lock ip-172-31-24-29: 2017-06-16 04:00:44,725 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.job:Loading user function setup from module ModuleDescriptor(dirPath='/home/mesosbox', name='sort'). ip-172-31-24-29: 2017-06-16 04:00:44,726 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: WARNING:toil.resource:The localize() method should only be invoked on a worker. ip-172-31-24-29: 2017-06-16 04:00:44,726 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.fileStore:CACHE: Obtained lock on file /var/lib/toil/toil-7bab4811-fb5a-48b0-9ce1-a6e1e84d94f4/cache-7bab4811-fb5a-48b0-9ce1-a6e1e84d94f4/.cacheLock ip-172-31-24-29: 2017-06-16 04:00:44,726 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.fileStore:CACHE: Released lock ip-172-31-24-29: 2017-06-16 04:00:44,726 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: Traceback (most recent call last): ip-172-31-24-29: 2017-06-16 04:00:44,726 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: File "/usr/local/lib/python2.7/dist-packages/toil/worker.py", line 330, in main ip-172-31-24-29: 2017-06-16 04:00:44,726 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: job._runner(jobWrapper=jobWrapper, jobStore=jobStore, fileStore=fileStore) ip-172-31-24-29: 2017-06-16 04:00:44,726 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: File "/usr/local/lib/python2.7/dist-packages/toil/job.py", line 1043, in _runner ip-172-31-24-29: 2017-06-16 04:00:44,726 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: returnValues = self._run(jobWrapper, fileStore) ip-172-31-24-29: 2017-06-16 04:00:44,726 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: File "/usr/local/lib/python2.7/dist-packages/toil/job.py", line 991, in _run ip-172-31-24-29: 2017-06-16 04:00:44,726 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: return self.run(fileStore) ip-172-31-24-29: 2017-06-16 04:00:44,726 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: File "/usr/local/lib/python2.7/dist-packages/toil/job.py", line 1136, in run ip-172-31-24-29: 2017-06-16 04:00:44,727 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs) ip-172-31-24-29: 2017-06-16 04:00:44,727 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: File "/home/mesosbox/sort.py", line 33, in setup ip-172-31-24-29: 2017-06-16 04:00:44,727 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: AssertionError ip-172-31-24-29: 2017-06-16 04:00:44,727 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: Exiting the worker because of a failed jobWrapper on host ip-172-31-22-62 ip-172-31-24-29: 2017-06-16 04:00:44,727 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: ERROR:toil.worker:Exiting the worker because of a failed jobWrapper on host ip-172-31-22-62 ip-172-31-24-29: 2017-06-16 04:00:44,727 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: DEBUG:toil.jobStores.aws.jobStore:Loaded job 8a6bf41d-be7a-4de6-a3c5-89068d621930 ip-172-31-24-29: 2017-06-16 04:00:44,727 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: WARNING:toil.jobWrapper:Due to failure we are reducing the remaining retry count of job 8a6bf41d-be7a-4de6-a3c5-89068d621930 to 0 ip-172-31-24-29: 2017-06-16 04:00:44,727 WARNING: toil.leader: 8a6bf41d-be7a-4de6-a3c5-89068d621930: WARNING:toil.jobWrapper:We have increased the default memory of the failed job to 2147483648 bytes ip-172-31-24-29: 2017-06-16 04:00:44,727 DEBUG: toil.leader: Added jobWrapper: 8a6bf41d-be7a-4de6-a3c5-89068d621930 to active jobs ip-172-31-24-29: 2017-06-16 04:00:44,727 DEBUG: toil.leader: Built the jobs list, currently have 1 jobs to update and 0 jobs issued ip-172-31-24-29: 2017-06-16 04:00:44,728 DEBUG: toil.leader: Updating status of job: 8a6bf41d-be7a-4de6-a3c5-89068d621930 with result status: 0 ip-172-31-24-29: 2017-06-16 04:00:44,728 WARNING: toil.leader: Job: 8a6bf41d-be7a-4de6-a3c5-89068d621930 is completely failed ip-172-31-24-29: 2017-06-16 04:00:44,728 INFO: toil.leader: No jobs left to run so exiting. ip-172-31-24-29: 2017-06-16 04:00:44,728 INFO: toil.leader: Finished the main loop ip-172-31-24-29: 2017-06-16 04:00:44,728 INFO: toil.leader: Waiting for stats and logging collator process to finish ... ip-172-31-24-29: 2017-06-16 04:00:44,736 DEBUG: toil.batchSystems.mesos.batchSystem: There are no queued tasks. Declining Mesos offers. ip-172-31-24-29: 2017-06-16 04:00:44,736 DEBUG: toil.batchSystems.mesos.batchSystem: Declining offer 4e4a1aa5-9a38-40d8-86ec-bb07af96a469-O4. ip-172-31-24-29: 2017-06-16 04:00:44,736 DEBUG: toil.batchSystems.mesos.batchSystem: Declining offer 4e4a1aa5-9a38-40d8-86ec-bb07af96a469-O5. ip-172-31-24-29: 2017-06-16 04:00:45,778 INFO: toil.leader: ... finished collating stats and logs. Took 1.05017995834 seconds ip-172-31-24-29: 2017-06-16 04:00:45,778 INFO: toil.leader: Waiting for service manager thread to finish ... ip-172-31-24-29: 2017-06-16 04:00:46,526 DEBUG: toil.leader: Received signal to quit starting services. ip-172-31-24-29: 2017-06-16 04:00:46,527 INFO: toil.leader: ... finished shutting down the service manager. Took 0.748196125031 seconds ip-172-31-24-29: 2017-06-16 04:00:46,540 INFO: toil.leader: Finished toil run with 1 failed jobs ip-172-31-24-29: 2017-06-16 04:00:46,540 INFO: toil.leader: Failed jobs at end of the run: set(['8a6bf41d-be7a-4de6-a3c5-89068d621930']) ip-172-31-24-29: 2017-06-16 04:00:46,540 DEBUG: toil.common: Shutting down batch system ... ip-172-31-24-29: 2017-06-16 04:00:46,540 INFO: toil.batchSystems.mesos.batchSystem: Stopping Mesos driver I0616 04:00:46.540859 2348 sched.cpp:1771] Asked to stop the driver ip-172-31-24-29: 2017-06-16 04:00:46,540 INFO: toil.batchSystems.mesos.batchSystem: Joining Mesos driver I0616 04:00:46.540997 2358 sched.cpp:1040] Stopping framework '4e4a1aa5-9a38-40d8-86ec-bb07af96a469-0000' ip-172-31-24-29: 2017-06-16 04:00:46,541 INFO: toil.batchSystems.mesos.batchSystem: Joined Mesos driver ip-172-31-24-29: 2017-06-16 04:00:46,541 DEBUG: toil.common: ... finished shutting down the batch system in 0.00083589553833 seconds. Traceback (most recent call last): File "sort.py", line 189, in main() File "sort.py", line 186, in main memory=sortMemory), options) File "/usr/local/lib/python2.7/dist-packages/toil/job.py", line 500, in startToil return toil.start(job) File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 562, in start return self._runMainLoop(job) File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 791, in _runMainLoop jobCache=self._jobCache) File "/usr/local/lib/python2.7/dist-packages/toil/leader.py", line 712, in mainLoop raise FailedJobsException( config.jobStore, len(toilState.totalFailedJobs) ) toil.leader.FailedJobsException: The job store 'aws:us-west-2:my-jobstore' contains 1 failed jobs mesosbox@ip-172-31-24-29:~$

ejacox commented 7 years ago

Fixed in #1746