Closed paul-sanghera closed 7 years ago
Can you rerun this with --logDebug and post the output please?
Also i believe this test was not meant to be run with mesos.--fileToSort
will exist on the leader and not on the slave. This will cause job.fileStore.writeGlobalFile
to fail in a few lines of code.
I see your point, Arjun. But, It there 'real' example (greater that HelloWorld.py) of a python program with Toil running in AWS cloud? Thanks.
Since you asked, here is the output with --logDebug option:
mesosbox@ip-172-31-24-29:~$ python2.7 sort.py --fileToSort ./file_to_sort.txt --logDebug --batchSystem=mesos --mesosMaster=mesos-master:5050 aws:us-west-2:my-jobstore
ip-172-31-24-29: 2017-06-16 04:00:26,738 INFO: toil.lib.bioio: Logging set at level: DEBUG
ip-172-31-24-29: 2017-06-16 04:00:26,739 INFO: toil.lib.bioio: Logging set at level: DEBUG
ip-172-31-24-29: 2017-06-16 04:00:27,214 DEBUG: toil.jobStores.aws.jobStore: Instantiating <class 'toil.jobStores.aws.jobStore.AWSJobStore'> for region us-west-2 and name prefix 'my-jobstore'
ip-172-31-24-29: 2017-06-16 04:00:27,222 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'toil-registry'.
ip-172-31-24-29: 2017-06-16 04:00:27,298 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'toil-registry'.
ip-172-31-24-29: 2017-06-16 04:00:27,339 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'my-jobstore--jobs'.
ip-172-31-24-29: 2017-06-16 04:00:28,465 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'my-jobstore--files'.
ip-172-31-24-29: 2017-06-16 04:00:29,568 INFO: toil.jobStores.aws.jobStore: Binding to job store bucket 'my-jobstore--files'.
ip-172-31-24-29: 2017-06-16 04:00:29,654 DEBUG: toil.jobStores.aws.jobStore: Bucket 'my-jobstore--files' does not exist.
ip-172-31-24-29: 2017-06-16 04:00:29,654 DEBUG: toil.jobStores.aws.jobStore: Creating bucket 'my-jobstore--files'.
ip-172-31-24-29: 2017-06-16 04:00:30,584 DEBUG: toil.jobStores.aws.jobStore: Created new job store bucket 'my-jobstore--files'.
ip-172-31-24-29: 2017-06-16 04:00:30,585 INFO: toil.jobStores.abstractJobStore: The workflow ID is: '7bab4811-fb5a-48b0-9ce1-a6e1e84d94f4'
ip-172-31-24-29: 2017-06-16 04:00:30,935 DEBUG: toil.jobStores.aws.jobStore: Wrote FileInfo(fileID='a77581f6-ad82-5142-acb7-5a6b01eab3fb', ownerID='891f7db6-e4d9-4221-a58e-ab6cc4395f94', encrypted=False, version='', previousVersion='', content='\x80\x02ctoil.co...ryCountq8K\x00ub.', _numContentChunks=2) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710> for shared file 'config.pickle'.
ip-172-31-24-29: 2017-06-16 04:00:30,935 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'toil-registry'.
ip-172-31-24-29: 2017-06-16 04:00:31,511 INFO: toil.common: Using the mesos batch system
ip-172-31-24-29: 2017-06-16 04:00:31,580 DEBUG: toil.jobStores.aws.jobStore: Wrote FileInfo(fileID='9ac1b61e-9bab-5686-b46b-0c21bf8f9a25', ownerID='891f7db6-e4d9-4221-a58e-ab6cc4395f94', encrypted=False, version='', previousVersion='', content='PK\x03\x04\x14\x...9\x00\x00\x00\x00', _numContentChunks=4) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710> for shared file '1febc5e3fde8f3e607fc95e43a3711d5'.
ip-172-31-24-29: 2017-06-16 04:00:32,921 DEBUG: toil.jobStores.aws.jobStore: Wrote FileInfo(fileID='6681340b-9a30-5117-9d8c-f805c4e9ebd8', ownerID='891f7db6-e4d9-4221-a58e-ab6cc4395f94', encrypted=False, version='', previousVersion='', content='toil.resource.Di...5243a983a6c8fbf"]', _numContentChunks=2) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710> for shared file 'userScript'.
I0616 04:00:32.944926 2348 sched.cpp:164] Version: 0.25.0
I0616 04:00:32.951406 2357 sched.cpp:262] New master detected at master@172.31.24.29:5050
I0616 04:00:32.951586 2357 sched.cpp:272] No credentials provided. Attempting to register without authentication
I0616 04:00:32.962105 2360 sched.cpp:641] Framework registered with 4e4a1aa5-9a38-40d8-86ec-bb07af96a469-0000
ip-172-31-24-29: 2017-06-16 04:00:32,962 DEBUG: toil.batchSystems.mesos.batchSystem: Registered with framework ID 4e4a1aa5-9a38-40d8-86ec-bb07af96a469-0000
ip-172-31-24-29: 2017-06-16 04:00:32,963 DEBUG: toil.batchSystems.mesos.batchSystem: There are no queued tasks. Declining Mesos offers.
ip-172-31-24-29: 2017-06-16 04:00:32,963 DEBUG: toil.batchSystems.mesos.batchSystem: Declining offer 4e4a1aa5-9a38-40d8-86ec-bb07af96a469-O0.
ip-172-31-24-29: 2017-06-16 04:00:32,963 DEBUG: toil.batchSystems.mesos.batchSystem: Declining offer 4e4a1aa5-9a38-40d8-86ec-bb07af96a469-O1.
ip-172-31-24-29: 2017-06-16 04:00:32,989 DEBUG: toil.jobStores.aws.jobStore: Wrote FileInfo(fileID='01c37f76-e635-5f9c-ad18-6ed1ae1b2e2a', ownerID='891f7db6-e4d9-4221-a58e-ab6cc4395f94', encrypted=False, version='', previousVersion='', content='\x80\x02(cos\n_E...2.31.24.29 22usb.', _numContentChunks=2) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710> for shared file 'environment.pickle'.
ip-172-31-24-29: 2017-06-16 04:00:32,989 INFO: toil.common: Written the environment for the jobs to the environment file
ip-172-31-24-29: 2017-06-16 04:00:32,989 INFO: toil.common: Caching all jobs in job store
ip-172-31-24-29: 2017-06-16 04:00:33,274 INFO: toil.common: 0 jobs downloaded.
ip-172-31-24-29: 2017-06-16 04:00:33,304 DEBUG: toil.jobStores.aws.jobStore: Created FileInfo(fileID='e79689e1-6416-4aa8-8ce5-3355270c0684', ownerID=None, encrypted=False, version=None, previousVersion=None, content=None, _numContentChunks=0) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710>.
ip-172-31-24-29: 2017-06-16 04:00:33,349 DEBUG: toil.jobStores.aws.jobStore: Wrote FileInfo(fileID='de90cdf4-4065-5763-8ae2-b71758c6f931', ownerID='891f7db6-e4d9-4221-a58e-ab6cc4395f94', encrypted=False, version='', previousVersion='', content='e79689e1-6416-4aa8-8ce5-3355270c0684', _numContentChunks=1) bound to <toil.jobStores.aws.jobStore.AWSJobStore object at 0x7fb820033710> for shared file 'rootJobReturnValue'.
ip-172-31-24-29: 2017-06-16 04:00:33,349 DEBUG: toil.jobStores.aws.jobStore: Creating job 8a6bf41d-be7a-4de6-a3c5-89068d621930 for '
Fixed in #1746
The command and output: mesosbox@ip-172-31-16-87:~$ python2.7 sort.py --fileToSort ./file_to_sort.txt --batchSystem=mesos --mesosMaster=mesos-master:5050 aws:us-west-2:my-jobstore ip-172-31-16-87: 2017-06-15 20:07:46,152 INFO: toil.lib.bioio: Logging set at level: INFO ip-172-31-16-87: 2017-06-15 20:07:46,152 INFO: toil.lib.bioio: Logging set at level: INFO ip-172-31-16-87: 2017-06-15 20:07:46,239 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'toil-registry'. ip-172-31-16-87: 2017-06-15 20:07:46,290 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'toil-registry'. ip-172-31-16-87: 2017-06-15 20:07:46,352 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'my-jobstore--jobs'. ip-172-31-16-87: 2017-06-15 20:07:46,364 ERROR: boto: 400 Bad Request ip-172-31-16-87: 2017-06-15 20:07:46,364 ERROR: boto: <?xml version="1.0"?>
NoSuchDomain
ip-172-31-16-87: 2017-06-15 20:07:47,448 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'my-jobstore--files'. ip-172-31-16-87: 2017-06-15 20:07:47,471 ERROR: boto: 400 Bad Request ip-172-31-16-87: 2017-06-15 20:07:47,471 ERROR: boto: <?xml version="1.0"?>
NoSuchDomain
ip-172-31-16-87: 2017-06-15 20:07:48,550 INFO: toil.jobStores.aws.jobStore: Binding to job store bucket 'my-jobstore--files'. ip-172-31-16-87: 2017-06-15 20:07:50,562 INFO: toil.jobStores.abstractJobStore: The workflow ID is: '497913ca-073b-4493-ac35-88a3af4073b1' ip-172-31-16-87: 2017-06-15 20:07:50,892 INFO: toil.jobStores.aws.jobStore: Binding to job store domain 'toil-registry'. ip-172-31-16-87: 2017-06-15 20:07:51,134 INFO: toil.common: Using the mesos batch system I0615 20:07:51.519343 2691 sched.cpp:164] Version: 0.25.0 I0615 20:07:51.521729 2701 sched.cpp:262] New master detected at master@172.31.16.87:5050 I0615 20:07:51.521889 2701 sched.cpp:272] No credentials provided. Attempting to register without authentication I0615 20:07:51.523216 2705 sched.cpp:641] Framework registered with e2c3a9b8-026f-4b6c-9c6b-1b0e60519d93-0004 ip-172-31-16-87: 2017-06-15 20:07:51,556 INFO: toil.common: Written the environment for the jobs to the environment file ip-172-31-16-87: 2017-06-15 20:07:51,556 INFO: toil.common: Caching all jobs in job store ip-172-31-16-87: 2017-06-15 20:07:51,850 INFO: toil.common: 0 jobs downloaded. ip-172-31-16-87: 2017-06-15 20:07:52,055 INFO: toil.realtimeLogger: Real-time logging disabled ip-172-31-16-87: 2017-06-15 20:07:52,056 INFO: toil.leader: (Re)building internal scheduler state ip-172-31-16-87: 2017-06-15 20:07:52,057 INFO: toil.leader: Checked batch system has no running jobs and no updated jobs ip-172-31-16-87: 2017-06-15 20:07:52,057 INFO: toil.leader: Found 1 jobs to start and 0 jobs with successors to run ip-172-31-16-87: 2017-06-15 20:07:52,060 INFO: toil.leader: Starting the main loop ip-172-31-16-87: 2017-06-15 20:07:57,315 INFO: toil.batchSystems.mesos.batchSystem: Preparing to launch Mesos task 0 using offer e2c3a9b8-026f-4b6c-9c6b-1b0e60519d93-O24 ... ip-172-31-16-87: 2017-06-15 20:07:57,315 INFO: toil.batchSystems.mesos.batchSystem: Launched Mesos task 0. ip-172-31-16-87: 2017-06-15 20:07:57,316 INFO: toil.batchSystems.mesos.batchSystem: Although there are queued jobs, none of them could be run with theoffers extended to the framework. Enable debug logging to see details about tasks queued and offers made. ip-172-31-16-87: 2017-06-15 20:08:00,848 WARNING: toil.leader: The jobWrapper seems to have left a log file, indicating failure: 6bbe5e4a-dac4-431c-b902-c1745c2aba98 ip-172-31-16-87: 2017-06-15 20:08:00,863 WARNING: toil.leader: Reporting file: 6bbe5e4a-dac4-431c-b902-c1745c2aba98 ip-172-31-16-87: 2017-06-15 20:08:00,863 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: ---TOIL WORKER OUTPUT LOG--- ip-172-31-16-87: 2017-06-15 20:08:00,863 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: WARNING:toil.resource:The localize() method should only be invoked on a worker. ip-172-31-16-87: 2017-06-15 20:08:00,863 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: WARNING:toil.resource:The localize() method should only be invoked on a worker. ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: Traceback (most recent call last): ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: File "/usr/local/lib/python2.7/dist-packages/toil/worker.py", line 330, in main ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: job._runner(jobWrapper=jobWrapper, jobStore=jobStore, fileStore=fileStore) ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: File "/usr/local/lib/python2.7/dist-packages/toil/job.py", line 1043, in _runner ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: returnValues = self._run(jobWrapper, fileStore) ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: File "/usr/local/lib/python2.7/dist-packages/toil/job.py", line 991, in _run ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: return self.run(fileStore) ip-172-31-16-87: 2017-06-15 20:08:00,864 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: File "/usr/local/lib/python2.7/dist-packages/toil/job.py", line 1136, in run ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs) ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: File "/home/mesosbox/sort.py", line 33, in setup ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: AssertionError ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: Exiting the worker because of a failed jobWrapper on host ip-172-31-24-108 ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: ERROR:toil.worker:Exiting the worker because of a failed jobWrapper on host ip-172-31-24-108 ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: WARNING:toil.jobWrapper:Due to failure we are reducing the remaining retry count of job 6bbe5e4a-dac4-431c-b902-c1745c2aba98 to 0 ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: 6bbe5e4a-dac4-431c-b902-c1745c2aba98: WARNING:toil.jobWrapper:We have increased the default memory of the failed job to 2147483648 bytes ip-172-31-16-87: 2017-06-15 20:08:00,865 WARNING: toil.leader: Job: 6bbe5e4a-dac4-431c-b902-c1745c2aba98 is completely failed ip-172-31-16-87: 2017-06-15 20:08:00,865 INFO: toil.leader: No jobs left to run so exiting. ip-172-31-16-87: 2017-06-15 20:08:00,865 INFO: toil.leader: Finished the main loop ip-172-31-16-87: 2017-06-15 20:08:00,865 INFO: toil.leader: Waiting for stats and logging collator process to finish ... ip-172-31-16-87: 2017-06-15 20:08:00,976 INFO: toil.leader: ... finished collating stats and logs. Took 0.110114812851 seconds ip-172-31-16-87: 2017-06-15 20:08:00,976 INFO: toil.leader: Waiting for service manager thread to finish ... ip-172-31-16-87: 2017-06-15 20:08:01,058 INFO: toil.leader: ... finished shutting down the service manager. Took 0.0823500156403 seconds ip-172-31-16-87: 2017-06-15 20:08:01,069 INFO: toil.leader: Finished toil run with 1 failed jobs ip-172-31-16-87: 2017-06-15 20:08:01,069 INFO: toil.leader: Failed jobs at end of the run: set(['6bbe5e4a-dac4-431c-b902-c1745c2aba98']) ip-172-31-16-87: 2017-06-15 20:08:01,069 INFO: toil.batchSystems.mesos.batchSystem: Stopping Mesos driver I0615 20:08:01.069613 2691 sched.cpp:1771] Asked to stop the driver I0615 20:08:01.069730 2700 sched.cpp:1040] Stopping framework 'e2c3a9b8-026f-4b6c-9c6b-1b0e60519d93-0004' ip-172-31-16-87: 2017-06-15 20:08:01,069 INFO: toil.batchSystems.mesos.batchSystem: Joining Mesos driver ip-172-31-16-87: 2017-06-15 20:08:01,070 INFO: toil.batchSystems.mesos.batchSystem: Joined Mesos driver Traceback (most recent call last): File "sort.py", line 189, in
main()
File "sort.py", line 186, in main
memory=sortMemory), options)
File "/usr/local/lib/python2.7/dist-packages/toil/job.py", line 500, in startToil
return toil.start(job)
File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 562, in start
return self._runMainLoop(job)
File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 791, in _runMainLoop
jobCache=self._jobCache)
File "/usr/local/lib/python2.7/dist-packages/toil/leader.py", line 712, in mainLoop
raise FailedJobsException( config.jobStore, len(toilState.totalFailedJobs) )
toil.leader.FailedJobsException: The job store 'aws:us-west-2:my-jobstore' contains 1 failed jobs