Closed arkal closed 7 years ago
The EOFError occurs during unpickling of a job and it is completely baffling. @arkal, if you happen to have a stack trace, can you paste it here?
The problem might be more wide-spread than what the title suggests. @jpfeil ran into this, too, in a slightly different scenario. The general condition is that promises are wired in a way that is inconsistent with the job wiring, leading to a failure of promise resolution because the job producing the value for a promise hasn't been run yet.
@hannes, here's the stack trace
[centos@arjun-fake-podk toil_tests]$ python toy_code_for_unfulfilled_promise_error.py /tmp/toy
No handlers could be found for logger "toil.resource"
INFO:toil.lib.bioio:Logging set at level: INFO
INFO:toil.common:Using the single machine batch system
INFO:toil.jobStores.fileJobStore:Jobstore directory is: /tmp/toy
WARNING:toil.batchSystems.singleMachine:Limiting maxCores to CPU count of system (8).
INFO:toil.batchSystems.singleMachine:Setting up the thread pool with 80 workers, given a minimum CPU fraction of 0.100000 and a maximum CPU value of 8.
INFO:toil.common:Written the environment for the jobs to the environment file
WARNING:toil.resource:Can't globalize module ModuleDescriptor(dirPath='/data/test_docker/toil_tests', name='toy_code_for_unfulfilled_promise_error', extension='.py').
WARNING:toil.resource:Can't globalize module ModuleDescriptor(dirPath='/data/test_docker/toil_tests', name='toy_code_for_unfulfilled_promise_error', extension='.py').
INFO:toil.leader:Checked batch system has no running jobs and no updated jobs
INFO:toil.leader:Found 1 jobs to start and 0 jobs with successors to run
INFO:toil.leader:Starting the main loop
INFO:toil.batchSystems.singleMachine:Executing command: '/usr/local/bin/python -E /usr/local/lib/python2.7/site-packages/toil/worker.py /tmp/toy s/M/jobFyhHec'.
INFO:toil.batchSystems.singleMachine:Executing command: '/usr/local/bin/python -E /usr/local/lib/python2.7/site-packages/toil/worker.py /tmp/toy v/n/job1CCeE5'.
INFO:toil.batchSystems.singleMachine:Executing command: '/usr/local/bin/python -E /usr/local/lib/python2.7/site-packages/toil/worker.py /tmp/toy c/6/jobLkkyWy'.
INFO:toil.batchSystems.singleMachine:Executing command: '/usr/local/bin/python -E /usr/local/lib/python2.7/site-packages/toil/worker.py /tmp/toy O/0/jobMUjEkT'.
INFO:toil.batchSystems.singleMachine:Executing command: '/usr/local/bin/python -E /usr/local/lib/python2.7/site-packages/toil/worker.py /tmp/toy c/i/jobxGSHT4'.
WARNING:toil.leader:The jobWrapper seems to have left a log file, indicating failure: c/i/jobxGSHT4
WARNING:toil.leader:Reporting file: c/i/jobxGSHT4
WARNING:toil.leader:c/i/jobxGSHT4: ---TOIL WORKER OUTPUT LOG---
WARNING:toil.leader:c/i/jobxGSHT4: WARNING:toil.resource:Can't find resource for leader path '/data/test_docker/toil_tests/toy_code_for_unfulfilled_promise_error.py'
WARNING:toil.leader:c/i/jobxGSHT4: WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/data/test_docker/toil_tests', name='toy_code_for_unfulfilled_promise_error', extension='.py')
WARNING:toil.leader:c/i/jobxGSHT4: INFO:toil.jobStores.fileJobStore:Jobstore directory is: /tmp/toy
WARNING:toil.leader:c/i/jobxGSHT4: Traceback (most recent call last):
WARNING:toil.leader:c/i/jobxGSHT4: File "/usr/local/lib/python2.7/site-packages/toil/worker.py", line 272, in main
WARNING:toil.leader:c/i/jobxGSHT4: job = Job._loadJob(jobWrapper.command, jobStore)
WARNING:toil.leader:c/i/jobxGSHT4: File "/usr/local/lib/python2.7/site-packages/toil/job.py", line 835, in _loadJob
WARNING:toil.leader:c/i/jobxGSHT4: return cls._unpickle(userModule, fileHandle)
WARNING:toil.leader:c/i/jobxGSHT4: File "/usr/local/lib/python2.7/site-packages/toil/job.py", line 856, in _unpickle
WARNING:toil.leader:c/i/jobxGSHT4: return unpickler.load()
WARNING:toil.leader:c/i/jobxGSHT4: EOFError
WARNING:toil.leader:c/i/jobxGSHT4: Exiting the worker because of a failed jobWrapper on host arjun-fake-podk.novalocal
WARNING:toil.leader:c/i/jobxGSHT4: ERROR:__main__:Exiting the worker because of a failed jobWrapper on host arjun-fake-podk.novalocal
WARNING:toil.leader:c/i/jobxGSHT4: WARNING:toil.jobWrapper:Due to failure we are reducing the remaining retry count of job c/i/jobxGSHT4 to 0
WARNING:toil.leader:Job: c/i/jobxGSHT4 is completely failed
INFO:toil.leader:Only failed jobs and their dependents (1 total) are remaining, so exiting.
INFO:toil.leader:Finished the main loop
INFO:toil.leader:Waiting for stats and logging collator process to finish
INFO:toil.leader:Stats/logging finished collating in 0.261636972427 seconds
Traceback (most recent call last):
File "toy_code_for_unfulfilled_promise_error.py", line 95, in <module>
test_unfulfilled_promise_error_fail() # This will FAIL
File "toy_code_for_unfulfilled_promise_error.py", line 65, in test_unfulfilled_promise_error_fail
Job.Runner.startToil(F, params)
File "/usr/local/lib/python2.7/site-packages/toil/job.py", line 372, in startToil
return mainLoop(config, batchSystem, jobStore, rootJob)
File "/usr/local/lib/python2.7/site-packages/toil/leader.py", line 505, in mainLoop
raise FailedJobsException( config.jobStore, totalFailedJobs )
toil.leader.FailedJobsException: The job store '/tmp/toy' contains 1 failed jobs
Promises are resolved by unplickling them during the unpickling of a job. IOW, nested unpickling. Hence the confusing error message.
In today's scrum it was decided that we split this up: this issue will be resolved by making a quick and dirty try-except and raising an exception with a more appropriate message. Benedict thinks that this can be caught even earlier by "adding a few more edges". That should be done in a separate issue which the resolver of this ticket shall create upon its resolution.
@arkal to triage.
If the statically defined child of a job F (that should be encapsulated but isn't) is passed a return value that was obtained from a dynamically defined child of F, then and EOFError is thrown. @hannes believes the error should rather be an unfulfilled promise error.
Toy code below reproduces error.