DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0
900 stars 240 forks source link

Better error needed if job is not encapsulated by the user #517

Closed arkal closed 7 years ago

arkal commented 9 years ago

If the statically defined child of a job F (that should be encapsulated but isn't) is passed a return value that was obtained from a dynamically defined child of F, then and EOFError is thrown. @hannes believes the error should rather be an unfulfilled promise error.

Toy code below reproduces error.

#!/usr/bin/env python2.7
'''
Author : Arjun Arkal Rao
Affiliation : UCSC BME, UCSC Genomics Institute
File : toy_code_for_unfulfilled_promise_error.py
'''
from __future__ import print_function
from toil.job import Job

import argparse
import time
import os

def sbe_job(job):
    '''
    This is a job that Should Be Encapsulated (SBE) for the logic to work.
    It will spawn 3 children, A, B and C that will each sleep for 5 seconds
    after which they will return their jobname.  The promises are store in dicts
    with the key=jobname and value=PromisedJobReturnValue
    '''
    return_dict = {}
    for jobname in ['A', 'B', 'C']:
        return_dict[jobname] = job.addChildJobFn(sleepy, jobname).rv()
    return return_dict

def sleepy(job, return_val):
    '''
    This function will sleep for 5 seconds and then return the value it was
    passed.
    '''
    time.sleep(5)
    return return_val

def dopey(job, test_val):
    '''
    This is a completely useless function.  It's only task is to be a child job
    of SBE_JOB.
    '''
    return None

def test_unfulfilled_promise_error_fail():
    '''
    This will run the code that fails. F should be encapsulated but it isn't.
    The DAG looks like
             F
        _____|_____ 
       /    /  \   \
      A     B   C   G
    since A, B and C are all sibling jobs and share the same parent as G.

    G is being passed the return values of A, B and C through F.rv() and since
    A, B and C sleep for 5 seconds, G starts running before they start and gets
    an EOFError when it whould really be returning an unfulfilled promise error.
    '''
    parser = argparse.ArgumentParser()
    parser.add_argument('--dummy', dest='dummy', default=None, required=False)
    Job.Runner.addToilOptions(parser)
    params = parser.parse_args()
    F = Job.wrapJobFn(sbe_job)
    G = Job.wrapJobFn(dopey, F.rv())
    F.addChild(G)
    Job.Runner.startToil(F, params)

def test_unfulfilled_promise_error_pass():
    '''
    This will run the code that passes. F should be encapsulated and is.
    The DAG looks like
          _____________
         |      A      |
         |   ___|___   |
         |  /   |   \  |
         | B    C    D |
         |_____________|
                |
                X

    G is being passed the return values of A, B and C through F.rv().  This time
    A, B and C sleep finish running before G starts running since they are
    encapsulated within A -- Hence no error is thrown.
    '''
    parser = argparse.ArgumentParser()
    parser.add_argument('--dummy', dest='dummy', default=None, required=False)
    Job.Runner.addToilOptions(parser)
    params = parser.parse_args()
    F = Job.wrapJobFn(sbe_job).encapsulate()
    G = Job.wrapJobFn(dopey, F.rv())
    F.addChild(G)
    Job.Runner.startToil(F, params)

if __name__ == '__main__':
    test_unfulfilled_promise_error_fail()  # This will FAIL
    test_unfulfilled_promise_error_pass()  # This will PASS
hannes-ucsc commented 9 years ago

The EOFError occurs during unpickling of a job and it is completely baffling. @arkal, if you happen to have a stack trace, can you paste it here?

The problem might be more wide-spread than what the title suggests. @jpfeil ran into this, too, in a slightly different scenario. The general condition is that promises are wired in a way that is inconsistent with the job wiring, leading to a failure of promise resolution because the job producing the value for a promise hasn't been run yet.

arkal commented 9 years ago

@hannes, here's the stack trace

[centos@arjun-fake-podk toil_tests]$ python toy_code_for_unfulfilled_promise_error.py /tmp/toy
No handlers could be found for logger "toil.resource"
INFO:toil.lib.bioio:Logging set at level: INFO
INFO:toil.common:Using the single machine batch system
INFO:toil.jobStores.fileJobStore:Jobstore directory is: /tmp/toy
WARNING:toil.batchSystems.singleMachine:Limiting maxCores to CPU count of system (8).
INFO:toil.batchSystems.singleMachine:Setting up the thread pool with 80 workers, given a minimum CPU fraction of 0.100000 and a maximum CPU value of 8.
INFO:toil.common:Written the environment for the jobs to the environment file
WARNING:toil.resource:Can't globalize module ModuleDescriptor(dirPath='/data/test_docker/toil_tests', name='toy_code_for_unfulfilled_promise_error', extension='.py').
WARNING:toil.resource:Can't globalize module ModuleDescriptor(dirPath='/data/test_docker/toil_tests', name='toy_code_for_unfulfilled_promise_error', extension='.py').
INFO:toil.leader:Checked batch system has no running jobs and no updated jobs
INFO:toil.leader:Found 1 jobs to start and 0 jobs with successors to run
INFO:toil.leader:Starting the main loop
INFO:toil.batchSystems.singleMachine:Executing command: '/usr/local/bin/python -E /usr/local/lib/python2.7/site-packages/toil/worker.py /tmp/toy s/M/jobFyhHec'.
INFO:toil.batchSystems.singleMachine:Executing command: '/usr/local/bin/python -E /usr/local/lib/python2.7/site-packages/toil/worker.py /tmp/toy v/n/job1CCeE5'.
INFO:toil.batchSystems.singleMachine:Executing command: '/usr/local/bin/python -E /usr/local/lib/python2.7/site-packages/toil/worker.py /tmp/toy c/6/jobLkkyWy'.
INFO:toil.batchSystems.singleMachine:Executing command: '/usr/local/bin/python -E /usr/local/lib/python2.7/site-packages/toil/worker.py /tmp/toy O/0/jobMUjEkT'.
INFO:toil.batchSystems.singleMachine:Executing command: '/usr/local/bin/python -E /usr/local/lib/python2.7/site-packages/toil/worker.py /tmp/toy c/i/jobxGSHT4'.
WARNING:toil.leader:The jobWrapper seems to have left a log file, indicating failure: c/i/jobxGSHT4
WARNING:toil.leader:Reporting file: c/i/jobxGSHT4
WARNING:toil.leader:c/i/jobxGSHT4:  ---TOIL WORKER OUTPUT LOG---
WARNING:toil.leader:c/i/jobxGSHT4:  WARNING:toil.resource:Can't find resource for leader path '/data/test_docker/toil_tests/toy_code_for_unfulfilled_promise_error.py'
WARNING:toil.leader:c/i/jobxGSHT4:  WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/data/test_docker/toil_tests', name='toy_code_for_unfulfilled_promise_error', extension='.py')
WARNING:toil.leader:c/i/jobxGSHT4:  INFO:toil.jobStores.fileJobStore:Jobstore directory is: /tmp/toy
WARNING:toil.leader:c/i/jobxGSHT4:  Traceback (most recent call last):
WARNING:toil.leader:c/i/jobxGSHT4:    File "/usr/local/lib/python2.7/site-packages/toil/worker.py", line 272, in main
WARNING:toil.leader:c/i/jobxGSHT4:      job = Job._loadJob(jobWrapper.command, jobStore)
WARNING:toil.leader:c/i/jobxGSHT4:    File "/usr/local/lib/python2.7/site-packages/toil/job.py", line 835, in _loadJob
WARNING:toil.leader:c/i/jobxGSHT4:      return cls._unpickle(userModule, fileHandle)
WARNING:toil.leader:c/i/jobxGSHT4:    File "/usr/local/lib/python2.7/site-packages/toil/job.py", line 856, in _unpickle
WARNING:toil.leader:c/i/jobxGSHT4:      return unpickler.load()
WARNING:toil.leader:c/i/jobxGSHT4:  EOFError
WARNING:toil.leader:c/i/jobxGSHT4:  Exiting the worker because of a failed jobWrapper on host arjun-fake-podk.novalocal
WARNING:toil.leader:c/i/jobxGSHT4:  ERROR:__main__:Exiting the worker because of a failed jobWrapper on host arjun-fake-podk.novalocal
WARNING:toil.leader:c/i/jobxGSHT4:  WARNING:toil.jobWrapper:Due to failure we are reducing the remaining retry count of job c/i/jobxGSHT4 to 0
WARNING:toil.leader:Job: c/i/jobxGSHT4 is completely failed
INFO:toil.leader:Only failed jobs and their dependents (1 total) are remaining, so exiting.
INFO:toil.leader:Finished the main loop
INFO:toil.leader:Waiting for stats and logging collator process to finish
INFO:toil.leader:Stats/logging finished collating in 0.261636972427 seconds
Traceback (most recent call last):
  File "toy_code_for_unfulfilled_promise_error.py", line 95, in <module>
    test_unfulfilled_promise_error_fail()  # This will FAIL
  File "toy_code_for_unfulfilled_promise_error.py", line 65, in test_unfulfilled_promise_error_fail
    Job.Runner.startToil(F, params)
  File "/usr/local/lib/python2.7/site-packages/toil/job.py", line 372, in startToil
    return mainLoop(config, batchSystem, jobStore, rootJob)
  File "/usr/local/lib/python2.7/site-packages/toil/leader.py", line 505, in mainLoop
    raise FailedJobsException( config.jobStore, totalFailedJobs )
toil.leader.FailedJobsException: The job store '/tmp/toy' contains 1 failed jobs
hannes-ucsc commented 9 years ago

Promises are resolved by unplickling them during the unpickling of a job. IOW, nested unpickling. Hence the confusing error message.

In today's scrum it was decided that we split this up: this issue will be resolved by making a quick and dirty try-except and raising an exception with a more appropriate message. Benedict thinks that this can be caught even earlier by "adding a few more edges". That should be done in a separate issue which the resolver of this ticket shall create upon its resolution.

hannes-ucsc commented 8 years ago

@arkal to triage.