common-workflow-language / workflow-service

Implementation of the GA4GH Workflow Execution Service, a REST service for running workflows
Apache License 2.0
36 stars 21 forks source link

Incorrect exit code / status for CWL run with Toil backend #58

Open jaeddy opened 6 years ago

jaeddy commented 6 years ago

When running the md5sum.cwl workflow using Toil, the job seems to hang indefinitely in "RUNNING" state. Looking at the run log, the workflow appears to have completed successfully, but the exit code is -1.

Note: this occurred after installing from master, not my fork/branch used in #57.

$ wes-server --backend=wes_service.toil_wes
$ wes-client --host=localhost:8080 --proto=http --attachments="testdata/dockstore-tool-md5sum.cwl,testdata/md5sum.input" testdata/md5sum.cwl testdata/md5sum.cwl.json
INFO:root:Workflow run id is 1851f4f73ffb444ba0f33eefc9145c34

(eventually had to keyboard interrupt out of this step...)

$ wes-client --host=localhost:8080 --proto=http --get 1851f4f73ffb444ba0f33eefc9145c34
{
    "run_id": "1851f4f73ffb444ba0f33eefc9145c34",
    "workflow_log": {
        "stdout": "",
        "start_time": "1536268000.05",
        "cmd": [
            "['toil-cwl-runner', '--outdir=/Users/jaeddy/workflows/1851f4f73ffb444ba0f33eefc9145c34/outdir', '--jobStore=file:/Users/jaeddy/workflows/1851f4f73ffb444ba0f33eefc9145c34/toiljobstore', '/var/folders/xt/2mpzwr2972g3_yqgjz5dbsg40000gq/T/tmpF3dDPU/wes_workflow.cwl', '/var/folders/xt/2mpzwr2972g3_yqgjz5dbsg40000gq/T/tmpF3dDPU/wes_input.json']"
        ],
        "exit_code": -1,
        "end_time": "1536268000.06",
        "stderr": "C02SW7RPG8WN 2018-09-06 14:06:41,145 MainThread INFO toil.lib.bioio: Root logger is at level 'INFO', 'toil' logger at level 'INFO'.\nC02SW7RPG8WN 2018-09-06 14:06:41,188 MainThread INFO toil.jobStores.abstractJobStore: The workflow ID is: '2a94ac86-23b0-4dc1-8645-c7cc24a2664e'\nResolved '/var/folders/xt/2mpzwr2972g3_yqgjz5dbsg40000gq/T/tmpF3dDPU/wes_workflow.cwl' to 'file:///var/folders/xt/2mpzwr2972g3_yqgjz5dbsg40000gq/T/tmpF3dDPU/wes_workflow.cwl'\nC02SW7RPG8WN 2018-09-06 14:06:41,191 MainThread INFO cwltool: Resolved '/var/folders/xt/2mpzwr2972g3_yqgjz5dbsg40000gq/T/tmpF3dDPU/wes_workflow.cwl' to 'file:///var/folders/xt/2mpzwr2972g3_yqgjz5dbsg40000gq/T/tmpF3dDPU/wes_workflow.cwl'\nC02SW7RPG8WN 2018-09-06 14:06:42,525 MainThread INFO toil.common: Using the single machine batch system\nC02SW7RPG8WN 2018-09-06 14:06:42,525 MainThread WARNING toil.batchSystems.singleMachine: Limiting maxCores to CPU count of system (8).\nC02SW7RPG8WN 2018-09-06 14:06:42,525 MainThread WARNING toil.batchSystems.singleMachine: Limiting maxMemory to physically available memory (17179869184).\nC02SW7RPG8WN 2018-09-06 14:06:42,538 MainThread INFO toil.common: Created the workflow directory at /var/folders/xt/2mpzwr2972g3_yqgjz5dbsg40000gq/T/toil-2a94ac86-23b0-4dc1-8645-c7cc24a2664e-83779605114901\nC02SW7RPG8WN 2018-09-06 14:06:42,539 MainThread WARNING toil.batchSystems.singleMachine: Limiting maxDisk to physically available disk (25154871296).\nC02SW7RPG8WN 2018-09-06 14:06:42,550 MainThread INFO toil.common: User script ModuleDescriptor(dirPath='/Users/jaeddy/anaconda/envs/synorchestrator/lib/python2.7/site-packages', name='toil.cwl.cwltoil', fromVirtualEnv=False) belongs to Toil. No need to auto-deploy it.\nC02SW7RPG8WN 2018-09-06 14:06:42,550 MainThread INFO toil.common: No user script to auto-deploy.\nC02SW7RPG8WN 2018-09-06 14:06:42,551 MainThread INFO toil.common: Written the environment for the jobs to the environment file\nC02SW7RPG8WN 2018-09-06 14:06:42,551 MainThread INFO toil.common: Caching all jobs in job store\nC02SW7RPG8WN 2018-09-06 14:06:42,551 MainThread INFO toil.common: 0 jobs downloaded.\nC02SW7RPG8WN 2018-09-06 14:06:42,573 MainThread INFO toil: Running Toil version 3.15.0-0e3a87e738f5e0e7cff64bfdad337d592bd92704.\nC02SW7RPG8WN 2018-09-06 14:06:42,573 MainThread INFO toil.realtimeLogger: Real-time logging disabled\nC02SW7RPG8WN 2018-09-06 14:06:42,590 MainThread INFO toil.toilState: (Re)building internal scheduler state\nC02SW7RPG8WN 2018-09-06 14:06:42,590 MainThread INFO toil.leader: Found 1 jobs to start and 0 jobs with successors to run\nC02SW7RPG8WN 2018-09-06 14:06:42,591 MainThread INFO toil.leader: Checked batch system has no running jobs and no updated jobs\nC02SW7RPG8WN 2018-09-06 14:06:42,591 MainThread INFO toil.leader: Starting the main loop\nC02SW7RPG8WN 2018-09-06 14:06:45,407 MainThread INFO toil.leader: Finished the main loop: no jobs left to run\nC02SW7RPG8WN 2018-09-06 14:06:45,408 MainThread INFO toil.serviceManager: Waiting for service manager thread to finish ...\nC02SW7RPG8WN 2018-09-06 14:06:45,602 MainThread INFO toil.serviceManager: ... finished shutting down the service manager. Took 0.193918943405 seconds\nC02SW7RPG8WN 2018-09-06 14:06:45,603 MainThread INFO toil.statsAndLogging: Waiting for stats and logging collator thread to finish ...\nC02SW7RPG8WN 2018-09-06 14:06:45,622 MainThread INFO toil.statsAndLogging: ... finished collating stats and logs. Took 0.0194990634918 seconds\nC02SW7RPG8WN 2018-09-06 14:06:45,622 MainThread INFO toil.leader: Finished toil run successfully\nC02SW7RPG8WN 2018-09-06 14:06:45,641 MainThread INFO toil.common: Successfully deleted the job store: <toil.jobStores.fileJobStore.FileJobStore object at 0x10aa34b10>\n"
    },
    "outputs": {},
    "request": {
        "workflow_url": "file:///var/folders/xt/2mpzwr2972g3_yqgjz5dbsg40000gq/T/tmpF3dDPU/md5sum.cwl",
        "workflow_attachment": "file:///var/folders/xt/2mpzwr2972g3_yqgjz5dbsg40000gq/T/tmpF3dDPU",
        "workflow_params": {
            "input_file": {
                "path": "md5sum.input",
                "class": "File"
            }
        },
        "workflow_type_version": "v1.0",
        "workflow_type": "CWL"
    },
    "state": "RUNNING",
    "task_logs": []
}
DailyDreaming commented 6 years ago

Most of my testing was done with WDL. CWL often doesn't print the relevant "success" line in its logs that status is checking for: https://github.com/common-workflow-language/workflow-service/blob/master/wes_service/toil_wes.py#L263

This was written and merged to address this: https://github.com/DataBiosphere/toil/pull/2339

Integrating the above toil status check into the toil backend should fix things.