broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Scattered workflow finished without error, but the output file is empty in the bucket #4006

Open rexwangcc opened 6 years ago

rexwangcc commented 6 years ago

Cromwell Version: "cromwell": "33-e90c4de"

Problem: The workflow has scattered tasks; a few of the shards finished without any errors, but when we look at the actual results of the task, we only see files of 0 B size in the bucket.

Example workflow: workflow 42e173c6-7fc3-4a3e-93c7-c9d95836f6a5 in https://cromwell.mint-dev.broadinstitute.org/, specifically, the task: call-sc/shard-98/SmartSeq2SingleCell/b4ac422c-e5b1-42ed-8dcf-cca51394e08c/call-RSEMExpression, shard-98

@jishuxu has run into this issue several times.

jishuxu commented 6 years ago

It doesn't have to be in a scatter. Basically, workflows or tasks finished successfully, but something unexpected happened (no error message in any kind of log file) while Cromwell was copying outputs into the bucket. In the end, Cromwell did not actually copy the outputs but created empty files in the bucket.

ruchim commented 6 years ago

@jishuxu @rexwangcc -- have you updated your Cromwell version recently?

ruchim commented 6 years ago

@jishuxu @rexwangcc I've asked other users about this behavior and haven't heard anything similar. If you've managed to reproduce this, please let me know. Otherwise I may close this issue in a few weeks. Thanks!

rexwangcc commented 6 years ago

@ruchim I don't have much detailed information about this issue, but I know that @jishuxu ran into this problem many times from Cromwell v31 to now (CaaS-dev); not sure if she can reproduce it or point you to some other workflows.

dshiga commented 6 years ago

A bit more info on this. The job mentioned above ran out of disk space. The monitoring.log is full of "out of space" errors. However, the job ran to completion and the output directory has an rc file containing 0, so Cromwell considered it a success. But the output files were truncated to zero bytes, presumably due to the disk-space issue. Normally we get a hard failure when we run out of disk space, but not in this case for some reason.
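For readers unfamiliar with the mechanism: Cromwell wraps the WDL command in a generated script that writes the command's exit status to that `rc` file, and it is that file alone that decides Succeeded vs Failed. A minimal sketch of the idea (simplified; the real generated script does much more, and `run_task` and the path are illustrative placeholders):

```bash
#!/bin/bash
# Simplified sketch of the script Cromwell generates around a task's command.
# The real script also handles stdout/stderr redirection, output delocalization,
# etc.; run_task and the path below are illustrative placeholders.

run_task                      # the user's WDL command block runs here
echo $? > /cromwell_root/rc   # Cromwell reads this file afterwards:
                              # 0 => call Succeeded, non-zero => call Failed
```

This is why a tool that exits 0 after silently truncating its output still counts as a success.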

ruchim commented 6 years ago

Thanks for investigating! The reason this task wasn't marked as failed is that the tool exits with a 0 return code -- and hence Cromwell marks it as a success. The only way Cromwell knows that the command for a job failed is the value of the return code. Is there any chance this tool can return a non-zero exit code when running out of disk?
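If the tool itself can't be changed, one workaround is to harden the WDL command block so a truncated output forces a non-zero return code. A minimal sketch, assuming the output file is named `results.tsv` and `run_rsem_tool` stands in for the real RSEM invocation (both hypothetical):

```bash
set -euo pipefail   # any failing command or unset variable aborts with a non-zero rc

run_rsem_tool --out results.tsv   # hypothetical stand-in for the real RSEM command

# A full disk can let the tool exit 0 while leaving a 0-byte file behind;
# [ -s ] is false for empty files, so this turns silent truncation into a failure.
if [ ! -s results.tsv ]; then
  echo "ERROR: results.tsv is empty, possibly out of disk space" >&2
  exit 1
fi
```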

rexwangcc commented 6 years ago

@jishuxu

rexwangcc commented 6 years ago

@ruchim I'm not sure about the details. We have a monitoring script (https://github.com/HumanCellAtlas/pipeline-tools/blob/c6c11a20c91aa360fcd7ca7c28de14b281cabd7b/adapter_pipelines/ss2_single_sample/options.json#L2) running via workflow options alongside the actual RSEM tool; it monitors the disk space and outputs:

/cromwell_root/monitoring.sh: line 15: echo: write error: No space left on device
/cromwell_root/monitoring.sh: line 17: echo: write error: No space left on device
/cromwell_root/monitoring.sh: line 19: echo: write error: No space left on device
/cromwell_root/monitoring.sh: line 13: echo: write error: No space left on device
/cromwell_root/monitoring.sh: line 15: echo: write error: No space left on device
/cromwell_root/monitoring.sh: line 17: echo: write error: No space left on device

but no exit codes. Do you think it's possible to add some error handling to that bash script to let Cromwell know about the out-of-space error at runtime? Even if that's practical, it may still not be as safe as an exit code thrown by the actual tool, so let's wait for @jishuxu's response.
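For illustration, error handling along these lines could work (a sketch only, under the assumptions noted in the comments; the real pipeline-tools monitoring.sh works differently):

```bash
#!/bin/bash
# Hypothetical hardening of a monitoring script: kill the task when the disk
# is nearly full, so the job fails loudly instead of writing truncated output.
# MAIN_PID and the 95% threshold are assumptions for illustration.

MAIN_PID="$1"   # PID of the main task process, supplied by whoever launches this

while sleep 10; do
  # percentage of /cromwell_root in use, digits only (e.g. "97")
  used=$(df --output=pcent /cromwell_root | tail -n 1 | tr -dc '0-9')
  if [ "$used" -ge 95 ]; then
    echo "ERROR: /cromwell_root is ${used}% full; terminating task" >&2
    kill "$MAIN_PID"   # the tool dies mid-run, so the rc file ends up non-zero
    exit 1
  fi
done
```

Even with something like this in place, a non-zero exit code from the tool itself would remain the more reliable signal, as noted above.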