Open rexwangcc opened 6 years ago
It doesn't have to be in a scatter. Basically, the workflows or tasks finished successfully, but something unexpected happened (with no error message in any of the log files) while Cromwell was copying outputs into the bucket. In the end, Cromwell did not actually copy the outputs but created empty files in the bucket.
@jishuxu @rexwangcc -- have you updated your Cromwell version recently?
@jishuxu @rexwangcc I've asked other users about this behavior and haven't heard anything similar. If you've managed to reproduce this, please let me know. Otherwise I may close this issue in a few weeks. Thanks!
@ruchim I don't have much detailed information about this issue, but I know that @jishuxu has run into this problem many times, from Cromwell v31 until now (CaaS-dev). I'm not sure if she can reproduce it or point you to some other workflows.
A bit more info on this. The job mentioned above ran out of disk space. The monitoring.log is full of "out of space" errors. However, the job ran to completion and the output directory has an rc file containing 0, so Cromwell considered it a success. But the output files were truncated to zero bytes, presumably due to the disk space issue. Normally we get a hard failure when we run out of disk space but not in this case for some reason.
Thanks for investigating! The reason this task wasn't marked as failed is that the tool exits with a 0 return code, and hence Cromwell marks it as a success. The only way Cromwell knows that a job's command failed is the value of the return code. Is there any chance this tool can return a non-zero exit code when running out of disk?
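Since the return code is the only failure signal, one possible workaround is to have the task's command check its own outputs and exit non-zero when any of them is missing or empty. The following is only a minimal sketch, assuming the command section runs under bash; `require_nonempty` is a hypothetical helper name, not something provided by Cromwell or RSEM.

```shell
# Hypothetical helper: fail if any expected output is missing or zero bytes.
require_nonempty() {
  local status=0
  local f
  for f in "$@"; do
    # -s is true only when the file exists and has a size greater than zero
    if [ ! -s "$f" ]; then
      echo "missing or empty output: $f" >&2
      status=1
    fi
  done
  return "$status"
}
```

Called after the tool, for example `require_nonempty sample.genes.results || exit 1` (the file name here is illustrative), a truncated output would then produce a non-zero return code instead of a silent success.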
@jishuxu
@ruchim I'm not sure about the details. We have a monitoring script (https://github.com/HumanCellAtlas/pipeline-tools/blob/c6c11a20c91aa360fcd7ca7c28de14b281cabd7b/adapter_pipelines/ss2_single_sample/options.json#L2) running as a workflow option alongside the actual RSEM tool, which monitors the disk space. It outputs:
/cromwell_root/monitoring.sh: line 15: echo: write error: No space left on device
/cromwell_root/monitoring.sh: line 17: echo: write error: No space left on device
/cromwell_root/monitoring.sh: line 19: echo: write error: No space left on device
/cromwell_root/monitoring.sh: line 13: echo: write error: No space left on device
/cromwell_root/monitoring.sh: line 15: echo: write error: No space left on device
/cromwell_root/monitoring.sh: line 17: echo: write error: No space left on device
but no exit codes. Do you think it's possible to add some error handling to that bash script to let Cromwell know about the out-of-space error at runtime? Even if that's practical, it may still not be as safe as an exit code thrown by the actual tool, so let's wait for @jishuxu's response.
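As a rough illustration of what such a disk-space check could look like in bash, here is a sketch under stated assumptions: `available_kb` and `has_min_space` are hypothetical helper names, the threshold is arbitrary, and the parsing assumes the filesystem name in `df` output contains no spaces. Whether a failure in the monitoring script would actually reach Cromwell's job status is not established in this thread, so the check may be more useful inside the task command itself.

```shell
# Hypothetical helper: print the available kilobytes on the filesystem holding $1.
available_kb() {
  # -P forces the POSIX one-line-per-filesystem format, -k reports 1024-byte blocks;
  # the fourth column of the data row is "Available".
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: succeed only when at least $2 KB are free on the filesystem of $1.
has_min_space() {
  local avail
  avail=$(available_kb "$1") || return 2
  [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}
```

For example, `has_min_space /cromwell_root 1048576 || exit 1` would stop with a non-zero code when less than roughly 1 GiB remains (the threshold is an arbitrary choice for illustration).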
Cromwell Version:
"cromwell": "33-e90c4de"
Problem: The workflow has scattered tasks, and a few of the shards finished without any errors, but when looking into the actual results of the task, we can only see files with 0B size.
Example workflow: workflow 42e173c6-7fc3-4a3e-93c7-c9d95836f6a5 in https://cromwell.mint-dev.broadinstitute.org/, specifically the task call-sc/shard-98/SmartSeq2SingleCell/b4ac422c-e5b1-42ed-8dcf-cca51394e08c/call-RSEMExpression, shard-98.
@jishuxu has run into this issue several times.