Closed dfayzur closed 5 years ago
Please consult the troubleshooting guide, particularly this section: https://github.com/Azure/batch-shipyard/blob/master/docs/96-troubleshooting-guide.md#compute-node-enters-start_task_failed-state
It would be helpful to understand the contents of the cascade.log files for the nodes with start task failed.
Closing due to no response. Please re-open if needed.
Problem Description
I have been using Azure Batch Shipyard with VMs of type STANDARD_NV6. Usually, I create a pool and submit one job per node; each node picks up a new job after finishing the previous one. I normally create 14 nodes every day, but some of those 14 nodes always go into the `starttaskfailed` state as soon as the pool starts. The rest of the nodes run just fine. Is anyone else seeing the same issue? Can someone help me out with this?
We started having this issue after fixing and rebuilding Shipyard according to https://github.com/Azure/batch-shipyard/issues/291 . We did not notice the current problem before that issue.
I looked at the start task execution details on both failed and running nodes and found:
Failed node: exit code 1
Running node: exit code 0
So I looked at the `stderr.txt` and `stdout.txt` files from the start task on a running node and a failed node. The `stderr.txt` contents seem to be the same on both running and failed nodes, but the `stdout.txt` files differ.

Output of `stdout.txt` on the failed node:
.....................
Login Succeeded
2019-07-30T06:17:27UTC - INFO - Docker registry logins completed.
2019-07-30T06:17:27UTC - WARNING - No Singularity registry servers found.

The failed node's output stops here, with no error message.

Output of `stdout.txt` on the running node:
.....................
Login Succeeded
2019-07-30T06:17:27UTC - INFO - Docker registry logins completed.
2019-07-30T06:17:27UTC - WARNING - No Singularity registry servers found.
2019-07-30T06:17:27UTC - INFO - Docker registry logins completed.
2019-07-30T06:17:27UTC - WARNING - No Singularity registry servers found.
2019-07-30T06:20:32,401316707+00:00 - DEBUG - Cascade exited successfully
2019-07-30T06:20:32,404615013+00:00 - DEBUG - Block for Docker images: xxxxxx.azurecr.io/xxxxxxx:latest
2019-07-30T06:20:32,405445315+00:00 - DEBUG - Block for Singularity images:
2019-07-30T06:20:32,406299216+00:00 - INFO - blocking until Docker images ready: xxxxxx.azurecr.io/xxxxxxx:latest
2019-07-30T06:20:32,446556691+00:00 - INFO - all Docker images present
2019-07-30T06:20:32,448488994+00:00 - INFO - Prep completed
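
To make the comparison between the two `stdout.txt` files repeatable across nodes, here is a minimal Python sketch that reports the first message a running node logged that a failed node never reached. It is only an illustration: the helper names `messages` and `first_missing_step` are hypothetical and not part of Batch Shipyard, and the timestamp pattern is inferred from the two formats in the logs above. Timestamps are stripped because they will differ from node to node.

```python
import re

# Leading timestamp of a log line, e.g. "2019-07-30T06:17:27UTC - " or
# "2019-07-30T06:20:32,401316707+00:00 - " (pattern inferred from the logs above).
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\S*\s+-\s+")


def messages(log_text):
    """Return non-blank log lines with any leading timestamp removed."""
    result = []
    for line in log_text.splitlines():
        line = line.strip()
        if line:
            result.append(TIMESTAMP.sub("", line))
    return result


def first_missing_step(failed_log, running_log):
    """Return the first message in the running log that the failed log lacks.

    Returns None if every message from the running node also appears
    in the failed node's log.
    """
    seen = set(messages(failed_log))
    for message in messages(running_log):
        if message not in seen:
            return message
    return None


if __name__ == "__main__":
    # Excerpts copied from the two stdout.txt outputs above.
    failed = """Login Succeeded
2019-07-30T06:17:27UTC - INFO - Docker registry logins completed.
2019-07-30T06:17:27UTC - WARNING - No Singularity registry servers found.
"""
    running = failed + """2019-07-30T06:17:27UTC - INFO - Docker registry logins completed.
2019-07-30T06:17:27UTC - WARNING - No Singularity registry servers found.
2019-07-30T06:20:32,401316707+00:00 - DEBUG - Cascade exited successfully
2019-07-30T06:20:32,448488994+00:00 - INFO - Prep completed
"""
    print(first_missing_step(failed, running))
    # prints: DEBUG - Cascade exited successfully
```

With the excerpts above, the first missing message is `Cascade exited successfully`. That suggests, though it does not prove, that the failed node stopped during the cascade step, which would be consistent with the request to look at the `cascade.log` files.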