Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure

Nodes in pool in state `starttaskfailed` #295

Closed dfayzur closed 5 years ago

dfayzur commented 5 years ago

Problem Description

I have been using Azure Batch Shipyard with VMs of type STANDARD_NV6. I usually create a pool and submit one job per node, and each node picks up a new job after finishing the previous one. I normally create 14 nodes every day. However, some of the 14 nodes always go into the starttaskfailed state as soon as the pool starts. The rest of the nodes run fine.

I am checking whether anyone else is having the same issue. Can someone help me with it?

We started seeing this issue after resolving https://github.com/Azure/batch-shipyard/issues/291 and rebuilding shipyard as described there. We did not notice the problem before that.


I looked at the start task execution on both failed and running nodes and found:

Failed node: exit code 1

Running node: exit code 0

I then compared the start task's stderr.txt and stdout.txt files on a running node and a failed node.

The stderr.txt contents appear to be the same on both running and failed nodes.

However, the stdout.txt files differ between the running node and the failed node:

Output of stdout.txt on the failed node:

.....................
Login Succeeded
2019-07-30T06:17:27UTC - INFO - Docker registry logins completed.
2019-07-30T06:17:27UTC - WARNING - No Singularity registry servers found.

The failed node's output stops here, with no error message.

Output of stdout.txt on the running node:

.....................
Login Succeeded
2019-07-30T06:17:27UTC - INFO - Docker registry logins completed.
2019-07-30T06:17:27UTC - WARNING - No Singularity registry servers found.
2019-07-30T06:17:27UTC - INFO - Docker registry logins completed.
2019-07-30T06:17:27UTC - WARNING - No Singularity registry servers found.
2019-07-30T06:20:32,401316707+00:00 - DEBUG - Cascade exited successfully
2019-07-30T06:20:32,404615013+00:00 - DEBUG - Block for Docker images: xxxxxx.azurecr.io/xxxxxxx:latest
2019-07-30T06:20:32,405445315+00:00 - DEBUG - Block for Singularity images:
2019-07-30T06:20:32,406299216+00:00 - INFO - blocking until Docker images ready: xxxxxx.azurecr.io/xxxxxxx:latest
2019-07-30T06:20:32,446556691+00:00 - INFO - all Docker images present
2019-07-30T06:20:32,448488994+00:00 - INFO - Prep completed
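To make the comparison of the two stdout.txt files repeatable across nodes, a small script along these lines could help. It is only a sketch: the function names `strip_timestamp` and `missing_steps` are made up for illustration and are not part of Batch Shipyard, and the sample logs are abbreviated excerpts of the output shown above. It drops the leading timestamps, skips the lines both logs share, and returns what only the running node went on to write.

```python
import re
from typing import List

# Leading timestamp as seen in the logs above, e.g.
# "2019-07-30T06:17:27UTC - " or "2019-07-30T06:20:32,401316707+00:00 - ".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\S+\s+-\s+")


def strip_timestamp(line: str) -> str:
    """Remove the leading timestamp so lines from different nodes compare equal."""
    return TIMESTAMP.sub("", line.strip())


def missing_steps(failed_log: str, running_log: str) -> List[str]:
    """Return the running node's messages that follow the shared leading lines."""
    failed = [strip_timestamp(l) for l in failed_log.splitlines() if l.strip()]
    running = [strip_timestamp(l) for l in running_log.splitlines() if l.strip()]
    # Walk the common prefix of both logs.
    i = 0
    while i < min(len(failed), len(running)) and failed[i] == running[i]:
        i += 1
    return running[i:]


# Abbreviated excerpts of the stdout.txt output quoted in this post.
FAILED = """
Login Succeeded
2019-07-30T06:17:27UTC - INFO - Docker registry logins completed.
2019-07-30T06:17:27UTC - WARNING - No Singularity registry servers found.
"""

RUNNING = """
Login Succeeded
2019-07-30T06:17:27UTC - INFO - Docker registry logins completed.
2019-07-30T06:17:27UTC - WARNING - No Singularity registry servers found.
2019-07-30T06:17:27UTC - INFO - Docker registry logins completed.
2019-07-30T06:17:27UTC - WARNING - No Singularity registry servers found.
2019-07-30T06:20:32,401316707+00:00 - DEBUG - Cascade exited successfully
2019-07-30T06:20:32,448488994+00:00 - INFO - Prep completed
"""

if __name__ == "__main__":
    for message in missing_steps(FAILED, RUNNING):
        print(message)
```

On these excerpts, the first line that appears only on the running node and is not a repeat of the registry login messages is "Cascade exited successfully", which suggests the failed node stopped before that step finished.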

alfpark commented 5 years ago

Please consult the troubleshooting guide, particularly this section: https://github.com/Azure/batch-shipyard/blob/master/docs/96-troubleshooting-guide.md#compute-node-enters-start_task_failed-state

It would be helpful to understand the contents of the cascade.log files for the nodes with start task failed.

alfpark commented 5 years ago

Closing due to no response. Please re-open if needed.