Azure / hpcpack

The repo to track public issues for Microsoft HPC Pack product.
MIT License

Node State and Node Health stuck #41

Open dzionito opened 5 months ago

dzionito commented 5 months ago

Problem Description

We are using an auto start/stop PS script that powers the AWS EC2 instances on or off when they are or aren't needed. Periodically, some nodes get stuck with Node State "Starting" or "Draining" and Node Health "Transitional". The script can no longer start or stop these nodes, and they have to be removed from the cluster and re-added. For example, the node highlighted in the screenshots was powered on but cannot be powered off by the script, because it shows Starting / Transitional. As there are no active jobs at the moment, this node is wasting our money.

Steps to Reproduce

Power a compute node in the cluster on and off.

Expected Results

Node State "Online" and Node Health "OK" when the node is running. Node State "Offline" and Node Health "Error" when the node is powered off.
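
To make the expected mapping concrete, here is a minimal sketch in Python of the check we would like to hold for every node. It is only an illustration of the logic, not our actual start/stop script, and it does not call any HPC Pack or AWS API. The `NodeObservation` fields, the `powered_on` flag, the `state_since` timestamp, and the 15-minute grace period are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# States in which the nodes described in this issue get stuck.
TRANSITIONAL_STATES = {"starting", "draining"}


@dataclass
class NodeObservation:
    """One snapshot of a node (hypothetical structure, for illustration only)."""
    name: str
    powered_on: bool       # assumed: EC2 power state as seen by the script
    node_state: str        # e.g. "Online", "Offline", "Starting", "Draining"
    node_health: str       # e.g. "OK", "Error", "Transitional"
    state_since: datetime  # assumed: when this state was first observed


def matches_expected(obs: NodeObservation) -> bool:
    """True if the node shows the state/health pair expected for its power state."""
    state = obs.node_state.lower()
    health = obs.node_health.lower()
    if obs.powered_on:
        return state == "online" and health == "ok"
    return state == "offline" and health == "error"


def is_stuck(obs: NodeObservation, now: datetime,
             grace: timedelta = timedelta(minutes=15)) -> bool:
    """True if the node has stayed in a transitional state longer than `grace`."""
    in_transition = (obs.node_state.lower() in TRANSITIONAL_STATES
                     or obs.node_health.lower() == "transitional")
    return in_transition and (now - obs.state_since) > grace
```

With this check, a node that is powered on but has shown Starting / Transitional for two hours is reported as stuck, while a node that only entered Starting five minutes ago is still within the grace period.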

Actual Results

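[Screenshot: node list with the affected node highlighted, showing Node State "Starting" and Node Health "Transitional"]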

YutongSun commented 1 month ago

Which version of HPC Pack is this? What's the auto start/stop script?