Azure / hpcpack

The repo to track public issues for Microsoft HPC Pack product.
MIT License

Could nodes automatically be set offline if they reboot #45

Open weshinsley opened 5 months ago

weshinsley commented 5 months ago

Feature Request Description

To be clear: these are not HPC Pack problems, but other Windows OS problems that have an impact on our HPC use - and a feature request for HPC Pack that might help us cope with them.

Two problems:-

For various reasons, our cluster is full of Windows 10 Workstation nodes. We are running a fast InfiniBand application network, but there is a race condition when Windows 10 boots up, in which SMB does not always start with RDMA enabled. This is solved by restarting the Workstation, HPCSoaDiagMon and Netlogon services - if we don't do that, then the machine effectively locks up forever if file access over the InfiniBand network is attempted.

The second problem is that we have tried absolutely everything we can think of to prevent Windows Update from automatically rebooting these workstation nodes monthly, and still it does so - sometimes while jobs are still running. In our experience this is the main weakness of Win 10 workstation nodes (licensing/cost issues prevent us from using Win Server as compute nodes).

And the second problem triggers the first: without us getting any notification, Windows Update reboots a workstation node, the node comes back up with RDMA in an inconsistent state, and jobs take forever.

Describe Preferred Solution

We would love it if both of these problems could be fixed upstream - but we don't know how to report them, and don't have much confidence in a timely solution. But there is something simple that might be enough:-

If HPC Pack could spot when nodes have gone into the Error state (and have rebooted), and automatically take them offline so that no jobs run on them when they return, then we could spot when this has happened, check the node, and restart the services if necessary.

Describe Alternatives Considered

Windows Update: we are totally lost for options. We've tried Group Policy for the nodes (which are on a domain) set to the "Notify to download / notify to update" options, and we've tried disabling the automatic updates policy altogether - nothing makes any difference; the nodes clearly reboot and we see the Windows Update logs in Event Viewer. We have no further ideas on that one.

For the recovery, we've tried a Scheduled Task to restart those services - which nearly works, but we found we needed a delay of 1-2 minutes after boot, and the required delay varies. If the task runs too early, restarting the services doesn't activate RDMA on the InfiniBand cards; too late, and HPC Pack might already have launched jobs on that node, which then fail if we restart the services, but lock up forever if we leave them.

YutongSun commented 4 months ago

@weshinsley, I suppose there are multiple ways to disable Windows automatic updates, e.g. through domain or local group policy, or by disabling the Windows Update service. On the HPC Pack side, there is a feature named Node Start Checker that could be used for this scenario. Please see the feature description below.

In certain situations, when a compute node restarts it is preferable to check a condition (for example, that the InfiniBand network IP is ready) before the node reports heartbeats for job allocation. To achieve this, add the following registry keys and edit NodeChecker.cmd under the %CCP_HOME%Bin folder on the compute nodes. The node start checker runs NodeChecker.cmd with a timeout of NodeCheckerTimeout (by default -1, meaning infinite). If the exit code is non-zero or the timeout expires, it reruns the script after NodeCheckerInterval (by default 10 seconds), for NodeCheckerCount runs in total (by default 3). Note that heartbeats will start for the node whether or not the final exit code is zero.

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\HPC]
"NodeCheckerCount"=dword:00000003
"NodeCheckerInterval"=dword:0000000a
"NodeCheckerTimeout"=dword:0000003c
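
To make the retry behaviour described above easier to reason about, here is a rough Python model of it. This is only a sketch of the semantics as described in this thread, not HPC Pack's actual implementation. The function name `run_node_checker` and the `sleep` parameter are hypothetical. It assumes that NodeCheckerCount is the total number of attempts and that the timeout is measured in seconds, neither of which is confirmed here. The dword values in the registry fragment are hexadecimal, so they decode to 3, 10 and 60.

```python
import subprocess
import time

# Values from the registry fragment; REG_DWORD data in a .reg file is hexadecimal.
NODE_CHECKER_COUNT = int("00000003", 16)     # 3
NODE_CHECKER_INTERVAL = int("0000000a", 16)  # 10 seconds between attempts
NODE_CHECKER_TIMEOUT = int("0000003c", 16)   # 60 (assumed to be seconds)


def run_node_checker(command,
                     count=NODE_CHECKER_COUNT,
                     interval=NODE_CHECKER_INTERVAL,
                     timeout=NODE_CHECKER_TIMEOUT,
                     sleep=time.sleep):
    """Hypothetical model: return True if any attempt exits with code 0."""
    # A negative timeout (the documented default of -1) means "wait forever".
    limit = None if timeout < 0 else timeout
    for attempt in range(1, count + 1):
        try:
            if subprocess.run(command, timeout=limit).returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            pass  # a timeout is treated like a failed attempt
        if attempt < count:
            sleep(interval)  # wait before the next attempt
    return False
```

Under these assumptions, a checker script that always fails delays the first heartbeat by at most about 3 x 60 + 2 x 10 = 200 seconds. Because heartbeats start regardless of the final exit code, the return value in this model is informational only: the checker can delay job allocation and run recovery steps such as restarting services, but a failing check does not by itself keep the node offline.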