Azure / cyclecloud-hpcpack

CycleCloud project to enable use of the Microsoft HPC Pack job scheduler in Azure CycleCloud HPC clusters.
MIT License

VMs are deallocated and deleted too early #10

Open CamiloTerevinto opened 3 years ago

CamiloTerevinto commented 3 years ago

I've been working on a POC with Azure CycleCloud and HPC Pack 2019. From the head node, the auto-scaling configuration looks like this:

{
  "archivefile": "C:\\cycle\\jetpack\\config\\autoscaler_archive.txt",
  "boot_timeout": 1500,
  "cluster_name": "TEST-HPC",
  "default_resources": [],
  "disable_default_resources": false,
  "idle_timeout": 900,
  "lock_file": "C:\\cycle\\jetpack\\config\\scalelib.lock",
  "password": "*********",
  "statefile": "C:\\cycle\\jetpack\\config\\autoscaler_state.txt",
  "url": "https://172.17.10.4:9443",
  "username": "cyclecloud_access",
  "autoscale": {
    "start_enabled": true,
    "vm_retention_days": 7
  },
  "hpcpack": {
    "hn_hostname": "localhost",
    "pem": "C:\\cycle\\jetpack\\config\\hpc-comm.pem"
  },
  "logging": {
    "config_file": "C:\\cycle\\jetpack\\config\\autoscale_logging.conf"
  },
  "pbspro": {
    "read_only_resources": [
      "host",
      "vnode"
    ]
  }
}
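
For context, my reading of those settings is that an idle node should only be deallocated after idle_timeout (900 seconds, i.e. 15 minutes) and its VM only deleted after vm_retention_days (7 days). The sketch below is purely illustrative Python of the behaviour I expect, not the actual hpcpack-autoscaler code; the node fields (state, boot_started, running_jobs, last_job_finished) are made up for the example:

import time

# Values taken from the autoscale config above.
IDLE_TIMEOUT = 900     # seconds an idle node may sit before deallocation
BOOT_TIMEOUT = 1500    # seconds a node may spend booting before it is abandoned

def is_deallocation_candidate(node, now=None):
    """Rough illustration of idle-timeout handling (hypothetical node dict).

    node keys:
      state             -- "booting" or "ready"
      boot_started      -- epoch seconds when provisioning began
      running_jobs      -- number of HPC Pack jobs currently on the node
      last_job_finished -- epoch seconds when the node last went idle
    """
    now = now if now is not None else time.time()
    if node["state"] == "booting":
        # A node that never finishes booting is only removed after boot_timeout.
        return now - node["boot_started"] > BOOT_TIMEOUT
    if node["running_jobs"] > 0:
        return False
    return now - node["last_job_finished"] > IDLE_TIMEOUT

# Example: a node that went idle 5 minutes ago should NOT yet be removed.
node = {"state": "ready", "boot_started": 0, "running_jobs": 0,
        "last_job_finished": time.time() - 300}
print(is_deallocation_candidate(node))  # expected: False
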

However, when I start a new job, which brings up a new node, this is what I see:

What am I doing wrong, or what could be misconfigured, to cause this?

CamiloTerevinto commented 3 years ago

This has now happened twice: a few hours after the last VM is deleted (again, earlier than it should be), the entire scale set is deleted by the CycleCloud or HPC Pack VM.

CamiloTerevinto commented 3 years ago

Adding more information: