Azure / azurehpc

This repository provides easy automation scripts for building an HPC environment in Azure. It also includes examples to build an end-to-end environment and run some of the key HPC benchmarks and applications.
MIT License

NVIDIA #20210102.1 Pipeline Failure #441

Open xpillons opened 3 years ago

xpillons commented 3 years ago

Manually reran the pipeline. Gen2 passed. Gen1 failed with this error:

Resource : gpumaster - OSProvisioningTimedOut
Message : OS Provisioning for VM 'gpumaster' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. For details on how to check current provisioning state of Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and Linux VMs, refer to https://aka.ms/LinuxVMLifecycle. None

Allocating NV12s_v3 is taking too long.

garvct commented 3 years ago

@xpillons, I got a similar failure today running the NVIDIA pipeline:
https://azurecat.visualstudio.com/hpccat/_build/results?buildId=10563&view=logs&j=40a7dfaa-edcf-57d7-da50-33204f1e0241&t=eef1fa0f-de1b-545a-8af2-256fc8a5c4c1&l=280

The time difference between "build install scripts" and the rsync error was only 2 seconds. The error is a connection refused. I believe we already check that sshd is running before trying to connect, but this does not fix the problem. If there is no quick fix for this (e.g. some additional flag), then maybe it would be worth the time to re-architect this (e.g. replace rsync with something else?). This type of error is occurring too often.

xpillons commented 3 years ago

@edwardsp can you have a look and check why the prsync is failing? I can see in the code that ssh is tested upfront, but I'm not 100% sure about the sequence. Otherwise, maybe we should add a retry in the rsync Python wrapper function.
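
For illustration, the retry idea could look something like the sketch below. This is not code from azurehpc: the function name `run_with_retry` and its parameters are hypothetical, and it assumes the wrapper shells out to rsync through `subprocess.run` and that a non-zero exit code (such as a refused connection) is safe to retry.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=5, delay=10.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits 0 or the attempts run out.

    Hypothetical helper, not part of azurehpc. Returns the exit code
    of the last attempt so the caller can decide how to fail.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    returncode = None
    for attempt in range(1, attempts + 1):
        returncode = runner(cmd).returncode
        if returncode == 0:
            break
        # Pause between attempts, but not after the final one.
        if attempt < attempts:
            sleep(delay)
    return returncode
```

A caller would pass the full rsync command line as a list, for example `run_with_retry(["rsync", "-a", src, dest])`, where `src` and `dest` are placeholders. A fixed delay is the simplest choice; a growing delay may suit a VM that is still booting.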

edwardsp commented 3 years ago

ssh isn't tested before the initial rsync so I have just added a PR to add a test for ssh.
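
A test for ssh before the first rsync might be sketched as below. This is not the code from the PR mentioned above: `wait_for_port` is a hypothetical helper, and it assumes that an accepted TCP connection on port 22 is a good enough signal. An open port does not guarantee that sshd is ready to authenticate, so a retry on the rsync itself may still be useful.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=300.0, interval=5.0):
    """Poll until host accepts a TCP connection on port.

    Hypothetical helper, not part of azurehpc. Returns True once a
    connection succeeds, or False if timeout seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Each probe is bounded by the polling interval.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers connection refused, unreachable host and timeouts.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

The caller would run `wait_for_port(hostname)` and start the rsync only when it returns True.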