Closed Mandur closed 1 year ago
Sorry for the inconvenient. Please try to set "clusterPurpose=DevTest" this config to mitigate the issue.
Thank you for the fast reply, we will try it as soon as we get cluster access again!
Thank you @zetiaatgithub, it resolved the issue!
Hello,
We are trying to deploy AzureML to an arc enabled cluster of 3 nodes, 2 with GPUs and one without. We would like the training to only occur on the nodes with GPUs, therefore our understanding based on the doc is that we select need a nodelector (set to nodeSelector.nvidia\.com/gpu\.present=true'), effectively making our cluster a 2 node cluster.
When we set the value 'inferenceRouterHA' to false, the AzureML deployment is still failing. We see a pod called healthcheck failing with the following log (full pod logs attached):
Status: Failed Name: clusterresource ErrorCode: E40002 ErrorMessage: Insufficient healthy node
(We also tried setting another variable 'inferenceLoadbalancerHA' to false as pointed by some previous documentation and bugs with similar outcome)
We tried to update the Azure CLI and the AML k8s extension to latest versions When we describe the healthcheck pod we cannot see this setting as environment variables (attached).
healthcheck_pod_logs.txt.log
healthcheck.yaml.txt