inferenceRouterHA doesn't seem to enable deploying the cluster to fewer than three nodes

Mandur commented 1 year ago

Hello,

We are trying to deploy AzureML to an Arc-enabled cluster of 3 nodes, 2 with GPUs and one without. We would like training to run only on the nodes with GPUs, so our understanding based on the docs is that we need a nodeSelector (set to 'nodeSelector.nvidia\.com/gpu\.present=true'), effectively making our cluster a 2-node cluster.

When we set the value 'inferenceRouterHA' to false, the AzureML deployment is still failing. We see a pod called healthcheck failing with the following log (full pod logs attached):

Status: Failed Name: clusterresource ErrorCode: E40002 ErrorMessage: Insufficient healthy node

(We also tried setting another variable, 'inferenceLoadbalancerHA', to false, as suggested by some earlier documentation and bug reports, with the same outcome.)

We also tried updating the Azure CLI and the AML k8s extension to their latest versions. When we describe the healthcheck pod, we cannot see this setting among its environment variables (attached).

healthcheck_pod_logs.txt.log

healthcheck.yaml.txt

zetiaatgithub commented 1 year ago

Sorry for the inconvenience. Please try setting the config "clusterPurpose=DevTest" to mitigate the issue.
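
For readers landing on this thread, here is a minimal sketch of how that setting might be passed when installing the extension through the Azure CLI. It only builds and prints the command rather than running it. The function name and the extension, cluster, and resource group names are hypothetical placeholders, and it assumes the setting is supplied via the `--config` option of `az k8s-extension create`; any other settings your deployment needs would go into the same dictionary.

```python
import shlex
from typing import Dict, List


def build_extension_command(
    extension_name: str,
    cluster_name: str,
    resource_group: str,
    config: Dict[str, str],
) -> List[str]:
    """Build the argv list for installing the AzureML extension on an Arc cluster.

    All names passed in are placeholders chosen for illustration.
    """
    cmd = [
        "az", "k8s-extension", "create",
        "--name", extension_name,
        "--extension-type", "Microsoft.AzureML.Kubernetes",
        "--cluster-type", "connectedClusters",
        "--cluster-name", cluster_name,
        "--resource-group", resource_group,
        "--scope", "cluster",
        "--config",
    ]
    # Each config entry is passed as a key=value token after --config.
    cmd += [f"{key}={value}" for key, value in config.items()]
    return cmd


# Hypothetical names; replace with your own.
command = build_extension_command(
    extension_name="my-azureml-extension",
    cluster_name="my-arc-cluster",
    resource_group="my-resource-group",
    config={"clusterPurpose": "DevTest"},
)

# Print a shell-quoted version for review before running it manually.
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so a value containing backslashes (such as the escaped nodeSelector key mentioned above) is not reinterpreted by a shell; `shlex.join` quotes such values when printing.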

Mandur commented 1 year ago

Thank you for the fast reply, we will try it as soon as we get cluster access again!

Mandur commented 1 year ago

Thank you @zetiaatgithub, it resolved the issue!

Azure / AML-Kubernetes

inferenceRouterHA doesn't seem to enable deploying the cluster to fewer than three nodes #271