Azure / AKS-Edge

Welcome to the Azure Kubernetes Service (AKS) Edge repo.
MIT License

[Bug]/[Question] Windows node fails to start - svclb stuck in pending - 401 error when pulling svclb image #103

Closed: nadavsinai-philips closed this issue 1 year ago

nadavsinai-philips commented 1 year ago

Describe scenario

I tried a fresh AKS Edge install (SingleNode), then added the Windows node as per the instructions here. I then tried to add a Windows deployment and found that it is stuck in ContainerCreating. With further debugging I found that, although the node is "Ready", the svclb pod in kube-system is also stuck in ImagePullBackOff status due to the following error:

Failed to pull image "aksiotdevacr.azurecr.io/rancher/klipper-lb:v0.3.5": rpc error: code = Unknown desc = failed to pull and unpack image "aksiotdevacr.azurecr.io/rancher/klipper-lb:v0.3.5": failed to resolve reference "aksiotdevacr.azurecr.io/rancher/klipper-lb:v0.3.5": failed to authorize: failed to fetch anonymous token: unexpected status: 401 Unauthorized

I have no Azure Arc subscription, and the licensing I am thinking of using (if this POC proves relevant) is volume licensing, since our product will be installed in air-gapped environments without internet connectivity.

Question

Is the Windows node feature of AKS Edge relevant without an Arc subscription?

fcabrera23 commented 1 year ago

Hi @nadavsinai-philips,

Thanks for reaching out. We were able to repro your issue with a specific configuration: K3s + Windows node + ServiceIpRangeSize = 0. To fix it, modify your deployment and increase the ServiceIpRangeSize (e.g., to 10) so that IP addresses are available to assign to your LoadBalancer services.

We will also fix the failed pull in our updates. If no ServiceIpRangeSize is defined, we should not try to assign a service IP and then fail to deploy the service correctly.

Thanks, Francisco
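As a rough illustration of the suggested change, the sketch below raises the service IP range size in a deployment configuration that has already been loaded as JSON. It assumes the setting is stored under a key named like ServiceIpRangeSize somewhere in the file; because the exact casing and location are not stated in this thread, the hypothetical helper `set_service_ip_range_size` matches the key case-insensitively at any depth, and the sample layout is invented purely for demonstration.

```python
import json


def set_service_ip_range_size(config, size, key_name="ServiceIPRangeSize"):
    """Set every key matching key_name (case-insensitive) to size.

    Walks nested dicts and lists and returns the number of keys updated.
    The key name and its location in the file are assumptions; check them
    against your own deployment configuration.
    """
    target = key_name.lower()
    updated = 0
    if isinstance(config, dict):
        for key, value in list(config.items()):
            if key.lower() == target:
                config[key] = size
                updated += 1
            else:
                updated += set_service_ip_range_size(value, size, key_name)
    elif isinstance(config, list):
        for item in config:
            updated += set_service_ip_range_size(item, size, key_name)
    return updated


# Hypothetical config layout, for demonstration only.
example = {
    "Init": {"ServiceIPRangeSize": 0},
    "Machines": [{"LinuxNode": {"CpuCount": 4}}],
}
changed = set_service_ip_range_size(example, 10)
print(changed, json.dumps(example["Init"]))
```

A return value of 0 means no matching key was found, in which case the setting would need to be added by hand in the place the deployment schema expects it.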

nadavsinai-philips commented 1 year ago

Hi @fcabrera23, thanks for your reply. I recreated the cluster with ServiceIpRangeSize set to 10. The cluster is created, but again I could not run a Windows workload. The problem is now:


```
FailedCreatePodSandBox  35s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8fd3f4ea4b02a5d429e34cf1bf8210ae1b6a8643e5477cc2211b3b2aeb454ca9": plugin type="flannel" failed (add): failed to delegate add: error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
```

fcabrera23 commented 1 year ago

Hi @nadavsinai-philips,

Did you wait ~10-20 minutes? It reports an error at first, but then retries, downloads the container image (which takes some time), and schedules the pod correctly. Could you please confirm?
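Rather than checking by hand during that window, the wait can be scripted. The following is a minimal sketch, assuming `kubectl` is on the PATH and configured for the cluster, and that the load balancer pods carry an `svclb-` name prefix in kube-system. The function names, the 20-minute timeout, and the 30-second interval are illustrative choices, not part of AKS Edge.

```python
import json
import subprocess
import time


def fetch_pods(namespace="kube-system"):
    """Return the pods in a namespace as parsed JSON objects, via kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)["items"]


def pending_pods(pods, prefix="svclb-"):
    """Names of pods with the given prefix whose phase is not Running."""
    return [
        p["metadata"]["name"]
        for p in pods
        if p["metadata"]["name"].startswith(prefix)
        and p.get("status", {}).get("phase") != "Running"
    ]


def wait_for_pods(fetch=fetch_pods, timeout_s=1200, interval_s=30,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until no matching pod is pending, or until the timeout expires.

    Returns True once every matching pod is Running, False on timeout.
    Note: also returns True if no matching pod exists at all.
    """
    deadline = clock() + timeout_s
    while True:
        if not pending_pods(fetch()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_for_pods()` blocks until the svclb pods reach the Running phase or the timeout passes. The `fetch`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a live cluster.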

nadavsinai-philips commented 1 year ago

Thanks! Indeed, if I wait long enough the cluster recovers and manages to create the svclb for the Windows node. Perhaps the docs should be updated to reflect how long this can take.