Azure / AKS-Edge

Welcome to the Azure Kubernetes Service (AKS) Edge repo.

[BUG] Unable to enable Azure Arc Monitoring due to image pull issues from aksiotdevacr.azurecr.io ACR #156

Closed · gshiva closed this issue 6 months ago

gshiva commented 8 months ago

Describe the bug

I enabled Azure Arc Monitoring integration via the Azure Portal. I see hundreds of pods in various error states. I recreated a pod definition and launched it to debug the issue. It is failing because it is unable to pull the image from aksiotdevacr.azurecr.io.

To Reproduce

Steps to reproduce the behavior:

  1. Download test-aks-pull.json (a sketch of such a manifest is shown after these steps)
  2. Run kubectl apply -f .\test-aks-pull.json
  3. Run kubectl get events
  4. See error
3s          Warning   FailedCreatePodSandBox   pod/resource-sync-agent-test-aks-pull   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "aksiotdevacr.azurecr.io/pause:3.9": failed to pull image "aksiotdevacr.azurecr.io/pause:3.9": failed to pull and unpack image "aksiotdevacr.azurecr.io/pause:3.9": failed to resolve reference "aksiotdevacr.azurecr.io/pause:3.9": failed to authorize: failed to fetch anonymous token: unexpected status: 401 Unauthorized
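For reference, the actual test-aks-pull.json attachment is not reproduced in this thread; a minimal pod manifest along the following lines should hit the same failure, since the error concerns the sandbox (pause) image that containerd pulls for every pod rather than the pod's own container image. The pod name is taken from the event above, while the container name and image are illustrative placeholders.

{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "name": "resource-sync-agent-test-aks-pull"
  },
  "spec": {
    "restartPolicy": "Never",
    "containers": [
      {
        "name": "pull-test",
        "image": "aksiotdevacr.azurecr.io/pause:3.9"
      }
    ]
  }
}

Applying this with kubectl apply -f .\test-aks-pull.json and then running kubectl get events should surface the same FailedCreatePodSandBox warning on an affected node.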

Expected behavior

The pod should launch without any errors.

Environment (please complete the following information):

Additional context

I see the same error in multiple azure-arc pods; the pause image is just one example. The machine is behind a corporate proxy, and I am not sure whether that is a factor.

PS C:\Aks_Edge_Essentials> Test-AksEdgeArcConnection
[11/27/2023 17:56:23] Exception Caught!!!

 - Could not run kubectl on Windows node - node may not be reachable or cluster may be in bad state. Error was: ssh  failed to execute [Error from server (Forbidden): namespaces is forbidden: User "system:node:win-cvqbhhj1265-wedge" cannot list resource "namespaces" in API group "" at the cluster scope] (AksEdge.psm1: line 9019)
False
gshiva commented 7 months ago

I recreated the cluster and it is working now. One difference is that I configured monitoring as soon as I created the cluster instead of waiting several days. Another change is that I gave the Linux node 8GB of memory instead of the default 4GB. I am not sure whether either of those was the cause.

I will close this in a week if there is no response from the MS team.

Vicent8899 commented 6 months ago

Issues pulling images from aksiotdevacr.azurecr.io can indicate resource pressure on the cluster, especially if the Linux node's resources are thin.

We believe it can be reproduced by allocating the default memory and storage (4GB and 10GB respectively) to the Linux node and then turning on Azure Arc monitoring. With monitoring enabled, we see 18GB of disk space and 5.4GB of memory in use, well beyond those defaults.

Suggestions:

  - Allocate more memory and storage to the Linux node than the defaults before enabling Azure Arc monitoring.

Once the resource constraints are removed, we no longer see errors pulling images from aksiotdevacr.azurecr.io.
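For context, the Linux node's memory and data disk sizes are set at deployment time in the AKS Edge Essentials configuration JSON passed to New-AksEdgeDeployment. The fragment below is a sketch of a larger allocation, not an exact configuration: the field names and values reflect our reading of the schema, so verify them against your own aksedge-config.json.

{
  "Machines": [
    {
      "LinuxNode": {
        "CpuCount": 4,
        "MemoryInMB": 8192,
        "DataSizeInGB": 30
      }
    }
  ]
}

Here 8192MB matches the 8GB allocation that worked earlier in this thread, and DataSizeInGB is sized to stay above the roughly 18GB of disk usage observed once monitoring is enabled.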

rcheeran commented 6 months ago

Thanks for the update. Yes, if you need to use Arc and other Arc extensions, the Linux node needs a minimum of 8GB of memory. See this

We are working on reducing this footprint.