Closed: blairdrummond closed this issue 2 years ago
@blairdrummond Do you have details on the exact errors? That will help trace it:
a) Which namespace(s)?
b) Which cluster(s)?
c) Protected B or Unclassified?
d) How many days was the issue experienced?
e) Exact time the error happened?
There have been many reports, but here's one:
cluster: aaw-prod-cc-00
namespace: nicholas-denis
classification: unclassified
time-to-connect: >10 minutes
The notebook name was classimbalance.
The user started the notebook a bit before 10:55 am. I tested and was unable to connect, though I could exec -it into the running pod. I think they recreated the pod without my knowing while I was debugging the logs. I've had similar reports for a few days, though the errors are usually transient.
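For reference, a minimal sketch of that check (the pod name below is illustrative; notebook pods typically follow the <notebook-name>-0 convention):

# Pod name is illustrative, not confirmed
kubectl -n nicholas-denis exec -it classimbalance-0 -- /bin/bash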
I'd be curious to know if the pod triggered a node scale-up; it's possible that network policies take a while to fully reconcile on a new node (we saw that when we first implemented them). I tested a couple of Notebooks and had no issues accessing them right away. I can try with bigger resources next week in the hope that it triggers a scale-up, and see if I can reproduce the issue.
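A sketch of how to check for a scale-up (TriggeredScaleUp is the event reason the cluster-autoscaler emits, as seen later in this thread):

# Look for cluster-autoscaler scale-up events across all namespaces
kubectl get events -A --field-selector reason=TriggeredScaleUp
# Watch nodes join and become Ready
kubectl get nodes -w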
That's a really good idea, and it would be consistent with the fact that users reported this often on GPUs.
@sylus @zachomedia, Nicholas Denis' notebook classimbalance-final on Prod, a GPU notebook, has been READY for an hour but still cannot be connected to due to a network issue. This is in the nicholas-denis namespace.
Might be worth debugging to see if this is still the slow reconciliation of the network policies.
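A starting point for that debugging (a sketch; the pod name is illustrative):

# List the network policies that should apply in the namespace
kubectl get networkpolicies -n nicholas-denis
# Inspect the notebook pod's events for anything unusual
kubectl describe pod classimbalance-final-0 -n nicholas-denis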
@blairdrummond can you add @zachomedia as a contributor so he can check?
It's likely our network policies issue, but I can check.
This definitely looks like network policies. Unfortunately the case is with Microsoft.
From a conversation with Blair, there may be a related network policy issue with the GPU servers, since launching one creates a new pod for each GPU. The issue was reported by Nick Denis, and its source still needs to be identified.
Need more time to look into the GPU pod logs to determine the source of the issue. Will run another test this week to capture errors and/or behaviour.
Stan Hatko (11:30 AM): Starting a GPU node is not working for me. I created two GPU nodes, one (tmp1) with a non-persistent workspace and the other (test3) with a persistent workspace. A screenshot is attached. After half an hour, test3 has still not started. The tmp1 node started successfully, but I could not connect to it; I get the "connection failure" error in the second screenshot.
CC @sylus
Created a GPU notebook, which triggered a scale-up automatically:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 13m default-scheduler 0/73 nodes are available: 64 Insufficient cpu, 72 Insufficient memory, 73 Insufficient nvidia.com/gpu.
Warning FailedScheduling 13m default-scheduler 0/73 nodes are available: 64 Insufficient cpu, 72 Insufficient memory, 73 Insufficient nvidia.com/gpu.
Warning FailedScheduling 8m6s default-scheduler 0/74 nodes are available: 64 Insufficient cpu, 72 Insufficient memory, 74 Insufficient nvidia.com/gpu.
Normal Scheduled 7m46s default-scheduler Successfully assigned zachary-seguin/gpu-test-0 to aks-usergpuuc-17130176-vmss00003o
Normal TriggeredScaleUp 13m cluster-autoscaler pod triggered scale-up: [{aks-usergpuuc-17130176-vmss 5->6 (max: 10)}]
Normal SuccessfulAttachVolume 7m36s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-50f8ae6b-6c54-4eb3-9e85-1cea16e21e89"
Normal Pulling 7m26s kubelet Pulling image "docker.io/istio/proxyv2:1.5.10"
Normal Pulled 7m20s kubelet Successfully pulled image "docker.io/istio/proxyv2:1.5.10" in 5.819904983s
Normal Created 7m16s kubelet Created container istio-validation
Normal Started 7m16s kubelet Started container istio-validation
Normal Pulling 7m15s kubelet Pulling image "k8scc01covidacr.azurecr.io/jupyterlab-pytorch:1b5c3b61"
Normal Pulled 81s kubelet Successfully pulled image "k8scc01covidacr.azurecr.io/jupyterlab-pytorch:1b5c3b61" in 5m54.548143379s
Normal Created 78s kubelet Created container gpu-test
Normal Started 77s kubelet Started container gpu-test
Normal Pulled 77s kubelet Container image "docker.io/istio/proxyv2:1.5.10" already present on machine
Normal Created 76s kubelet Created container istio-proxy
Normal Started 76s kubelet Started container istio-proxy
Normal Pulling 76s kubelet Pulling image "vault:1.7.2"
Normal Pulled 72s kubelet Successfully pulled image "vault:1.7.2" in 4.579876535s
Normal Created 68s kubelet Created container vault-agent
Normal Started 68s kubelet Started container vault-agent
The node took approximately 6 minutes to become ready (that is probably about as good as we'll get).
The disk mount took about 1 minute, then the image pull started.
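For anyone reproducing this, an event listing like the one above can be pulled from the pod description (the pod and namespace names are taken from the Scheduled event above):

kubectl describe pod gpu-test-0 -n zachary-seguin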
For informational purposes, there are currently 7748 network policies in the cluster:
kubectl get networkpolicies -A --no-headers | wc -l
7748
And there are currently 2035 pods:
kubectl get pods -A --no-headers | wc -l
2035
Looking at the network policies on the new node, after about 30 minutes just over half of the iptables rules are in place:
iptables-save | wc -l
18561
Looking at an existing node, we see:
iptables-save | wc -l
29589
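For the record, a sketch of how these counts can be taken on a node (assumes kubectl debug node access with sufficient privileges to read iptables; the node name is taken from the events above):

# Start a debug pod on the node, then count rules from the host filesystem
kubectl debug node/aks-usergpuuc-17130176-vmss00003o -it --image=ubuntu
chroot /host
iptables-save | wc -l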
The azure-npm pod has been maxed out on its CPU the whole time, so this delay is likely caused by azure-npm being heavily throttled during the initial reconciliation and application of the network policy rules on the node.
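One way to observe the throttling (a sketch; assumes metrics-server is installed and the pods carry the usual k8s-app=azure-npm label):

# CPU usage of the azure-npm pods, to compare against their limit
kubectl top pod -n kube-system -l k8s-app=azure-npm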
And now, about 2 hours later, the notebook works, so it just takes time.
Blocked by a ticket with Microsoft at the moment. Keep us posted, @sylus!
New limits were introduced by MS and have since greatly improved our network policy generation on server creation.
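To confirm the new limits on a cluster (a sketch; assumes the daemonset is named azure-npm in kube-system):

kubectl -n kube-system get daemonset azure-npm -o jsonpath='{.spec.template.spec.containers[0].resources}'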
We're getting very frequent reports that users cannot connect to their new notebooks and that they get the typical Istio block messages.
It usually seems to resolve itself eventually, but I think we need to figure something out here. I'm nervous that this might be an Envoy performance issue because of the size of the cluster, but hopefully I'm wrong.
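One way to help rule Envoy in or out (a sketch; assumes istioctl is installed and matches the cluster's Istio version):

# Shows whether each sidecar's config is SYNCED or STALE with the control plane
istioctl proxy-status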
CC @brendangadd @sylus