StatCan / aaw

Documentation for the Advanced Analytics Workspace Platform
https://statcan.github.io/aaw/

Network errors while waiting for network policies on new VM creation #680

Closed by blairdrummond 2 years ago

blairdrummond commented 3 years ago

We're getting very frequent reports that users cannot connect to their new notebooks and instead see the typical Istio block messages.

It usually seems to resolve itself eventually, but I think we need to figure something out here. I'm nervous that this might be an Envoy performance issue due to the size of the cluster, but hopefully I'm wrong.
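
For reference, a quick way to see whether traffic is being rejected at the sidecar is to check the istio-proxy container logs on the affected notebook pod (pod and namespace names below are placeholders):

# Tail the Envoy sidecar logs of the affected notebook pod
kubectl logs <notebook-pod> -n <user-namespace> -c istio-proxy --tail=100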

CC @brendangadd @sylus

zachomedia commented 3 years ago

@blairdrummond Do you have details on the exact errors? That will help trace it.

sylus commented 3 years ago

a) Which namespace(s)?
b) Which cluster(s)?
c) Protected B or Unclassified?
d) For how many days has the issue been experienced?
e) Exact time the error happened?

blairdrummond commented 3 years ago

There have been many reports, but here's one:

cluster: aaw-prod-cc-00
namespace: nicholas-denis
classification: unclassified
time-to-connect: >10 minutes

The notebook name was classimbalance

The user started the notebook a bit before 10:55 am. I tested and was unable to connect, though I could exec -it into the running pod. I think they recreated the pod without my knowing while I was debugging the logs. I've had similar reports for a few days, though the errors are usually transient.
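
For context, a rough sketch of the checks involved, assuming the usual Kubeflow convention that the notebook classimbalance runs as a StatefulSet pod named classimbalance-0:

# List the notebook pods in the user's namespace
kubectl -n nicholas-denis get pods | grep classimbalance

# Exec into the running pod even though the UI cannot reach it
# (pod name assumes the <notebook-name>-0 StatefulSet naming)
kubectl -n nicholas-denis exec -it classimbalance-0 -- /bin/bash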

[screenshot of the connection error attached]

zachomedia commented 3 years ago

I'd be curious to know if the pod triggered a node scale-up; it's possible that network policies take a while to fully reconcile on a new node (we saw that when we first implemented them). I tested a couple of Notebooks and had no issues accessing them right away. I can try with bigger resources next week in the hope that it triggers a scale-up, and see if I can reproduce the issue.
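
One way to check the scale-up hypothesis, sketched here with placeholder pod and namespace names, is to look for a TriggeredScaleUp event on the affected pod or cluster-wide:

# Check whether scheduling the pod triggered the cluster autoscaler
kubectl describe pod <notebook-pod> -n <namespace> | grep -A 2 TriggeredScaleUp

# Or list all recent scale-up events across the cluster
kubectl get events -A --field-selector reason=TriggeredScaleUp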

blairdrummond commented 3 years ago

That's a really good idea, and it would be consistent with the fact that users reported this often on GPU notebooks.

blairdrummond commented 3 years ago

@sylus @zachomedia , Nicholas Denis' notebook classimbalance-final on Prod, a GPU notebook, has been READY for an hour, but still cannot be connected to due to a network issue. This is in the nicholas-denis namespace.

It might be worth debugging to see if this is still slow reconciliation of the network policies.
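
A minimal sketch of that debugging, assuming the azure-npm DaemonSet in kube-system handles the network policy reconciliation (the label selector below is an assumption):

# Confirm the namespace's network policies exist
kubectl get networkpolicies -n nicholas-denis

# Find the node hosting the notebook pod
kubectl -n nicholas-denis get pods -o wide | grep classimbalance-final

# Check the azure-npm pod on that node for ongoing reconciliation
kubectl get pods -n kube-system -l k8s-app=azure-npm -o wide
kubectl logs -n kube-system <azure-npm-pod-on-that-node> --tail=100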

sylus commented 3 years ago

@blairdrummond can you add @zachomedia as a contributor so he can check?

It's likely our network policies issue, but he can check.

zachomedia commented 3 years ago

This definitely looks like network policies. Unfortunately, the support case is with Microsoft.

chuckbelisle commented 2 years ago

From a conversation with Blair, there may be a related network policy issue with the GPU servers, since launching one creates a new pod for each GPU. The issue was reported by Nick Denis and its source still needs to be identified.

chuckbelisle commented 2 years ago

Need more time to look into the GPU pod logs to determine the source of the issue. Will run another test this week to capture errors and/or behaviour.

blairdrummond commented 2 years ago

Stan Hatko, 11:30 AM: Starting a GPU node is not working for me. I created two GPU nodes, one (tmp1) with a non-persistent workspace and the other (test3) with a persistent workspace. A screenshot is attached. After half an hour, test3 has still not started. The tmp1 node started successfully, but I could not connect to it; I get the "connection failure" error in the second screenshot.

[two screenshots attached: the notebook servers and the "connection failure" error]

CC @sylus

zachomedia commented 2 years ago

Created a GPU notebook, which triggered a scale-up automatically:

Events:
  Type     Reason                  Age    From                     Message
  ----     ------                  ----   ----                     -------
  Warning  FailedScheduling        13m    default-scheduler        0/73 nodes are available: 64 Insufficient cpu, 72 Insufficient memory, 73 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling        13m    default-scheduler        0/73 nodes are available: 64 Insufficient cpu, 72 Insufficient memory, 73 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling        8m6s   default-scheduler        0/74 nodes are available: 64 Insufficient cpu, 72 Insufficient memory, 74 Insufficient nvidia.com/gpu.
  Normal   Scheduled               7m46s  default-scheduler        Successfully assigned zachary-seguin/gpu-test-0 to aks-usergpuuc-17130176-vmss00003o
  Normal   TriggeredScaleUp        13m    cluster-autoscaler       pod triggered scale-up: [{aks-usergpuuc-17130176-vmss 5->6 (max: 10)}]
  Normal   SuccessfulAttachVolume  7m36s  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-50f8ae6b-6c54-4eb3-9e85-1cea16e21e89"
  Normal   Pulling                 7m26s  kubelet                  Pulling image "docker.io/istio/proxyv2:1.5.10"
  Normal   Pulled                  7m20s  kubelet                  Successfully pulled image "docker.io/istio/proxyv2:1.5.10" in 5.819904983s
  Normal   Created                 7m16s  kubelet                  Created container istio-validation
  Normal   Started                 7m16s  kubelet                  Started container istio-validation
  Normal   Pulling                 7m15s  kubelet                  Pulling image "k8scc01covidacr.azurecr.io/jupyterlab-pytorch:1b5c3b61"
  Normal   Pulled                  81s    kubelet                  Successfully pulled image "k8scc01covidacr.azurecr.io/jupyterlab-pytorch:1b5c3b61" in 5m54.548143379s
  Normal   Created                 78s    kubelet                  Created container gpu-test
  Normal   Started                 77s    kubelet                  Started container gpu-test
  Normal   Pulled                  77s    kubelet                  Container image "docker.io/istio/proxyv2:1.5.10" already present on machine
  Normal   Created                 76s    kubelet                  Created container istio-proxy
  Normal   Started                 76s    kubelet                  Started container istio-proxy
  Normal   Pulling                 76s    kubelet                  Pulling image "vault:1.7.2"
  Normal   Pulled                  72s    kubelet                  Successfully pulled image "vault:1.7.2" in 4.579876535s
  Normal   Created                 68s    kubelet                  Created container vault-agent
  Normal   Started                 68s    kubelet                  Started container vault-agent
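
For reference, the event stream above can be pulled with a describe on the pod named in the events:

kubectl -n zachary-seguin describe pod gpu-test-0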

The node took approximately 6 minutes to become ready (that is probably about as good as we'll get).

The disk mount took about 1 minute, then the image pull started.


For information, there are currently 7748 network policies in the cluster:

kubectl get networkpolicies -A --no-headers | wc -l
7748

And there are currently 2035 pods:

kubectl get pods -A --no-headers | wc -l
2035

Looking at the iptables rules on the new node, after about 30 minutes just over half are in place:

iptables-save | wc -l
18561

Looking at an existing node, we see:

iptables-save | wc -l
29589

The azure-npm pod has been maxed out on its CPU the whole time, so this delay is likely caused by azure-npm being heavily throttled during the initial reconciliation and application of the network policy rules on the node.
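
A rough way to see that saturation, assuming the DaemonSet is labelled k8s-app=azure-npm as on a typical AKS cluster:

# CPU usage of the azure-npm pods, highest first
kubectl top pods -n kube-system -l k8s-app=azure-npm --sort-by=cpu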

zachomedia commented 2 years ago

And now about 2 hours later the notebook works, so it just takes time.

blairdrummond commented 2 years ago

Blocked by ticket with Microsoft at the moment. Keep us posted @sylus !

chuckbelisle commented 2 years ago

New limits were introduced by Microsoft and have since greatly improved our network policy generation on server creation.
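
Assuming the new limits land on the azure-npm DaemonSet (an assumption, since the addon is managed by AKS), they can be confirmed with something like:

# Inspect the CPU/memory requests and limits on the azure-npm container
kubectl -n kube-system get daemonset azure-npm \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'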