Comments or ideas, anybody? This issue prevents us from upgrading our PROD environment at the moment. Help is very much appreciated.
Hi @chreichert, could you paste the following output from your cluster?
```
kubectl get nodes -o wide
kubectl get pods --all-namespaces
```
Thanks!
We had an AKS cluster running k8s 1.13.5, built using Terraform, which we upgraded to 1.14.0 over the weekend. Now the MongoDB replica set (chart), which ran fine on 1.13.5, explodes under the tiniest load with this same error.
Unfortunately I killed my cluster while trying to downgrade. I will try to revive it by upgrading with 0.36.0 in the next few days and will report the results here.
The following was found in my shell history; unfortunately there is no `nodes -o wide` output:
```
NAME                              STATUS   ROLES    AGE     VERSION
k8s-dynamic-11480702-vmss000000   Ready    agent    194d    v1.14.1
k8s-dynamic-11480702-vmss000001   Ready    agent    194d    v1.14.1
k8s-dynamic-11480702-vmss000002   Ready    agent    194d    v1.14.1
k8s-dynamic-11480702-vmss000003   Ready    agent    194d    v1.14.1
k8s-dynamic-11480702-vmss0000ks   Ready    agent    44d     v1.14.1
k8s-dynamic-11480702-vmss0000kt   Ready    agent    44d     v1.14.1
k8s-dynamic-11480702-vmss0000md   Ready    agent    35d     v1.14.1
k8s-elastic-11480702-vmss000000   Ready    agent    44d     v1.14.1
k8s-elastic-11480702-vmss000001   Ready    agent    44d     v1.14.1
k8s-elastic-11480702-vmss000002   Ready    agent    44d     v1.14.1
k8s-elastic-11480702-vmss000003   Ready    agent    44d     v1.14.1
k8s-elastic-11480702-vmss000004   Ready    agent    44d     v1.14.1
k8s-graph-11480702-vmss000000     Ready    agent    194d    v1.14.1
k8s-master-11480702-0             Ready    master   8m11s   v1.14.1
k8s-master-11480702-1             Ready    master   4h8m    v1.14.1
k8s-master-11480702-2             Ready    master   5h10m   v1.14.1
k8s-static-11480702-vmss000000    Ready    agent    194d    v1.14.1
k8s-static-11480702-vmss000001    Ready    agent    194d    v1.14.1
k8s-static-11480702-vmss000002    Ready    agent    194d    v1.14.1
k8s-static-11480702-vmss000003    Ready    agent    194d    v1.14.1
k8s-static-11480702-vmss000004    Ready    agent    194d    v1.14.1
k8s-static-11480702-vmss000005    Ready    agent    194d    v1.14.1
k8s-static-11480702-vmss000006    Ready    agent    194d    v1.14.1
k8s-static-11480702-vmss000013    Ready    agent    35d     v1.14.1
```
```
NAMESPACE   NAME                   READY   STATUS    RESTARTS   AGE   IP           NODE
default     omsagent-msoms-24d2c   1/1     Running   0          19h   10.244.2.2   k8s-elastic-11480702-vmss000001
```
Is this a request for help?: Yes
Is this an ISSUE or FEATURE REQUEST? (choose one): Issue
What version of aks-engine?: 0.35.1
Kubernetes version: 1.14.1
What happened: After upgrading our QA cluster with AKS-Engine 0.35.1 from K8s 1.11.6 to 1.14.1 (via 1.12.8 and 1.13.5), workload pods no longer start, or crash after a while with the error "pthread_create() failed (11: Resource temporarily unavailable)" or similar. Crashing pods include, for example, RabbitMQ and the Nginx ingress controller.
`kubectl describe pod` shows:
Most of the system pods run, but some of them (calico, for example) crash too.
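Not part of the original report, but as a diagnostic sketch: `pthread_create()` failing with `EAGAIN` usually means the process ran into a thread/PID or memory limit, so it can help to check which limits a crashing container actually sees. Pod name and namespace below are placeholders, and the cgroup path assumes a cgroup v1 layout:

```
# Placeholder pod/namespace; substitute one of the crashing pods.
POD=nginx-ingress-controller-xxxxx
NS=ingress

# Max user processes/threads (RLIMIT_NPROC) inside the container.
kubectl -n "$NS" exec "$POD" -- sh -c 'ulimit -u'

# PID limit enforced on the pod's cgroup, if any (cgroup v1 path).
kubectl -n "$NS" exec "$POD" -- cat /sys/fs/cgroup/pids/pids.max

# Rough count of processes currently running in the container.
kubectl -n "$NS" exec "$POD" -- sh -c 'ls /proc | grep -c "^[0-9]"'
```

If `pids.max` is unexpectedly low compared to what the workload needs, a pod PID limit is a plausible culprit.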
Cluster was initially set up with ACS-Engine 0.24.1 (k8s 1.10.9) and upgraded successfully to k8s 1.11.6 with AKS-Engine 0.29.1.
What you expected to happen:
Cluster running normally, with our workloads that ran fine until the upgrade with 0.35.1.
How to reproduce it (as minimally and precisely as possible): Initial setup of the cluster with acs-engine 0.24.1:
Upgrade to 1.11.5 with AKS-Engine 0.29.1 (successful)
Upgrade to 1.11.6 with AKS-Engine 0.29.1 (successful)
Upgrade to 1.14.1 via 1.12.8 and 1.13.5 (three steps) with AKS-Engine 0.35.1
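The exact command lines are not included in the report; assuming the standard `aks-engine upgrade` flow was used, each step would look roughly like the following sketch (resource group, API model path, location and credentials are placeholders):

```
# Hypothetical values; substitute your own resource group, api-model path and service principal.
aks-engine upgrade \
  --subscription-id "$SUBSCRIPTION_ID" \
  --resource-group my-k8s-rg \
  --location westeurope \
  --api-model _output/my-k8s/apimodel.json \
  --upgrade-version 1.12.8 \
  --client-id "$SP_CLIENT_ID" \
  --client-secret "$SP_CLIENT_SECRET"
# Repeat with --upgrade-version 1.13.5 and then 1.14.1 for the remaining steps.
```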
Anything else we need to know: Luckily, this was our test upgrade on our staging environment before doing the actual upgrade of our PROD environment.