Azure / aks-engine

AKS Engine: legacy tool for Kubernetes on Azure (see status)
https://github.com/Azure/aks-engine
MIT License

Pods crash with: pthread_create() failed (11: Resource temporarily unavailable) after cluster upgrade #1270

Closed chreichert closed 5 years ago

chreichert commented 5 years ago

Is this a request for help?: Yes


Is this an ISSUE or FEATURE REQUEST? (choose one): Issue


What version of aks-engine?: 0.35.1


Kubernetes version: 1.14.1

What happened: After upgrading our QA cluster with AKS-Engine 0.35.1 from K8s 1.11.6 to 1.14.1 (via 1.12.8 and 1.13.5), workload pods no longer start, or they crash after a while with the error "pthread_create() failed (11: Resource temporarily unavailable)" or similar. Crashing pods include, for example, RabbitMQ and the Nginx-Ingress controller.

kubectl describe pod shows:

```
Name:               rabbitmq-rabbitmq-ha-0
Namespace:          qa
Priority:           0
PriorityClassName:
Node:               k8s-static-11480702-vmss000013/10.239.0.18
Start Time:         Thu, 09 May 2019 16:55:53 +0200
Labels:             component=rabbitmq
                    controller-revision-hash=rabbitmq-rabbitmq-ha-5cc8495b8f
                    statefulset.kubernetes.io/pod-name=rabbitmq-rabbitmq-ha-0
                    type=server
Annotations:        cni.projectcalico.org/podIP=10.244.92.4/32
Status:             Running
IP:                 10.244.92.4
Controlled By:      StatefulSet/rabbitmq-rabbitmq-ha
Init Containers:
  copy-rabbitmq-config:
    Container ID:  docker://5816dd3e044a0ddc497dc0de1cb3736020f9153ea1f38752e022a22ceb014877
    Image:         qnowsacr.azurecr.io/external/busybox:1.29.2@sha256:3058e3a1129c64da64d5c7889e6eedb0666262d7ee69b289f2d4379f69362383
    Image ID:      docker-pullable://qnowsacr.azurecr.io/external/busybox@sha256:3058e3a1129c64da64d5c7889e6eedb0666262d7ee69b289f2d4379f69362383
    Port:
    Host Port:
    Command:
      sh
      -c
      cp /configmap/* /etc/rabbitmq; rm -f /var/lib/rabbitmq/.erlang.cookie
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 09 May 2019 16:56:34 +0200
      Finished:     Thu, 09 May 2019 16:56:34 +0200
    Ready:          True
    Restart Count:  0
    Environment:
    Mounts:
      /configmap from configmap (rw)
      /etc/rabbitmq from config (rw)
      /var/lib/rabbitmq from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from rabbitmq-token-6qx8h (ro)
Containers:
  rabbitmq-ha:
    Container ID:   docker://3313cb9f1cbdbe67c5fb19bfb3c2ade0eaf95456917e9aa04636c1c5740b009b
    Image:          qnowsacr.azurecr.io/external/rabbitmq:3.7.8-management-alpine@sha256:062935e77e35e8e7d677decf841cabf0c7c84d80d3d8ea362ad612c2d3c05e70
    Image ID:       docker-pullable://qnowsacr.azurecr.io/external/rabbitmq@sha256:062935e77e35e8e7d677decf841cabf0c7c84d80d3d8ea362ad612c2d3c05e70
    Ports:          4369/TCP, 5672/TCP, 15672/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 09 May 2019 16:56:42 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     8
      memory:  55Gi
    Requests:
      cpu:     8
      memory:  55Gi
    Environment:
      MY_POD_NAME:             rabbitmq-rabbitmq-ha-0 (v1:metadata.name)
      RABBITMQ_USE_LONGNAME:   true
      RABBITMQ_NODENAME:       rabbit@$(MY_POD_NAME).rabbitmq-rabbitmq-ha-discovery.qa.svc.cluster.local
      K8S_HOSTNAME_SUFFIX:     .rabbitmq-rabbitmq-ha-discovery.qa.svc.cluster.local
      K8S_SERVICE_NAME:        rabbitmq-rabbitmq-ha-discovery
      RABBITMQ_ERLANG_COOKIE:  <set to the key 'rabbitmq-erlang-cookie' in secret 'rabbitmq-provided'>  Optional: false
      RABBITMQ_DEFAULT_USER:   <set to the key 'rabbitmq-admin-username' in secret 'rabbitmq-provided'>  Optional: false
      RABBITMQ_DEFAULT_PASS:   <set to the key 'rabbitmq-admin-password' in secret 'rabbitmq-provided'>  Optional: false
      RABBITMQ_DEFAULT_VHOST:  /
    Mounts:
      /etc/rabbitmq from config (rw)
      /var/lib/rabbitmq from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from rabbitmq-token-6qx8h (ro)
  rabbitmq-ha-exporter:
    Container ID:  docker://ffe6e5f755d9b77eb62addbccfe557756c21c73debb4274915e193f72fcbc4f6
    Image:         qnowsacr.azurecr.io/external/rabbitmq-exporter:v0.29.0@sha256:424c036132bfe7f31674eb9a4d0c60395ec6fd794ab08e5eda6f206e13984b21
    Image ID:      docker-pullable://qnowsacr.azurecr.io/external/rabbitmq-exporter@sha256:424c036132bfe7f31674eb9a4d0c60395ec6fd794ab08e5eda6f206e13984b21
    Port:          9419/TCP
    Host Port:     0/TCP
    State:         Waiting
      Reason:      CrashLoopBackOff
    Last State:    Terminated
      Reason:      ContainerCannotRun
      Message:     OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"read init-p: connection reset by peer\"": unknown
      Exit Code:   128
      Started:     Thu, 09 May 2019 16:56:58 +0200
      Finished:    Thu, 09 May 2019 16:56:58 +0200
    Ready:          False
    Restart Count:  2
    Environment:
      PUBLISH_PORT:         9419
      RABBIT_CAPABILITIES:  bert,no_sort
      RABBIT_USER:          admin
      RABBIT_PASSWORD:      <set to the key 'rabbitmq-admin-password' in secret 'rabbitmq-provided'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from rabbitmq-token-6qx8h (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-rabbitmq-rabbitmq-ha-0
    ReadOnly:   false
  config:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  configmap:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rabbitmq
    Optional:  false
  rabbitmq-token-6qx8h:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rabbitmq-token-6qx8h
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  agentpool=static
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                From                                      Message
  Normal   Scheduled               1m                 default-scheduler                         Successfully assigned qa/rabbitmq-rabbitmq-ha-0 to k8s-static-11480702-vmss000013
  Normal   SuccessfulAttachVolume  1m                 attachdetach-controller                   AttachVolume.Attach succeeded for volume "pvc-3f916d40-5615-11e9-9006-000d3ab8e732"
  Normal   Pulling                 52s                kubelet, k8s-static-11480702-vmss000013   Pulling image "qnowsacr.azurecr.io/external/busybox:1.29.2@sha256:3058e3a1129c64da64d5c7889e6eedb0666262d7ee69b289f2d4379f69362383"
  Normal   Pulled                  51s                kubelet, k8s-static-11480702-vmss000013   Successfully pulled image "qnowsacr.azurecr.io/external/busybox:1.29.2@sha256:3058e3a1129c64da64d5c7889e6eedb0666262d7ee69b289f2d4379f69362383"
  Normal   Created                 51s                kubelet, k8s-static-11480702-vmss000013   Created container copy-rabbitmq-config
  Normal   Started                 51s                kubelet, k8s-static-11480702-vmss000013   Started container copy-rabbitmq-config
  Normal   Pulling                 50s                kubelet, k8s-static-11480702-vmss000013   Pulling image "qnowsacr.azurecr.io/external/rabbitmq:3.7.8-management-alpine@sha256:062935e77e35e8e7d677decf841cabf0c7c84d80d3d8ea362ad612c2d3c05e70"
  Normal   Pulled                  45s                kubelet, k8s-static-11480702-vmss000013   Successfully pulled image "qnowsacr.azurecr.io/external/rabbitmq:3.7.8-management-alpine@sha256:062935e77e35e8e7d677decf841cabf0c7c84d80d3d8ea362ad612c2d3c05e70"
  Normal   Created                 44s                kubelet, k8s-static-11480702-vmss000013   Created container rabbitmq-ha
  Normal   Started                 43s                kubelet, k8s-static-11480702-vmss000013   Started container rabbitmq-ha
  Normal   Pulling                 27s (x3 over 43s)  kubelet, k8s-static-11480702-vmss000013   Pulling image "qnowsacr.azurecr.io/external/rabbitmq-exporter:v0.29.0@sha256:424c036132bfe7f31674eb9a4d0c60395ec6fd794ab08e5eda6f206e13984b21"
  Normal   Pulled                  27s (x3 over 42s)  kubelet, k8s-static-11480702-vmss000013   Successfully pulled image "qnowsacr.azurecr.io/external/rabbitmq-exporter:v0.29.0@sha256:424c036132bfe7f31674eb9a4d0c60395ec6fd794ab08e5eda6f206e13984b21"
  Normal   Created                 27s (x3 over 42s)  kubelet, k8s-static-11480702-vmss000013   Created container rabbitmq-ha-exporter
  Warning  Failed                  26s (x3 over 41s)  kubelet, k8s-static-11480702-vmss000013   Error: failed to start container "rabbitmq-ha-exporter": Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"read init-p: connection reset by peer\"": unknown
  Warning  BackOff                 8s (x3 over 12s)   kubelet, k8s-static-11480702-vmss000013   Back-off restarting failed container
```

Most of the system pods run, but some of them (Calico, for example) crash too.
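
For anyone triaging this: `pthread_create()` returning EAGAIN inside a container usually points at a thread/PID ceiling rather than a lack of memory. A rough diagnostic sketch follows; it is not part of the original report, and the cgroup path is an assumption that varies by OS and kubelet version.

```
# Run on an affected node (sketch only; paths are assumptions).

# Which per-pod PID limit is the kubelet actually enforcing?
ps -ef | grep -o '\-\-pod-max-pids=[0-9-]*' | head -1

# Per-pod pids cgroups; a small ceiling such as 100 here would explain
# pthread_create() failing with EAGAIN inside busy pods like RabbitMQ.
find /sys/fs/cgroup/pids/kubepods -name pids.max | head -5 | xargs grep -H .
```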

Cluster was initially set up with ACS-Engine 0.24.1 (k8s 1.10.9) and upgraded successfully to k8s 1.11.6 with AKS-Engine 0.29.1.

What you expected to happen:

The cluster running normally with our workloads, which used to run fine until upgrading with 0.35.1.

How to reproduce it (as minimally and precisely as possible): initial setup of the cluster with acs-engine 0.24.1, then:

- Upgrade to 1.11.5 with AKS-Engine 0.29.1 (successful)
- Upgrade to 1.11.6 with AKS-Engine 0.29.1 (successful)
- Upgrade to 1.14.1 via 1.12.8 and 1.13.5 (three steps) with AKS-Engine 0.35.1
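
For context, each of those hops is a separate `aks-engine upgrade` run; a sketch of one such invocation is shown below. All values are placeholders, not taken from this issue.

```
# Sketch of a single upgrade step; repeat with the next --upgrade-version for each hop.
aks-engine upgrade \
  --api-model _output/<dns-prefix>/apimodel.json \
  --location <azure-region> \
  --resource-group <resource-group> \
  --subscription-id <subscription-id> \
  --client-id <service-principal-id> \
  --client-secret <service-principal-secret> \
  --upgrade-version 1.12.8
```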

Anything else we need to know: Luckily this was our test upgrade on our staging environment before doing the actual upgrade of our PROD environment.

chreichert commented 5 years ago

Comments or ideas, anybody? This issue prevents us from upgrading our PROD environment at the moment. Help is very much appreciated.

jackfrancis commented 5 years ago

Hi @chreichert, could you paste the following output from your cluster?

Thanks!

adamlundrigan commented 5 years ago

We had an AKS cluster running k8s 1.13.5, built using Terraform, which we upgraded to 1.14.0 over the weekend. Now the MongoDB replica set (chart), which ran fine on 1.13.5, explodes under the tiniest load with this same error.

Some logs showing the failure ``` 2019-05-20T23:47:46.158+0000 I REPL [replexec-2] Starting an election, since we've seen no PRIMARY in the past 10000ms 2019-05-20T23:47:46.158+0000 I REPL [replexec-0] VoteRequester(term 116 dry run) received a yes vote from mongo-mongodb-replicaset-0.mongo-mongodb-replicaset.common.svc.cluster.local:27017 response message: { term: 116, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1558395741, 1), $clusterTime: { clusterTime: Timestamp(1558395741, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } } 2019-05-20T23:47:46.158+0000 I REPL [replexec-0] dry election run succeeded, running for election in term 117 2019-05-20T23:47:46.158+0000 I REPL [replexec-2] conducting a dry run election to see if we could be elected. current term: 116 2019-05-20T23:47:46.160+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to mongo-mongodb-replicaset-1.mongo-mongodb-replicaset.common.svc.cluster.local:27017 - HostUnreachable: Connection refused 2019-05-20T23:47:46.160+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to mongo-mongodb-replicaset-1.mongo-mongodb-replicaset.common.svc.cluster.local:27017 due to failed operation on a connection 2019-05-20T23:47:46.163+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mongo-mongodb-replicaset-1.mongo-mongodb-replicaset.common.svc.cluster.local:27017 2019-05-20T23:47:46.164+0000 I REPL [replexec-2] VoteRequester(term 117) failed to receive response from mongo-mongodb-replicaset-1.mongo-mongodb-replicaset.common.svc.cluster.local:27017: HostUnreachable: Connection refused 2019-05-20T23:47:46.164+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to mongo-mongodb-replicaset-1.mongo-mongodb-replicaset.common.svc.cluster.local:27017 - HostUnreachable: Connection refused 2019-05-20T23:47:46.164+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to mongo-mongodb-replicaset-1.mongo-mongodb-replicaset.common.svc.cluster.local:27017 due to failed operation on a connection 2019-05-20T23:47:46.168+0000 I REPL [replexec-1] transition to PRIMARY from SECONDARY 2019-05-20T23:47:46.168+0000 I REPL [replexec-1] VoteRequester(term 117) received a yes vote from mongo-mongodb-replicaset-0.mongo-mongodb-replicaset.common.svc.cluster.local:27017 response message: { term: 117, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1558395741, 1), $clusterTime: { clusterTime: Timestamp(1558395741, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } } 2019-05-20T23:47:46.168+0000 F - [replexec-1] terminate() called. 
An exception is active attempting to gather more information 2019-05-20T23:47:46.168+0000 I REPL [replexec-1] Resetting sync source to empty, which was :27017 2019-05-20T23:47:46.168+0000 I REPL [replexec-1] election succeeded, assuming primary role in term 117 mongod(_ZN5mongo10ThreadPool25_startWorkerThread_inlockEv+0x99F) [0x55a256b38b5f] ----- END BACKTRACE ----- mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x55a2573635a1] mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl22_onVoteRequestCompleteEx+0x2BE) [0x55a256235f4e] mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl24_cancelHeartbeats_inlockEv+0xE1) [0x55a256237241] mongod(_ZN10__cxxabiv111__terminateEPFvvE+0x6) [0x55a2574575f6] {"backtrace":[{"b":"55A2550F9000","o":"226A5A1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"55A2550F9000","o":"2269F85"},{"b":"55A2550F9000","o":"235E5F6","s":"_ZN10__cxxabiv111__terminateEPFvvE"},{"b":"55A2550F9000","o":"235E641"},{"b":"55A2550F9000","o":"1A3FB5F","s":"_ZN5mongo10ThreadPool25_startWorkerThread_inlockEv"},{"b":"55A2550F9000","o":"1A403B8","s":"_ZN5mongo10ThreadPool8scheduleESt8functionIFvvEE"},{"b":"55A2550F9000","o":"1CE472C","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPNSt7__cxx114listISt10shared_ptrINS1_13CallbackStateEESaIS6_EEERKSt14_List_iteratorIS6_ESD_St11unique_lockISt5mutexE"},{"b":"55A2550F9000","o":"1CE4F2D","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPNSt7__cxx114listISt10shared_ptrINS1_13CallbackStateEESaIS6_EEERKSt14_List_iteratorIS6_ESt11unique_lockISt5mutexE"},{"b":"55A2550F9000","o":"1CE5EAE","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor6cancelERKNS0_12TaskExecutor14CallbackHandleE"},{"b":"55A2550F9000","o":"113E241","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl24_cancelHeartbeats_inlockEv"},{"b":"55A2550F9000","o":"1144671","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl25_restartHeartbeats_inlockEv"},{"b":"55A2550F9000","o":"1129C2E","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl40_postWonElectionUpdateMemberState_inlockEv"},{"b":"55A2550F9000","o":"113CF4E","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl22_onVoteRequestCompleteEx"},{"b":"55A2550F9000","o":"1CE3AB3","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE"},{"b":"55A2550F9000","o":"1CE3F9B"},{"b":"55A2550F9000","o":"1A3C34C","s":"_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockISt5mutexE"},{"b":"55A2550F9000","o":"1A3C84C","s":"_ZN5mongo10ThreadPool13_consumeTasksEv"},{"b":"55A2550F9000","o":"1A3D236","s":"_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"},{"b":"55A2550F9000","o":"2379850"},{"b":"7FA82DAE0000","o":"76BA"},{"b":"7FA82D716000","o":"10741D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.6.12", "gitVersion" : "c2b9acad0248ca06b14ef1640734b5d0595b55f1", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.15.0-1042-azure", "version" : "#46-Ubuntu SMP Thu Apr 4 16:30:23 UTC 2019", "machine" : "x86_64" }, "somap" : [ { "b" : "55A2550F9000", "elfType" : 3, "buildId" : "2B5EE1E50AC12CC569CE7CD8B7812FF349257B77" }, { "b" : "7FFFBF1E6000", "elfType" : 3, "buildId" : "DD321E9190D9BD55E4CD0080B2F9A163099EBD04" }, { "b" : "7FA82ECD6000", "path" : "/lib/x86_64-linux-gnu/libresolv.so.2", "elfType" : 3, "buildId" : "50A923F8DAFECBCD969C8573116A38C18D0E24D5" }, { "b" : "7FA82E891000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "15FFEB43278726B025F020862BF51302822A40EC" }, { "b" : 
"7FA82E628000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "FF69EA60EBE05F2DD689D2B26FC85A73E5FBC3A0" }, { "b" : "7FA82E424000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "37BFC3D8F7E3B022DAC7943B1A5FACD40CEBF0AD" }, { "b" : "7FA82E21C000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "69143E8B39040C964D3958490535322675F15DD3" }, { "b" : "7FA82DF13000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "BAD67A84E56E73D031AE507261DA066B35949D34" }, { "b" : "7FA82DCFD000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "68220AE2C65D65C1B6AAA12FA6765A6EC2F5F434" }, { "b" : "7FA82DAE0000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "B17C21299099640A6D863E423D99265824E7BB16" }, { "b" : "7FA82D716000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "1CA54A6E0D76188105B12E49FE6B8019BF08803A" }, { "b" : "7FA82EEF1000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "C0ADBAD6F9A33944F2B3567C078EC472A1DAE98E" } ] }} mongod(_ZN5mongo10ThreadPool13_consumeTasksEv+0xBC) [0x55a256b3584c] mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl25_restartHeartbeats_inlockEv+0x11) [0x55a25623d671] mongod(+0x2379850) [0x55a257472850] mongod(+0x2269F85) [0x55a257362f85] libc.so.6(clone+0x6D) [0x7fa82d81d41d] 2019-05-20T23:47:46.188+0000 F - [replexec-1] std::exception::what(): Resource temporarily unavailable mongod(_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x96) [0x55a256b36236] mongod(_ZN5mongo8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPNSt7__cxx114listISt10shared_ptrINS1_13CallbackStateEESaIS6_EEERKSt14_List_iteratorIS6_ESD_St11unique_lockISt5mutexE+0x24C) [0x55a256ddd72c] mongod(+0x235E641) [0x55a257457641] mongod(_ZN5mongo10ThreadPool8scheduleESt8functionIFvvEE+0x398) [0x55a256b393b8] libpthread.so.0(+0x76BA) [0x7fa82dae76ba] mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl40_postWonElectionUpdateMemberState_inlockEv+0x15E) [0x55a256222c2e] mongod(_ZN5mongo8executor22ThreadPoolTaskExecutor6cancelERKNS0_12TaskExecutor14CallbackHandleE+0x14E) [0x55a256ddeeae] mongod(+0x1CE3F9B) [0x55a256ddcf9b] mongod(_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE+0x1B3) [0x55a256ddcab3] mongod(_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockISt5mutexE+0x14C) [0x55a256b3534c] Actual exception type: std::system_error 0x55a2573635a1 0x55a257362f85 0x55a2574575f6 0x55a257457641 0x55a256b38b5f 0x55a256b393b8 0x55a256ddd72c 0x55a256dddf2d 0x55a256ddeeae 0x55a256237241 0x55a25623d671 0x55a256222c2e 0x55a256235f4e 0x55a256ddcab3 0x55a256ddcf9b 0x55a256b3534c 0x55a256b3584c 0x55a256b36236 0x55a257472850 0x7fa82dae76ba 0x7fa82d81d41d mongod(_ZN5mongo8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPNSt7__cxx114listISt10shared_ptrINS1_13CallbackStateEESaIS6_EEERKSt14_List_iteratorIS6_ESt11unique_lockISt5mutexE+0x4D) [0x55a256dddf2d] ----- BEGIN BACKTRACE ----- ```
kubectl get nodes -o wide

```
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
aks-xxxprodpool-29478185-0 Ready agent 47h v1.14.0 10.20.0.66 Ubuntu 16.04.6 LTS 4.15.0-1042-azure docker://3.0.4
aks-xxxprodpool-29478185-1 Ready agent 47h v1.14.0 10.20.0.35 Ubuntu 16.04.6 LTS 4.15.0-1042-azure docker://3.0.4
aks-xxxprodpool-29478185-2 Ready agent 47h v1.14.0 10.20.0.4 Ubuntu 16.04.6 LTS 4.15.0-1042-azure docker://3.0.4
aks-xxxprodpool-29478185-4 Ready agent 2d v1.14.0 10.20.0.129 Ubuntu 16.04.6 LTS 4.15.0-1042-azure docker://3.0.4
```
kubectl get pods --all-namespaces _I had already scaled the StatefulSet running the MongoDB down to zero at this point_ ``` NAMESPACE NAME READY STATUS RESTARTS AGE tenant1 sso-app-5d66f5647-b5wxg 1/1 Running 0 2d tenant1 sso-web-6cfb7767f7-hppdp 1/1 Running 0 47h tenant1 www-app-85969b4d84-qhjrg 1/1 Running 0 47h tenant1 www-web-7b9f9dbf7c-92gxg 1/1 Running 0 47h common redis-master-0 1/1 Running 0 2d common redis-slave-0 1/1 Running 1 2d common redis-slave-1 1/1 Running 0 47h tenant0 sso-app-bc87d4995-dcb5q 1/1 Running 0 47h tenant0 sso-web-6cfb7767f7-cmbb9 1/1 Running 0 2d tenant0 www-app-787c868777-58mbt 1/1 Running 0 47h tenant0 www-web-5d5455bf44-2fdjx 1/1 Running 0 2d kube-system azure-cni-networkmonitor-cgtxj 1/1 Running 0 47h kube-system azure-cni-networkmonitor-d4cfg 1/1 Running 0 2d kube-system azure-cni-networkmonitor-ssmrx 1/1 Running 0 2d kube-system azure-cni-networkmonitor-vwjmf 1/1 Running 0 47h kube-system azure-ip-masq-agent-47cv8 1/1 Running 0 47h kube-system azure-ip-masq-agent-ghwkv 1/1 Running 0 47h kube-system azure-ip-masq-agent-klfnr 1/1 Running 0 2d kube-system azure-ip-masq-agent-zvsgr 1/1 Running 0 2d kube-system coredns-74d5c9d599-6mnq7 1/1 Running 0 47h kube-system coredns-74d5c9d599-qqqbr 1/1 Running 0 47h kube-system coredns-autoscaler-6946b57db6-jnlfw 1/1 Running 0 47h kube-system kube-proxy-8d5sq 1/1 Running 0 47h kube-system kube-proxy-g5q5h 1/1 Running 0 2d kube-system kube-proxy-kjw22 1/1 Running 0 47h kube-system kube-proxy-m6ppb 1/1 Running 0 2d kube-system kube-svc-redirect-bf47c 2/2 Running 0 2d kube-system kube-svc-redirect-fjd9p 2/2 Running 0 47h kube-system kube-svc-redirect-kfcqt 2/2 Running 0 47h kube-system kube-svc-redirect-lfd54 2/2 Running 0 2d kube-system kubernetes-dashboard-c4f4999c8-lxxc9 1/1 Running 2 47h kube-system metrics-server-766dd9f7fd-v457r 1/1 Running 0 2d kube-system nginx-ingress-controller-65c869bb6d-clkxf 1/1 Running 0 2d kube-system nginx-ingress-controller-65c869bb6d-h6t7l 1/1 Running 0 2d kube-system nginx-ingress-default-backend-647c8f49bb-8tnsn 1/1 Running 0 2d kube-system omsagent-8tzfs 1/1 Running 0 47h kube-system omsagent-j2wrh 1/1 Running 0 2d kube-system omsagent-m9d6f 1/1 Running 0 2d kube-system omsagent-rs-79f67c9ffc-zx9l6 1/1 Running 0 47h kube-system omsagent-xc5nd 1/1 Running 1 47h kube-system tiller-deploy-664d6bdc7b-7zbkl 1/1 Running 0 47h kube-system tunnelfront-98dc59889-nx7kz 1/1 Running 0 2d tenant3 sso-app-76bd57569-gm44b 1/1 Running 0 47h tenant3 sso-web-6cfb7767f7-4hdhg 1/1 Running 0 2d tenant3 www-app-8659dd67bd-4jjbr 1/1 Running 0 2d tenant3 www-web-fbd7d8f47-566hv 1/1 Running 0 2d tenantshared cms-app-7d9bdc7f9-4jmwg 1/1 Running 0 47h tenantshared cms-app-7d9bdc7f9-4q54j 1/1 Running 0 47h tenantshared cms-app-7d9bdc7f9-pvnhf 1/1 Running 0 47h tenantshared cms-app-7d9bdc7f9-rt6m5 1/1 Running 1 2d tenantshared cms-web-5f698b7f58-xrbs4 1/1 Running 0 47h tenantshared cms-web-5f698b7f58-zdzqt 1/1 Running 0 2d tenantshared crm-app-564c9dcc89-6xv4w 1/1 Running 0 47h tenantshared crm-app-564c9dcc89-pnmj8 1/1 Running 0 47h tenantshared crm-app-564c9dcc89-twsms 1/1 Running 1 2d tenantshared crm-app-564c9dcc89-wb8xq 1/1 Running 0 47h tenantshared crm-memcached-0 1/1 Running 0 47h tenantshared crm-memcached-1 1/1 Running 0 47h tenantshared crm-web-598cf986d-z8zdw 1/1 Running 0 47h tenantshared lrs-api-55779fd8b8-fqkh7 1/1 Running 0 5h32m tenantshared lrs-api-55779fd8b8-lcwhg 1/1 Running 0 9h tenantshared lrs-app-7dcf7858bb-jz97l 1/1 Running 0 9h tenantshared lrs-app-7dcf7858bb-vmps7 1/1 Running 0 5h33m 
tenantshared lrs-web-57d7786dc-cjlmf 1/1 Running 0 5h33m tenantshared lrs-web-57d7786dc-fktt7 1/1 Running 0 47h tenantshared lrs-worker-865545fc94-xmgbn 1/1 Running 0 5h33m tenantshared lrs-worker-865545fc94-xzqd2 1/1 Running 0 9h tenantshared lrs-xapi-787db988c-bclb5 1/1 Running 0 9h tenantshared lrs-xapi-787db988c-d5kxg 1/1 Running 0 5h35m tenantshared lrs-xapi-787db988c-mcxgb 1/1 Running 0 5h35m tenant2 sso-app-7cf76596df-tg9g5 1/1 Running 0 47h tenant2 sso-web-6cfb7767f7-s56lk 1/1 Running 0 2d tenant2 www-app-6b8c5574c-nqrd7 1/1 Running 0 47h tenant2 www-web-78b6566986-mjhr7 1/1 Running 0 2d ```
chreichert commented 5 years ago

Unfortunately I killed my cluster trying to downgrade. I will try to revive the cluster with an upgrade using 0.36.0 in the next few days. I will report the results here then.

The following is what I found in my shell history; unfortunately there is no "nodes -o wide":

kubectl get nodes

```
NAME STATUS ROLES AGE VERSION
k8s-dynamic-11480702-vmss000000 Ready agent 194d v1.14.1
k8s-dynamic-11480702-vmss000001 Ready agent 194d v1.14.1
k8s-dynamic-11480702-vmss000002 Ready agent 194d v1.14.1
k8s-dynamic-11480702-vmss000003 Ready agent 194d v1.14.1
k8s-dynamic-11480702-vmss0000ks Ready agent 44d v1.14.1
k8s-dynamic-11480702-vmss0000kt Ready agent 44d v1.14.1
k8s-dynamic-11480702-vmss0000md Ready agent 35d v1.14.1
k8s-elastic-11480702-vmss000000 Ready agent 44d v1.14.1
k8s-elastic-11480702-vmss000001 Ready agent 44d v1.14.1
k8s-elastic-11480702-vmss000002 Ready agent 44d v1.14.1
k8s-elastic-11480702-vmss000003 Ready agent 44d v1.14.1
k8s-elastic-11480702-vmss000004 Ready agent 44d v1.14.1
k8s-graph-11480702-vmss000000 Ready agent 194d v1.14.1
k8s-master-11480702-0 Ready master 8m11s v1.14.1
k8s-master-11480702-1 Ready master 4h8m v1.14.1
k8s-master-11480702-2 Ready master 5h10m v1.14.1
k8s-static-11480702-vmss000000 Ready agent 194d v1.14.1
k8s-static-11480702-vmss000001 Ready agent 194d v1.14.1
k8s-static-11480702-vmss000002 Ready agent 194d v1.14.1
k8s-static-11480702-vmss000003 Ready agent 194d v1.14.1
k8s-static-11480702-vmss000004 Ready agent 194d v1.14.1
k8s-static-11480702-vmss000005 Ready agent 194d v1.14.1
k8s-static-11480702-vmss000006 Ready agent 194d v1.14.1
k8s-static-11480702-vmss000013 Ready agent 35d v1.14.1
```

kubectl get pods --all-namespaces

``` NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES default omsagent-msoms-24d2c 1/1 Running 0 19h 10.244.2.2 k8s-elastic-11480702-vmss000001 default omsagent-msoms-25wh2 1/1 Running 0 19h 10.244.4.2 k8s-elastic-11480702-vmss000003 default omsagent-msoms-2tdlf 1/1 Running 2 19h 10.244.104.2 k8s-dynamic-11480702-vmss0000md default omsagent-msoms-645ks 1/1 Running 0 19h 10.244.6.2 k8s-elastic-11480702-vmss000002 default omsagent-msoms-7fsg5 1/1 Running 0 19h 10.244.20.2 k8s-static-11480702-vmss000005 default omsagent-msoms-7hfpd 1/1 Running 0 19h 10.244.1.2 k8s-elastic-11480702-vmss000000 default omsagent-msoms-ch2fn 1/1 Running 0 19h 10.244.5.4 k8s-static-11480702-vmss000004 default omsagent-msoms-d8mnv 1/1 Running 0 19h 10.244.7.2 k8s-static-11480702-vmss000002 default omsagent-msoms-df49h 1/1 Running 1 19h 10.244.22.3 k8s-dynamic-11480702-vmss000002 default omsagent-msoms-dtvlk 1/1 Running 0 19h 10.244.92.2 k8s-static-11480702-vmss000013 default omsagent-msoms-fsdfl 1/1 Running 0 19h 10.244.0.2 k8s-elastic-11480702-vmss000004 default omsagent-msoms-lc8wk 1/1 Running 0 151m 10.244.3.2 k8s-dynamic-11480702-vmss0000oh default omsagent-msoms-lnxxw 1/1 Running 1 19h 10.244.27.3 k8s-graph-11480702-vmss000000 default omsagent-msoms-mxdqw 1/1 Running 1 151m 10.244.23.2 k8s-dynamic-11480702-vmss0000of default omsagent-msoms-n65j2 1/1 Running 1 19h 10.244.10.2 k8s-dynamic-11480702-vmss0000ks default omsagent-msoms-nxjdn 1/1 Running 0 19h 10.244.9.2 k8s-static-11480702-vmss000000 default omsagent-msoms-qvvkv 1/1 Running 0 19h 10.244.17.2 k8s-static-11480702-vmss000001 default omsagent-msoms-rl48s 1/1 Running 1 150m 10.244.25.2 k8s-dynamic-11480702-vmss0000og default omsagent-msoms-sbptz 1/1 Running 0 143m 10.244.21.2 k8s-static-11480702-vmss000014 default omsagent-msoms-tjd86 1/1 Running 0 19h 10.244.14.2 k8s-dynamic-11480702-vmss000001 default omsagent-msoms-zgxpj 1/1 Running 0 143m 10.244.11.2 k8s-static-11480702-vmss000015 kube-system azure-ip-masq-agent-4z52z 1/1 Running 0 19h 10.239.0.27 k8s-static-11480702-vmss000005 kube-system azure-ip-masq-agent-5625l 1/1 Running 0 19h 10.239.0.5 k8s-elastic-11480702-vmss000000 kube-system azure-ip-masq-agent-5dg5v 1/1 Running 0 19h 10.239.0.17 k8s-elastic-11480702-vmss000004 kube-system azure-ip-masq-agent-5nxnq 1/1 Running 0 151m 10.239.0.10 k8s-dynamic-11480702-vmss0000og kube-system azure-ip-masq-agent-bdg5h 1/1 Running 0 19h 10.239.0.8 k8s-dynamic-11480702-vmss000001 kube-system azure-ip-masq-agent-brpqr 1/1 Running 0 145m 10.239.0.29 k8s-static-11480702-vmss000015 kube-system azure-ip-masq-agent-bsmww 1/1 Running 0 19h 10.239.255.11 k8s-master-11480702-1 kube-system azure-ip-masq-agent-bvr9j 1/1 Running 0 19h 10.239.0.26 k8s-static-11480702-vmss000004 kube-system azure-ip-masq-agent-c56xv 1/1 Running 0 151m 10.239.0.14 k8s-dynamic-11480702-vmss0000oh kube-system azure-ip-masq-agent-ckxgd 1/1 Running 0 19h 10.239.0.16 k8s-elastic-11480702-vmss000003 kube-system azure-ip-masq-agent-djs8p 1/1 Running 0 19h 10.239.0.9 k8s-dynamic-11480702-vmss000002 kube-system azure-ip-masq-agent-dp79d 1/1 Running 0 19h 10.239.0.6 k8s-elastic-11480702-vmss000001 kube-system azure-ip-masq-agent-g2t4r 1/1 Running 0 19h 10.239.255.12 k8s-master-11480702-2 kube-system azure-ip-masq-agent-gf8c4 1/1 Running 0 19h 10.239.0.12 k8s-static-11480702-vmss000001 kube-system azure-ip-masq-agent-gqflb 1/1 Running 0 19h 10.239.0.13 k8s-dynamic-11480702-vmss0000ks kube-system azure-ip-masq-agent-jjhfd 1/1 Running 0 19h 10.239.0.20 
k8s-dynamic-11480702-vmss0000md kube-system azure-ip-masq-agent-llktn 1/1 Running 0 19h 10.239.0.4 k8s-graph-11480702-vmss000000 kube-system azure-ip-masq-agent-lmf95 1/1 Running 0 151m 10.239.0.7 k8s-dynamic-11480702-vmss0000of kube-system azure-ip-masq-agent-p7sww 1/1 Running 0 145m 10.239.0.24 k8s-static-11480702-vmss000014 kube-system azure-ip-masq-agent-qzz2d 1/1 Running 0 19h 10.239.0.15 k8s-elastic-11480702-vmss000002 kube-system azure-ip-masq-agent-sxtfx 1/1 Running 0 19h 10.239.0.23 k8s-static-11480702-vmss000002 kube-system azure-ip-masq-agent-t45d2 1/1 Running 0 19h 10.239.0.18 k8s-static-11480702-vmss000013 kube-system azure-ip-masq-agent-xv6dd 1/1 Running 0 19h 10.239.0.11 k8s-static-11480702-vmss000000 kube-system azure-ip-masq-agent-znnd8 1/1 Running 0 4h46m 10.239.255.10 k8s-master-11480702-0 kube-system calico-node-4pl2p 1/1 Running 0 19h 10.239.0.12 k8s-static-11480702-vmss000001 kube-system calico-node-4qk46 1/1 Running 0 19h 10.239.0.27 k8s-static-11480702-vmss000005 kube-system calico-node-6fkgv 1/1 Running 0 19h 10.239.0.8 k8s-dynamic-11480702-vmss000001 kube-system calico-node-9p6b8 1/1 Running 0 151m 10.239.0.14 k8s-dynamic-11480702-vmss0000oh kube-system calico-node-9wqqj 1/1 Running 0 151m 10.239.0.7 k8s-dynamic-11480702-vmss0000of kube-system calico-node-bf76m 0/1 Running 0 19h 10.239.0.4 k8s-graph-11480702-vmss000000 kube-system calico-node-dff49 1/1 Running 0 19h 10.239.0.16 k8s-elastic-11480702-vmss000003 kube-system calico-node-dwlgt 1/1 Running 0 151m 10.239.0.10 k8s-dynamic-11480702-vmss0000og kube-system calico-node-f7hnw 1/1 Running 0 4h46m 10.239.255.10 k8s-master-11480702-0 kube-system calico-node-h6cwh 0/1 Running 0 19h 10.239.0.5 k8s-elastic-11480702-vmss000000 kube-system calico-node-j8dpz 1/1 Running 0 19h 10.239.0.6 k8s-elastic-11480702-vmss000001 kube-system calico-node-jkwsz 1/1 Running 0 19h 10.239.0.11 k8s-static-11480702-vmss000000 kube-system calico-node-kd7qs 1/1 Running 0 146m 10.239.0.24 k8s-static-11480702-vmss000014 kube-system calico-node-krz9p 1/1 Running 0 19h 10.239.0.23 k8s-static-11480702-vmss000002 kube-system calico-node-ltvp7 1/1 Running 0 19h 10.239.0.13 k8s-dynamic-11480702-vmss0000ks kube-system calico-node-ms454 1/1 Running 0 19h 10.239.0.17 k8s-elastic-11480702-vmss000004 kube-system calico-node-ntsrt 1/1 Running 0 19h 10.239.0.26 k8s-static-11480702-vmss000004 kube-system calico-node-pkps4 1/1 Running 0 19h 10.239.0.20 k8s-dynamic-11480702-vmss0000md kube-system calico-node-q8jsj 1/1 Running 0 19h 10.239.255.11 k8s-master-11480702-1 kube-system calico-node-vs2vn 1/1 Running 0 146m 10.239.0.29 k8s-static-11480702-vmss000015 kube-system calico-node-xfwbv 1/1 Running 0 19h 10.239.0.18 k8s-static-11480702-vmss000013 kube-system calico-node-xz6p9 1/1 Running 0 19h 10.239.255.12 k8s-master-11480702-2 kube-system calico-node-zgdkv 1/1 Running 0 19h 10.239.0.9 k8s-dynamic-11480702-vmss000002 kube-system calico-node-zp885 1/1 Running 0 19h 10.239.0.15 k8s-elastic-11480702-vmss000002 kube-system calico-typha-5bd7cfc5fb-9nf47 1/1 Running 0 150m 10.239.0.26 k8s-static-11480702-vmss000004 kube-system calico-typha-5bd7cfc5fb-t5mgt 1/1 Running 0 19h 10.239.0.13 k8s-dynamic-11480702-vmss0000ks kube-system calico-typha-horizontal-autoscaler-b4f9b8fdc-7hx2p 1/1 Running 0 19h 10.244.5.2 k8s-static-11480702-vmss000004 kube-system coredns-7bc69b7975-wrrkr 1/1 Running 0 19h 10.244.14.3 k8s-dynamic-11480702-vmss000001 kube-system heapster-66bb49cb7b-vxmsx 2/2 Running 1 150m 10.244.0.3 k8s-elastic-11480702-vmss000004 kube-system 
kube-addon-manager-k8s-master-11480702-0 1/1 Running 0 4h45m 10.239.255.10 k8s-master-11480702-0 kube-system kube-addon-manager-k8s-master-11480702-1 1/1 Running 0 23h 10.239.255.11 k8s-master-11480702-1 kube-system kube-addon-manager-k8s-master-11480702-2 1/1 Running 0 23h 10.239.255.12 k8s-master-11480702-2 kube-system kube-apiserver-k8s-master-11480702-0 1/1 Running 0 4h45m 10.239.255.10 k8s-master-11480702-0 kube-system kube-apiserver-k8s-master-11480702-1 1/1 Running 0 23h 10.239.255.11 k8s-master-11480702-1 kube-system kube-apiserver-k8s-master-11480702-2 1/1 Running 0 23h 10.239.255.12 k8s-master-11480702-2 kube-system kube-controller-manager-k8s-master-11480702-0 1/1 Running 0 4h45m 10.239.255.10 k8s-master-11480702-0 kube-system kube-controller-manager-k8s-master-11480702-1 1/1 Running 0 23h 10.239.255.11 k8s-master-11480702-1 kube-system kube-controller-manager-k8s-master-11480702-2 1/1 Running 0 23h 10.239.255.12 k8s-master-11480702-2 kube-system kube-proxy-2xkd7 1/1 Running 0 19h 10.239.0.5 k8s-elastic-11480702-vmss000000 kube-system kube-proxy-4w4qj 1/1 Running 0 19h 10.239.0.23 k8s-static-11480702-vmss000002 kube-system kube-proxy-59jmc 1/1 Running 0 19h 10.239.0.18 k8s-static-11480702-vmss000013 kube-system kube-proxy-62rdx 1/1 Running 0 151m 10.239.0.7 k8s-dynamic-11480702-vmss0000of kube-system kube-proxy-68zrs 1/1 Running 0 19h 10.239.0.6 k8s-elastic-11480702-vmss000001 kube-system kube-proxy-dbvhg 1/1 Running 0 19h 10.239.0.16 k8s-elastic-11480702-vmss000003 kube-system kube-proxy-dccdg 1/1 Running 0 151m 10.239.0.10 k8s-dynamic-11480702-vmss0000og kube-system kube-proxy-dh8ms 1/1 Running 0 19h 10.239.0.4 k8s-graph-11480702-vmss000000 kube-system kube-proxy-f7jwn 1/1 Running 0 19h 10.239.255.12 k8s-master-11480702-2 kube-system kube-proxy-fxt4b 1/1 Running 0 19h 10.239.0.17 k8s-elastic-11480702-vmss000004 kube-system kube-proxy-k6mpx 1/1 Running 0 151m 10.239.0.14 k8s-dynamic-11480702-vmss0000oh kube-system kube-proxy-kxrd5 1/1 Running 0 19h 10.239.0.20 k8s-dynamic-11480702-vmss0000md kube-system kube-proxy-ljbhj 1/1 Running 0 19h 10.239.0.26 k8s-static-11480702-vmss000004 kube-system kube-proxy-nmftd 1/1 Running 0 19h 10.239.0.9 k8s-dynamic-11480702-vmss000002 kube-system kube-proxy-p4d8j 1/1 Running 0 19h 10.239.0.12 k8s-static-11480702-vmss000001 kube-system kube-proxy-p729w 1/1 Running 0 19h 10.239.0.27 k8s-static-11480702-vmss000005 kube-system kube-proxy-qrnvw 1/1 Running 0 19h 10.239.0.8 k8s-dynamic-11480702-vmss000001 kube-system kube-proxy-rsbhc 1/1 Running 0 145m 10.239.0.24 k8s-static-11480702-vmss000014 kube-system kube-proxy-s9jps 1/1 Running 0 19h 10.239.0.11 k8s-static-11480702-vmss000000 kube-system kube-proxy-sb4bn 1/1 Running 0 19h 10.239.0.15 k8s-elastic-11480702-vmss000002 kube-system kube-proxy-szvgn 1/1 Running 0 4h46m 10.239.255.10 k8s-master-11480702-0 kube-system kube-proxy-wzwn2 1/1 Running 0 19h 10.239.0.13 k8s-dynamic-11480702-vmss0000ks kube-system kube-proxy-x98lv 1/1 Running 0 145m 10.239.0.29 k8s-static-11480702-vmss000015 kube-system kube-proxy-xnxqz 1/1 Running 0 19h 10.239.255.11 k8s-master-11480702-1 kube-system kube-scheduler-k8s-master-11480702-0 1/1 Running 0 4h45m 10.239.255.10 k8s-master-11480702-0 kube-system kube-scheduler-k8s-master-11480702-1 1/1 Running 0 23h 10.239.255.11 k8s-master-11480702-1 kube-system kube-scheduler-k8s-master-11480702-2 1/1 Running 0 23h 10.239.255.12 k8s-master-11480702-2 kube-system kubernetes-dashboard-7b5859758b-k788b 1/1 Running 0 19h 10.244.27.2 k8s-graph-11480702-vmss000000 kube-system 
metrics-server-5fdc668b9b-vx8pl 1/1 Running 0 19h 10.244.22.2 k8s-dynamic-11480702-vmss000002 kube-system tiller-deploy-88c69b9b-x5k6p 1/1 Running 0 150m 10.244.27.4 k8s-graph-11480702-vmss000000 qa prometheus-kube-state-metrics-54bd47b45f-8b726 1/1 Running 0 19h 10.244.92.3 k8s-static-11480702-vmss000013 qa prometheus-node-exporter-2tddg 1/1 Running 0 19h 10.239.0.6 k8s-elastic-11480702-vmss000001 qa prometheus-node-exporter-58dct 1/1 Running 0 19h 10.239.0.8 k8s-dynamic-11480702-vmss000001 qa prometheus-node-exporter-5969s 1/1 Running 0 19h 10.239.0.12 k8s-static-11480702-vmss000001 qa prometheus-node-exporter-6fwlr 1/1 Running 0 19h 10.239.0.27 k8s-static-11480702-vmss000005 qa prometheus-node-exporter-9lpt5 1/1 Running 0 19h 10.239.0.4 k8s-graph-11480702-vmss000000 qa prometheus-node-exporter-b5w9m 1/1 Running 0 19h 10.239.0.13 k8s-dynamic-11480702-vmss0000ks qa prometheus-node-exporter-c929k 1/1 Running 0 143m 10.239.0.29 k8s-static-11480702-vmss000015 qa prometheus-node-exporter-cb2gp 1/1 Running 0 143m 10.239.0.24 k8s-static-11480702-vmss000014 qa prometheus-node-exporter-ch6tc 1/1 Running 0 150m 10.239.0.10 k8s-dynamic-11480702-vmss0000og qa prometheus-node-exporter-hmndz 1/1 Running 0 19h 10.239.0.16 k8s-elastic-11480702-vmss000003 qa prometheus-node-exporter-j9hbm 1/1 Running 0 151m 10.239.0.7 k8s-dynamic-11480702-vmss0000of qa prometheus-node-exporter-jqqth 1/1 Running 0 19h 10.239.0.5 k8s-elastic-11480702-vmss000000 qa prometheus-node-exporter-k47b9 1/1 Running 0 19h 10.239.0.20 k8s-dynamic-11480702-vmss0000md qa prometheus-node-exporter-ktxpw 1/1 Running 0 19h 10.239.0.18 k8s-static-11480702-vmss000013 qa prometheus-node-exporter-m6l9n 1/1 Running 0 19h 10.239.0.15 k8s-elastic-11480702-vmss000002 qa prometheus-node-exporter-nsgcf 1/1 Running 0 19h 10.239.0.9 k8s-dynamic-11480702-vmss000002 qa prometheus-node-exporter-qfwb2 1/1 Running 0 151m 10.239.0.14 k8s-dynamic-11480702-vmss0000oh qa prometheus-node-exporter-sxctq 1/1 Running 0 19h 10.239.0.11 k8s-static-11480702-vmss000000 qa prometheus-node-exporter-t2dpb 1/1 Running 0 19h 10.239.0.17 k8s-elastic-11480702-vmss000004 qa prometheus-node-exporter-wljxr 1/1 Running 0 19h 10.239.0.26 k8s-static-11480702-vmss000004 qa prometheus-node-exporter-xrr5q 1/1 Running 0 19h 10.239.0.23 k8s-static-11480702-vmss000002 qa qa-ingress-nginx-ingress-controller-c48c4ccb-lmk2l 1/1 Running 4 19h 10.244.5.3 k8s-static-11480702-vmss000004 qa qknows-default-backend-7d68787864-zdtm4 1/1 Running 0 150m 10.244.104.3 k8s-dynamic-11480702-vmss0000md qa rabbitmq-rabbitmq-ha-0 1/2 RunContainerError 7 3m55s 10.244.17.3 k8s-static-11480702-vmss000001 ```

jackfrancis commented 5 years ago

@chreichert yes, thank you, it would make more sense that these failures were reproducible on new clusters as well, and not isolated to clusters having traversed the upgrade flow.

(Though that's arguably a worse error, it's at least easier to reason/isolate as being something in 1.14)

chreichert commented 5 years ago

@jackfrancis I was able to recover my QA cluster by force-downgrading it to 1.13.5 with aks-engine 0.34.3 and a little help from directly deploying manually patched templates with az group deployments.

(The problem was that aks-engine upgrade fails after every master deployment in our Azure AD enabled cluster, which does not go well with --force. In a multi-master setup this only downgrades the first master and the agents. The remaining masters must be done with az deployments, which I finally managed to get done.)

First tests are promising: everything is running again like before. Will do some more tests tomorrow.

So it looks like it really might be something in 1.14...

tomgallard commented 5 years ago

Just a note that we are seeing exactly this issue on our AKS QA cluster after upgrading to 1.14.0. It affects the Nginx Ingress pods only, as far as I can see.

Have raised a support ticket as well.

tomgallard commented 5 years ago

Worth noting for anyone else experiencing this issue: we resolved it for now by adding explicit CPU and memory limits to the nginx-ingress deployment. Not sure why this fixed things, but it did.
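
For anyone who wants to try the same mitigation, a sketch of adding such limits with kubectl is shown below; the namespace, deployment name, and sizes are placeholders rather than values confirmed in this thread.

```
# Sketch only: give the ingress controller explicit requests/limits.
# Namespace, deployment name, and sizes are placeholders.
kubectl -n kube-system set resources deployment nginx-ingress-controller \
  --requests=cpu=100m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi
```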

chreichert commented 5 years ago

Unfortunately that does not help in our case. We already had CPU and memory limits defined, and the RabbitMQ pods still crash after the 1.14.1 upgrade.

jackfrancis commented 5 years ago

@chreichert to confirm: have we repro'd this on a newly built 1.14.n cluster as well?

chreichert commented 5 years ago

@jackfrancis I will try to repro this with our apimodel on a newly built 1.14.1 cluster tomorrow. I will get back after doing this.

I could reproduce the issue by setting up a new cluster and upgrading it step by step to 1.14.1 (following our original path for our production environment):

My steps to reproduce:

  1. Setup a fresh 1.10.7 cluster using ACS-Engine 0.21.2

    api-model

    ``` { "apiVersion": "vlabs", "properties": { "orchestratorProfile": { "orchestratorType": "Kubernetes", "orchestratorRelease": "1.10", "kubernetesConfig": { "addons": [ { "name": "blobfuse-flexvolume", "enabled": false }, { "name": "smb-flexvolume", "enabled": false }, { "name": "keyvault-flexvolume", "enabled": false }, { "name": "cluster-autoscaler", "enabled": false, "containers": [ { "name": "cluster-autoscaler", "cpuRequests": "100m", "memoryRequests": "300Mi", "cpuLimits": "100m", "memoryLimits": "300Mi" } ], "config": { "maxNodes": "5", "minNodes": "1" } } ], "enableRbac": true, "privateCluster": { "enabled": true }, "networkPlugin": "kubenet", "networkPolicy": "calico", "cloudProviderBackoff": true, "cloudProviderBackoffRetries": 6, "cloudProviderBackoffJitter": 1, "cloudProviderBackoffDuration": 5, "cloudProviderBackoffExponent": 1.5, "cloudProviderRateLimit": false, "cloudProviderRateLimitQPS": 3, "cloudProviderRateLimitBucket": 10 } }, "aadProfile": { "serverAppID": "***", "clientAppID": "***", "tenantID": "***" }, "masterProfile": { "count": 3, "dnsPrefix": "***", "vmSize": "Standard_D2s_v3", "OSDiskSizeGB": 128, "vnetSubnetId": "/subscriptions/***/resourceGroups/***/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet", "firstConsecutiveStaticIP": "10.239.255.10", "vnetCidr": "10.239.0.0/16" }, "agentPoolProfiles": [ { "name": "dynamic", "count": 4, "vmSize": "Standard_D16s_v3", "OSDiskSizeGB": 128, "storageProfile": "ManagedDisks", "availabilityProfile": "VirtualMachineScaleSets", "vnetSubnetId": "/subscriptions/***/resourceGroups/***/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet" }, { "name": "graph", "count": 1, "vmSize": "Standard_E32s_v3", "OSDiskSizeGB": 128, "storageProfile": "ManagedDisks", "availabilityProfile": "VirtualMachineScaleSets", "vnetSubnetId": "/subscriptions/***/resourceGroups/***/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet" }, { "name": "static", "count": 8, "vmSize": "Standard_D16s_v3", "OSDiskSizeGB": 128, "storageProfile": "ManagedDisks", "availabilityProfile": "VirtualMachineScaleSets", "vnetSubnetId": "/subscriptions/***/resourceGroups/***/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet" }, { "name": "elastic", "count": 5, "vmSize": "Standard_D32s_v3", "OSDiskSizeGB": 128, "storageProfile": "ManagedDisks", "availabilityProfile": "VirtualMachineScaleSets", "vnetSubnetId": "/subscriptions/***/resourceGroups/***/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet" } ], "linuxProfile": { "adminUsername": "azureuser", "ssh": { "publicKeys": [ { "keyData": "***" } ] } }, "servicePrincipalProfile": { "clientId": "***", "secret": "***" } } } ```

  2. Upgrade cluster to 1.11.5 using ACS-Engine 0.26.2 -> working fine

  3. Upgrade cluster to 1.11.6 using AKS-Engine 0.29.1 -> working fine (representing current PROD state)

  4. Upgrade cluster to 1.12.8 using AKS-Engine 0.36.4 -> some manual tweaking was necessary due to the Calico upgrade, but it finally worked fine (the first master became ready after adding tolerations to the calico-node pod; agent nodes needed re-imaging to become ready)

  5. Upgrade cluster to 1.13.5 using AKS-Engine 0.36.4 -> working fine

  6. Upgrade cluster to 1.14.1 using AKS-Engine 0.36.4 -> rabbitmq and other pods crash with the pthread error

chreichert commented 5 years ago

@jackfrancis Today I did a direct setup of a fresh 1.14.1 cluster using AKS-Engine 0.36.4 and our apimodel as posted above. I deployed our application components. Everything is running fine so far. So I would pin down the issue to be something in the upgrade to 1.14.n!

chreichert commented 5 years ago

@jackfrancis Today I found the cause of the issue. I compared the generated apimodel.json from the newly set-up (working) 1.14.1 cluster with the one from the upgraded (not working) cluster. One of the differences was the "--pod-max-pids" parameter in the kubeletConfig of kubernetesConfig, masterProfile, and all the agentPoolProfiles: in the newly set-up cluster it was set to "-1", while in the upgraded cluster it was set to "100".

So I did another test: before upgrading my (working) 1.13.5 cluster, I patched the apimodel.json and set all occurrences of "--pod-max-pids" to "-1". After that I did a normal upgrade using AKS-Engine 0.36.4, which completed without any errors. I deployed our application components and could not find any errors anymore. Everything is running fine on the upgraded cluster now.

So my conclusion is that the upgrade does not patch the default of "--pod-max-pids", which I believe must be set to "-1" in a "standard" 1.14.n cluster.

The workaround for upgrading clusters initially set up with ACS-Engine, or with AKS-Engine before 0.34.3 (where the change of this parameter was introduced, PR #1126), is to manually patch apimodel.json as described above before upgrading from 1.13.n to 1.14.n.
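
A sketch of what that manual patch can look like with jq; the paths follow the fields named in this thread and are not confirmed against every aks-engine version, so verify them against your own generated apimodel.json first.

```
# Set every "--pod-max-pids" in the generated apimodel.json to "-1" before
# running `aks-engine upgrade`. Paths are assumptions based on this thread.
jq '( .properties.orchestratorProfile.kubernetesConfig.kubeletConfig["--pod-max-pids"]
    , .properties.masterProfile.kubernetesConfig.kubeletConfig["--pod-max-pids"]
    , .properties.agentPoolProfiles[].kubernetesConfig.kubeletConfig["--pod-max-pids"]
    ) = "-1"' apimodel.json > apimodel.patched.json

# Quick check that no profile still pins the limit to 100:
grep -n 'pod-max-pids' apimodel.patched.json
```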

jackfrancis commented 5 years ago

@chreichert Thanks for the update! I think what's actually happening is this:

Hope that makes sense. We're saying the same thing functionally. I have a PR which should fix upgrade scenarios:

https://github.com/Azure/aks-engine/pull/1508

chreichert commented 5 years ago

/reopen

@jackfrancis Sorry, but the issue is still occurring after upgrade from 1.13.7 to 1.14.3 with AKS-Engine 0.37.3.

The PR only fixes one occurrence of "--pod-max-pids": properties.orchestratorProfile.kubernetesConfig.kubeletConfig.

The other occurrences in properties.masterProfile.kubernetesConfig.kubeletConfig and properties.agentPoolProfiles[*].kubernetesConfig still read "--pod-max-pids=100" after upgrade.

Workload pods still crash after upgrading with 0.37.3.

acs-bot commented 5 years ago

@chreichert: Reopened this issue.

In response to [this](https://github.com/Azure/aks-engine/issues/1270#issuecomment-505779484):

> /reopen
>
> @jackfrancis Sorry, but the issue is still occurring after upgrade from 1.13.7 to 1.14.3 with AKS-Engine 0.37.3
>
> The PR only fixes one occurrence of "--pod-max-pids": properties.orchestratorProfile.kubernetesConfig.kubeletConfig.
>
> The other occurrences in properties.masterProfile.kubernetesConfig.kubeletConfig and properties.agentPoolProfiles[*].kubernetesConfig still read "--pod-max-pids=100" after upgrade.
>
> Workload pods still crash after upgrading with 0.37.3.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
chreichert commented 5 years ago

@jackfrancis Can you please have a look again? The issue is still occurring after upgrade from 1.13.7 to 1.14.3 with AKS-Engine 0.37.3.

The PR only fixes one occurrence of "--pod-max-pids": properties.orchestratorProfile.kubernetesConfig.kubeletConfig.

The other occurrences in properties.masterProfile.kubernetesConfig.kubeletConfig and properties.agentPoolProfiles[*].kubernetesConfig still read "--pod-max-pids=100" after upgrade.

Workload pods still crash after upgrading with 0.37.3.
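
For anyone verifying this on their own cluster, here is a small sketch that lists every "--pod-max-pids" left in the post-upgrade apimodel.json together with the profile it belongs to. It is a generic jq query, not an aks-engine feature, and assumes the file layout described in this thread.

```
# List every "--pod-max-pids" in the generated apimodel.json together with its path,
# to see which profiles were (or were not) rewritten by the upgrade.
jq -r 'paths(scalars) as $p
       | select($p[-1] == "--pod-max-pids")
       | "\($p | join(".")): \(getpath($p))"' apimodel.json
```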

robinkb commented 5 years ago

I am seeing the same symptoms using AKS (non-engine) 1.14.3.