Azure / aks-engine

AKS Engine: legacy tool for Kubernetes on Azure (see status)
https://github.com/Azure/aks-engine
MIT License

Pods crash with: pthread_create() failed (11: Resource temporarily unavailable) after cluster upgrade #1270

Closed chreichert closed 5 years ago

chreichert commented 5 years ago

Is this a request for help?: Yes


Is this an ISSUE or FEATURE REQUEST? (choose one): Issue


What version of aks-engine?: 0.35.1


Kubernetes version: 1.14.1

What happened: After upgrading our QA cluster with AKS-Engine 0.35.1 from K8s 1.11.6 to 1.14.1 (via 1.12.8 and 1.13.5), workload pods no longer start, or they crash after a while with the error "pthread_create() failed (11: Resource temporarily unavailable)" or similar. Crashing pods include, for example, RabbitMQ and the Nginx-Ingress controller.

kubectl describe pod shows:

```
Name:               rabbitmq-rabbitmq-ha-0
Namespace:          qa
Priority:           0
PriorityClassName:
Node:               k8s-static-11480702-vmss000013/10.239.0.18
Start Time:         Thu, 09 May 2019 16:55:53 +0200
Labels:             component=rabbitmq
                    controller-revision-hash=rabbitmq-rabbitmq-ha-5cc8495b8f
                    statefulset.kubernetes.io/pod-name=rabbitmq-rabbitmq-ha-0
                    type=server
Annotations:        cni.projectcalico.org/podIP=10.244.92.4/32
Status:             Running
IP:                 10.244.92.4
Controlled By:      StatefulSet/rabbitmq-rabbitmq-ha
Init Containers:
  copy-rabbitmq-config:
    Container ID:  docker://5816dd3e044a0ddc497dc0de1cb3736020f9153ea1f38752e022a22ceb014877
    Image:         qnowsacr.azurecr.io/external/busybox:1.29.2@sha256:3058e3a1129c64da64d5c7889e6eedb0666262d7ee69b289f2d4379f69362383
    Image ID:      docker-pullable://qnowsacr.azurecr.io/external/busybox@sha256:3058e3a1129c64da64d5c7889e6eedb0666262d7ee69b289f2d4379f69362383
    Port:
    Host Port:
    Command:
      sh
      -c
      cp /configmap/* /etc/rabbitmq; rm -f /var/lib/rabbitmq/.erlang.cookie
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 09 May 2019 16:56:34 +0200
      Finished:     Thu, 09 May 2019 16:56:34 +0200
    Ready:          True
    Restart Count:  0
    Environment:
    Mounts:
      /configmap from configmap (rw)
      /etc/rabbitmq from config (rw)
      /var/lib/rabbitmq from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from rabbitmq-token-6qx8h (ro)
Containers:
  rabbitmq-ha:
    Container ID:   docker://3313cb9f1cbdbe67c5fb19bfb3c2ade0eaf95456917e9aa04636c1c5740b009b
    Image:          qnowsacr.azurecr.io/external/rabbitmq:3.7.8-management-alpine@sha256:062935e77e35e8e7d677decf841cabf0c7c84d80d3d8ea362ad612c2d3c05e70
    Image ID:       docker-pullable://qnowsacr.azurecr.io/external/rabbitmq@sha256:062935e77e35e8e7d677decf841cabf0c7c84d80d3d8ea362ad612c2d3c05e70
    Ports:          4369/TCP, 5672/TCP, 15672/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 09 May 2019 16:56:42 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     8
      memory:  55Gi
    Requests:
      cpu:     8
      memory:  55Gi
    Environment:
      MY_POD_NAME:             rabbitmq-rabbitmq-ha-0 (v1:metadata.name)
      RABBITMQ_USE_LONGNAME:   true
      RABBITMQ_NODENAME:       rabbit@$(MY_POD_NAME).rabbitmq-rabbitmq-ha-discovery.qa.svc.cluster.local
      K8S_HOSTNAME_SUFFIX:     .rabbitmq-rabbitmq-ha-discovery.qa.svc.cluster.local
      K8S_SERVICE_NAME:        rabbitmq-rabbitmq-ha-discovery
      RABBITMQ_ERLANG_COOKIE:  <set to the key 'rabbitmq-erlang-cookie' in secret 'rabbitmq-provided'>  Optional: false
      RABBITMQ_DEFAULT_USER:   <set to the key 'rabbitmq-admin-username' in secret 'rabbitmq-provided'>  Optional: false
      RABBITMQ_DEFAULT_PASS:   <set to the key 'rabbitmq-admin-password' in secret 'rabbitmq-provided'>  Optional: false
      RABBITMQ_DEFAULT_VHOST:  /
    Mounts:
      /etc/rabbitmq from config (rw)
      /var/lib/rabbitmq from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from rabbitmq-token-6qx8h (ro)
  rabbitmq-ha-exporter:
    Container ID:  docker://ffe6e5f755d9b77eb62addbccfe557756c21c73debb4274915e193f72fcbc4f6
    Image:         qnowsacr.azurecr.io/external/rabbitmq-exporter:v0.29.0@sha256:424c036132bfe7f31674eb9a4d0c60395ec6fd794ab08e5eda6f206e13984b21
    Image ID:      docker-pullable://qnowsacr.azurecr.io/external/rabbitmq-exporter@sha256:424c036132bfe7f31674eb9a4d0c60395ec6fd794ab08e5eda6f206e13984b21
    Port:          9419/TCP
    Host Port:     0/TCP
    State:         Waiting
      Reason:      CrashLoopBackOff
    Last State:    Terminated
      Reason:      ContainerCannotRun
      Message:     OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"read init-p: connection reset by peer\"": unknown
      Exit Code:   128
      Started:     Thu, 09 May 2019 16:56:58 +0200
      Finished:    Thu, 09 May 2019 16:56:58 +0200
    Ready:          False
    Restart Count:  2
    Environment:
      PUBLISH_PORT:         9419
      RABBIT_CAPABILITIES:  bert,no_sort
      RABBIT_USER:          admin
      RABBIT_PASSWORD:      <set to the key 'rabbitmq-admin-password' in secret 'rabbitmq-provided'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from rabbitmq-token-6qx8h (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-rabbitmq-rabbitmq-ha-0
    ReadOnly:   false
  config:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  configmap:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rabbitmq
    Optional:  false
  rabbitmq-token-6qx8h:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rabbitmq-token-6qx8h
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  agentpool=static
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                From                                      Message
  Normal   Scheduled               1m                 default-scheduler                         Successfully assigned qa/rabbitmq-rabbitmq-ha-0 to k8s-static-11480702-vmss000013
  Normal   SuccessfulAttachVolume  1m                 attachdetach-controller                   AttachVolume.Attach succeeded for volume "pvc-3f916d40-5615-11e9-9006-000d3ab8e732"
  Normal   Pulling                 52s                kubelet, k8s-static-11480702-vmss000013   Pulling image "qnowsacr.azurecr.io/external/busybox:1.29.2@sha256:3058e3a1129c64da64d5c7889e6eedb0666262d7ee69b289f2d4379f69362383"
  Normal   Pulled                  51s                kubelet, k8s-static-11480702-vmss000013   Successfully pulled image "qnowsacr.azurecr.io/external/busybox:1.29.2@sha256:3058e3a1129c64da64d5c7889e6eedb0666262d7ee69b289f2d4379f69362383"
  Normal   Created                 51s                kubelet, k8s-static-11480702-vmss000013   Created container copy-rabbitmq-config
  Normal   Started                 51s                kubelet, k8s-static-11480702-vmss000013   Started container copy-rabbitmq-config
  Normal   Pulling                 50s                kubelet, k8s-static-11480702-vmss000013   Pulling image "qnowsacr.azurecr.io/external/rabbitmq:3.7.8-management-alpine@sha256:062935e77e35e8e7d677decf841cabf0c7c84d80d3d8ea362ad612c2d3c05e70"
  Normal   Pulled                  45s                kubelet, k8s-static-11480702-vmss000013   Successfully pulled image "qnowsacr.azurecr.io/external/rabbitmq:3.7.8-management-alpine@sha256:062935e77e35e8e7d677decf841cabf0c7c84d80d3d8ea362ad612c2d3c05e70"
  Normal   Created                 44s                kubelet, k8s-static-11480702-vmss000013   Created container rabbitmq-ha
  Normal   Started                 43s                kubelet, k8s-static-11480702-vmss000013   Started container rabbitmq-ha
  Normal   Pulling                 27s (x3 over 43s)  kubelet, k8s-static-11480702-vmss000013   Pulling image "qnowsacr.azurecr.io/external/rabbitmq-exporter:v0.29.0@sha256:424c036132bfe7f31674eb9a4d0c60395ec6fd794ab08e5eda6f206e13984b21"
  Normal   Pulled                  27s (x3 over 42s)  kubelet, k8s-static-11480702-vmss000013   Successfully pulled image "qnowsacr.azurecr.io/external/rabbitmq-exporter:v0.29.0@sha256:424c036132bfe7f31674eb9a4d0c60395ec6fd794ab08e5eda6f206e13984b21"
  Normal   Created                 27s (x3 over 42s)  kubelet, k8s-static-11480702-vmss000013   Created container rabbitmq-ha-exporter
  Warning  Failed                  26s (x3 over 41s)  kubelet, k8s-static-11480702-vmss000013   Error: failed to start container "rabbitmq-ha-exporter": Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"read init-p: connection reset by peer\"": unknown
  Warning  BackOff                 8s (x3 over 12s)   kubelet, k8s-static-11480702-vmss000013   Back-off restarting failed container
```

Most of the system pods run, but some of them (Calico, for example) crash too.
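
For anyone triaging this: `pthread_create()` returning EAGAIN inside a container usually points at a thread/PID ceiling rather than a lack of memory. A rough diagnostic sketch follows; it is not part of the original report, and the cgroup path is an assumption that varies by OS and kubelet version.

```
# Run on an affected node (sketch only; paths are assumptions).

# Which per-pod PID limit is the kubelet actually enforcing?
ps -ef | grep -o '\-\-pod-max-pids=[0-9-]*' | head -1

# Per-pod pids cgroups; a small ceiling such as 100 here would explain
# pthread_create() failing with EAGAIN inside busy pods like RabbitMQ.
find /sys/fs/cgroup/pids/kubepods -name pids.max | head -5 | xargs grep -H .
```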

Cluster was initially set up with ACS-Engine 0.24.1 (k8s 1.10.9) and upgraded successfully to k8s 1.11.6 with AKS-Engine 0.29.1.

What you expected to happen:

The cluster running normally with our workloads, which used to run fine until upgrading with 0.35.1.

How to reproduce it (as minimally and precisely as possible): initial setup of the cluster with acs-engine 0.24.1, then:

- Upgrade to 1.11.5 with AKS-Engine 0.29.1 (successful)
- Upgrade to 1.11.6 with AKS-Engine 0.29.1 (successful)
- Upgrade to 1.14.1 via 1.12.8 and 1.13.5 (three steps) with AKS-Engine 0.35.1
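
For context, each of those hops is a separate `aks-engine upgrade` run; a sketch of one such invocation is shown below. All values are placeholders, not taken from this issue.

```
# Sketch of a single upgrade step; repeat with the next --upgrade-version for each hop.
aks-engine upgrade \
  --api-model _output/<dns-prefix>/apimodel.json \
  --location <azure-region> \
  --resource-group <resource-group> \
  --subscription-id <subscription-id> \
  --client-id <service-principal-id> \
  --client-secret <service-principal-secret> \
  --upgrade-version 1.12.8
```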

Anything else we need to know: Luckily this was our test upgrade on our staging environment before doing the actual upgrade of our PROD environment.

chreichert commented 5 years ago

Comments or ideas, anybody? This issue prevents us from upgrading our PROD environment at the moment. Help is very much appreciated.

jackfrancis commented 5 years ago

Hi @chreichert, could you paste the following output from your cluster?

Thanks!

adamlundrigan commented 5 years ago

We had an AKS cluster running k8s 1.13.5, built using Terraform, which we upgraded to 1.14.0 over the weekend. Now the MongoDB replica set (chart), which ran fine on 1.13.5, explodes under the tiniest load with this same error.

Some logs showing the failure ``` 2019-05-20T23:47:46.158+0000 I REPL [replexec-2] Starting an election, since we've seen no PRIMARY in the past 10000ms 2019-05-20T23:47:46.158+0000 I REPL [replexec-0] VoteRequester(term 116 dry run) received a yes vote from mongo-mongodb-replicaset-0.mongo-mongodb-replicaset.common.svc.cluster.local:27017 response message: { term: 116, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1558395741, 1), $clusterTime: { clusterTime: Timestamp(1558395741, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } } 2019-05-20T23:47:46.158+0000 I REPL [replexec-0] dry election run succeeded, running for election in term 117 2019-05-20T23:47:46.158+0000 I REPL [replexec-2] conducting a dry run election to see if we could be elected. current term: 116 2019-05-20T23:47:46.160+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to mongo-mongodb-replicaset-1.mongo-mongodb-replicaset.common.svc.cluster.local:27017 - HostUnreachable: Connection refused 2019-05-20T23:47:46.160+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to mongo-mongodb-replicaset-1.mongo-mongodb-replicaset.common.svc.cluster.local:27017 due to failed operation on a connection 2019-05-20T23:47:46.163+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mongo-mongodb-replicaset-1.mongo-mongodb-replicaset.common.svc.cluster.local:27017 2019-05-20T23:47:46.164+0000 I REPL [replexec-2] VoteRequester(term 117) failed to receive response from mongo-mongodb-replicaset-1.mongo-mongodb-replicaset.common.svc.cluster.local:27017: HostUnreachable: Connection refused 2019-05-20T23:47:46.164+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to mongo-mongodb-replicaset-1.mongo-mongodb-replicaset.common.svc.cluster.local:27017 - HostUnreachable: Connection refused 2019-05-20T23:47:46.164+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to mongo-mongodb-replicaset-1.mongo-mongodb-replicaset.common.svc.cluster.local:27017 due to failed operation on a connection 2019-05-20T23:47:46.168+0000 I REPL [replexec-1] transition to PRIMARY from SECONDARY 2019-05-20T23:47:46.168+0000 I REPL [replexec-1] VoteRequester(term 117) received a yes vote from mongo-mongodb-replicaset-0.mongo-mongodb-replicaset.common.svc.cluster.local:27017 response message: { term: 117, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1558395741, 1), $clusterTime: { clusterTime: Timestamp(1558395741, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } } 2019-05-20T23:47:46.168+0000 F - [replexec-1] terminate() called. 
An exception is active attempting to gather more information 2019-05-20T23:47:46.168+0000 I REPL [replexec-1] Resetting sync source to empty, which was :27017 2019-05-20T23:47:46.168+0000 I REPL [replexec-1] election succeeded, assuming primary role in term 117 mongod(_ZN5mongo10ThreadPool25_startWorkerThread_inlockEv+0x99F) [0x55a256b38b5f] ----- END BACKTRACE ----- mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x55a2573635a1] mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl22_onVoteRequestCompleteEx+0x2BE) [0x55a256235f4e] mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl24_cancelHeartbeats_inlockEv+0xE1) [0x55a256237241] mongod(_ZN10__cxxabiv111__terminateEPFvvE+0x6) [0x55a2574575f6] {"backtrace":[{"b":"55A2550F9000","o":"226A5A1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"55A2550F9000","o":"2269F85"},{"b":"55A2550F9000","o":"235E5F6","s":"_ZN10__cxxabiv111__terminateEPFvvE"},{"b":"55A2550F9000","o":"235E641"},{"b":"55A2550F9000","o":"1A3FB5F","s":"_ZN5mongo10ThreadPool25_startWorkerThread_inlockEv"},{"b":"55A2550F9000","o":"1A403B8","s":"_ZN5mongo10ThreadPool8scheduleESt8functionIFvvEE"},{"b":"55A2550F9000","o":"1CE472C","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPNSt7__cxx114listISt10shared_ptrINS1_13CallbackStateEESaIS6_EEERKSt14_List_iteratorIS6_ESD_St11unique_lockISt5mutexE"},{"b":"55A2550F9000","o":"1CE4F2D","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPNSt7__cxx114listISt10shared_ptrINS1_13CallbackStateEESaIS6_EEERKSt14_List_iteratorIS6_ESt11unique_lockISt5mutexE"},{"b":"55A2550F9000","o":"1CE5EAE","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor6cancelERKNS0_12TaskExecutor14CallbackHandleE"},{"b":"55A2550F9000","o":"113E241","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl24_cancelHeartbeats_inlockEv"},{"b":"55A2550F9000","o":"1144671","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl25_restartHeartbeats_inlockEv"},{"b":"55A2550F9000","o":"1129C2E","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl40_postWonElectionUpdateMemberState_inlockEv"},{"b":"55A2550F9000","o":"113CF4E","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl22_onVoteRequestCompleteEx"},{"b":"55A2550F9000","o":"1CE3AB3","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE"},{"b":"55A2550F9000","o":"1CE3F9B"},{"b":"55A2550F9000","o":"1A3C34C","s":"_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockISt5mutexE"},{"b":"55A2550F9000","o":"1A3C84C","s":"_ZN5mongo10ThreadPool13_consumeTasksEv"},{"b":"55A2550F9000","o":"1A3D236","s":"_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"},{"b":"55A2550F9000","o":"2379850"},{"b":"7FA82DAE0000","o":"76BA"},{"b":"7FA82D716000","o":"10741D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.6.12", "gitVersion" : "c2b9acad0248ca06b14ef1640734b5d0595b55f1", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.15.0-1042-azure", "version" : "#46-Ubuntu SMP Thu Apr 4 16:30:23 UTC 2019", "machine" : "x86_64" }, "somap" : [ { "b" : "55A2550F9000", "elfType" : 3, "buildId" : "2B5EE1E50AC12CC569CE7CD8B7812FF349257B77" }, { "b" : "7FFFBF1E6000", "elfType" : 3, "buildId" : "DD321E9190D9BD55E4CD0080B2F9A163099EBD04" }, { "b" : "7FA82ECD6000", "path" : "/lib/x86_64-linux-gnu/libresolv.so.2", "elfType" : 3, "buildId" : "50A923F8DAFECBCD969C8573116A38C18D0E24D5" }, { "b" : "7FA82E891000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "15FFEB43278726B025F020862BF51302822A40EC" }, { "b" : 
"7FA82E628000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "FF69EA60EBE05F2DD689D2B26FC85A73E5FBC3A0" }, { "b" : "7FA82E424000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "37BFC3D8F7E3B022DAC7943B1A5FACD40CEBF0AD" }, { "b" : "7FA82E21C000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "69143E8B39040C964D3958490535322675F15DD3" }, { "b" : "7FA82DF13000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "BAD67A84E56E73D031AE507261DA066B35949D34" }, { "b" : "7FA82DCFD000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "68220AE2C65D65C1B6AAA12FA6765A6EC2F5F434" }, { "b" : "7FA82DAE0000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "B17C21299099640A6D863E423D99265824E7BB16" }, { "b" : "7FA82D716000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "1CA54A6E0D76188105B12E49FE6B8019BF08803A" }, { "b" : "7FA82EEF1000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "C0ADBAD6F9A33944F2B3567C078EC472A1DAE98E" } ] }} mongod(_ZN5mongo10ThreadPool13_consumeTasksEv+0xBC) [0x55a256b3584c] mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl25_restartHeartbeats_inlockEv+0x11) [0x55a25623d671] mongod(+0x2379850) [0x55a257472850] mongod(+0x2269F85) [0x55a257362f85] libc.so.6(clone+0x6D) [0x7fa82d81d41d] 2019-05-20T23:47:46.188+0000 F - [replexec-1] std::exception::what(): Resource temporarily unavailable mongod(_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x96) [0x55a256b36236] mongod(_ZN5mongo8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPNSt7__cxx114listISt10shared_ptrINS1_13CallbackStateEESaIS6_EEERKSt14_List_iteratorIS6_ESD_St11unique_lockISt5mutexE+0x24C) [0x55a256ddd72c] mongod(+0x235E641) [0x55a257457641] mongod(_ZN5mongo10ThreadPool8scheduleESt8functionIFvvEE+0x398) [0x55a256b393b8] libpthread.so.0(+0x76BA) [0x7fa82dae76ba] mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl40_postWonElectionUpdateMemberState_inlockEv+0x15E) [0x55a256222c2e] mongod(_ZN5mongo8executor22ThreadPoolTaskExecutor6cancelERKNS0_12TaskExecutor14CallbackHandleE+0x14E) [0x55a256ddeeae] mongod(+0x1CE3F9B) [0x55a256ddcf9b] mongod(_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE+0x1B3) [0x55a256ddcab3] mongod(_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockISt5mutexE+0x14C) [0x55a256b3534c] Actual exception type: std::system_error 0x55a2573635a1 0x55a257362f85 0x55a2574575f6 0x55a257457641 0x55a256b38b5f 0x55a256b393b8 0x55a256ddd72c 0x55a256dddf2d 0x55a256ddeeae 0x55a256237241 0x55a25623d671 0x55a256222c2e 0x55a256235f4e 0x55a256ddcab3 0x55a256ddcf9b 0x55a256b3534c 0x55a256b3584c 0x55a256b36236 0x55a257472850 0x7fa82dae76ba 0x7fa82d81d41d mongod(_ZN5mongo8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPNSt7__cxx114listISt10shared_ptrINS1_13CallbackStateEESaIS6_EEERKSt14_List_iteratorIS6_ESt11unique_lockISt5mutexE+0x4D) [0x55a256dddf2d] ----- BEGIN BACKTRACE ----- ```
kubectl get nodes -o wide

```
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
aks-xxxprodpool-29478185-0 Ready agent 47h v1.14.0 10.20.0.66 Ubuntu 16.04.6 LTS 4.15.0-1042-azure docker://3.0.4
aks-xxxprodpool-29478185-1 Ready agent 47h v1.14.0 10.20.0.35 Ubuntu 16.04.6 LTS 4.15.0-1042-azure docker://3.0.4
aks-xxxprodpool-29478185-2 Ready agent 47h v1.14.0 10.20.0.4 Ubuntu 16.04.6 LTS 4.15.0-1042-azure docker://3.0.4
aks-xxxprodpool-29478185-4 Ready agent 2d v1.14.0 10.20.0.129 Ubuntu 16.04.6 LTS 4.15.0-1042-azure docker://3.0.4
```
kubectl get pods --all-namespaces _I had already scaled the StatefulSet running the MongoDB down to zero at this point_ ``` NAMESPACE NAME READY STATUS RESTARTS AGE tenant1 sso-app-5d66f5647-b5wxg 1/1 Running 0 2d tenant1 sso-web-6cfb7767f7-hppdp 1/1 Running 0 47h tenant1 www-app-85969b4d84-qhjrg 1/1 Running 0 47h tenant1 www-web-7b9f9dbf7c-92gxg 1/1 Running 0 47h common redis-master-0 1/1 Running 0 2d common redis-slave-0 1/1 Running 1 2d common redis-slave-1 1/1 Running 0 47h tenant0 sso-app-bc87d4995-dcb5q 1/1 Running 0 47h tenant0 sso-web-6cfb7767f7-cmbb9 1/1 Running 0 2d tenant0 www-app-787c868777-58mbt 1/1 Running 0 47h tenant0 www-web-5d5455bf44-2fdjx 1/1 Running 0 2d kube-system azure-cni-networkmonitor-cgtxj 1/1 Running 0 47h kube-system azure-cni-networkmonitor-d4cfg 1/1 Running 0 2d kube-system azure-cni-networkmonitor-ssmrx 1/1 Running 0 2d kube-system azure-cni-networkmonitor-vwjmf 1/1 Running 0 47h kube-system azure-ip-masq-agent-47cv8 1/1 Running 0 47h kube-system azure-ip-masq-agent-ghwkv 1/1 Running 0 47h kube-system azure-ip-masq-agent-klfnr 1/1 Running 0 2d kube-system azure-ip-masq-agent-zvsgr 1/1 Running 0 2d kube-system coredns-74d5c9d599-6mnq7 1/1 Running 0 47h kube-system coredns-74d5c9d599-qqqbr 1/1 Running 0 47h kube-system coredns-autoscaler-6946b57db6-jnlfw 1/1 Running 0 47h kube-system kube-proxy-8d5sq 1/1 Running 0 47h kube-system kube-proxy-g5q5h 1/1 Running 0 2d kube-system kube-proxy-kjw22 1/1 Running 0 47h kube-system kube-proxy-m6ppb 1/1 Running 0 2d kube-system kube-svc-redirect-bf47c 2/2 Running 0 2d kube-system kube-svc-redirect-fjd9p 2/2 Running 0 47h kube-system kube-svc-redirect-kfcqt 2/2 Running 0 47h kube-system kube-svc-redirect-lfd54 2/2 Running 0 2d kube-system kubernetes-dashboard-c4f4999c8-lxxc9 1/1 Running 2 47h kube-system metrics-server-766dd9f7fd-v457r 1/1 Running 0 2d kube-system nginx-ingress-controller-65c869bb6d-clkxf 1/1 Running 0 2d kube-system nginx-ingress-controller-65c869bb6d-h6t7l 1/1 Running 0 2d kube-system nginx-ingress-default-backend-647c8f49bb-8tnsn 1/1 Running 0 2d kube-system omsagent-8tzfs 1/1 Running 0 47h kube-system omsagent-j2wrh 1/1 Running 0 2d kube-system omsagent-m9d6f 1/1 Running 0 2d kube-system omsagent-rs-79f67c9ffc-zx9l6 1/1 Running 0 47h kube-system omsagent-xc5nd 1/1 Running 1 47h kube-system tiller-deploy-664d6bdc7b-7zbkl 1/1 Running 0 47h kube-system tunnelfront-98dc59889-nx7kz 1/1 Running 0 2d tenant3 sso-app-76bd57569-gm44b 1/1 Running 0 47h tenant3 sso-web-6cfb7767f7-4hdhg 1/1 Running 0 2d tenant3 www-app-8659dd67bd-4jjbr 1/1 Running 0 2d tenant3 www-web-fbd7d8f47-566hv 1/1 Running 0 2d tenantshared cms-app-7d9bdc7f9-4jmwg 1/1 Running 0 47h tenantshared cms-app-7d9bdc7f9-4q54j 1/1 Running 0 47h tenantshared cms-app-7d9bdc7f9-pvnhf 1/1 Running 0 47h tenantshared cms-app-7d9bdc7f9-rt6m5 1/1 Running 1 2d tenantshared cms-web-5f698b7f58-xrbs4 1/1 Running 0 47h tenantshared cms-web-5f698b7f58-zdzqt 1/1 Running 0 2d tenantshared crm-app-564c9dcc89-6xv4w 1/1 Running 0 47h tenantshared crm-app-564c9dcc89-pnmj8 1/1 Running 0 47h tenantshared crm-app-564c9dcc89-twsms 1/1 Running 1 2d tenantshared crm-app-564c9dcc89-wb8xq 1/1 Running 0 47h tenantshared crm-memcached-0 1/1 Running 0 47h tenantshared crm-memcached-1 1/1 Running 0 47h tenantshared crm-web-598cf986d-z8zdw 1/1 Running 0 47h tenantshared lrs-api-55779fd8b8-fqkh7 1/1 Running 0 5h32m tenantshared lrs-api-55779fd8b8-lcwhg 1/1 Running 0 9h tenantshared lrs-app-7dcf7858bb-jz97l 1/1 Running 0 9h tenantshared lrs-app-7dcf7858bb-vmps7 1/1 Running 0 5h33m 
tenantshared lrs-web-57d7786dc-cjlmf 1/1 Running 0 5h33m tenantshared lrs-web-57d7786dc-fktt7 1/1 Running 0 47h tenantshared lrs-worker-865545fc94-xmgbn 1/1 Running 0 5h33m tenantshared lrs-worker-865545fc94-xzqd2 1/1 Running 0 9h tenantshared lrs-xapi-787db988c-bclb5 1/1 Running 0 9h tenantshared lrs-xapi-787db988c-d5kxg 1/1 Running 0 5h35m tenantshared lrs-xapi-787db988c-mcxgb 1/1 Running 0 5h35m tenant2 sso-app-7cf76596df-tg9g5 1/1 Running 0 47h tenant2 sso-web-6cfb7767f7-s56lk 1/1 Running 0 2d tenant2 www-app-6b8c5574c-nqrd7 1/1 Running 0 47h tenant2 www-web-78b6566986-mjhr7 1/1 Running 0 2d ```
chreichert commented 5 years ago

Unfortunately I killed my cluster trying to downgrade. I will try to revive the cluster with an upgrade using 0.36.0 in the next few days. I will report the results here then.

The following is what I found in my shell history; unfortunately there is no "nodes -o wide":

kubectl get nodes

```
NAME STATUS ROLES AGE VERSION
k8s-dynamic-11480702-vmss000000 Ready agent 194d v1.14.1
k8s-dynamic-11480702-vmss000001 Ready agent 194d v1.14.1
k8s-dynamic-11480702-vmss000002 Ready agent 194d v1.14.1
k8s-dynamic-11480702-vmss000003 Ready agent 194d v1.14.1
k8s-dynamic-11480702-vmss0000ks Ready agent 44d v1.14.1
k8s-dynamic-11480702-vmss0000kt Ready agent 44d v1.14.1
k8s-dynamic-11480702-vmss0000md Ready agent 35d v1.14.1
k8s-elastic-11480702-vmss000000 Ready agent 44d v1.14.1
k8s-elastic-11480702-vmss000001 Ready agent 44d v1.14.1
k8s-elastic-11480702-vmss000002 Ready agent 44d v1.14.1
k8s-elastic-11480702-vmss000003 Ready agent 44d v1.14.1
k8s-elastic-11480702-vmss000004 Ready agent 44d v1.14.1
k8s-graph-11480702-vmss000000 Ready agent 194d v1.14.1
k8s-master-11480702-0 Ready master 8m11s v1.14.1
k8s-master-11480702-1 Ready master 4h8m v1.14.1
k8s-master-11480702-2 Ready master 5h10m v1.14.1
k8s-static-11480702-vmss000000 Ready agent 194d v1.14.1
k8s-static-11480702-vmss000001 Ready agent 194d v1.14.1
k8s-static-11480702-vmss000002 Ready agent 194d v1.14.1
k8s-static-11480702-vmss000003 Ready agent 194d v1.14.1
k8s-static-11480702-vmss000004 Ready agent 194d v1.14.1
k8s-static-11480702-vmss000005 Ready agent 194d v1.14.1
k8s-static-11480702-vmss000006 Ready agent 194d v1.14.1
k8s-static-11480702-vmss000013 Ready agent 35d v1.14.1
```

kubectl get pods --all-namespaces

``` NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES default omsagent-msoms-24d2c 1/1 Running 0 19h 10.244.2.2 k8s-elastic-11480702-vmss000001 default omsagent-msoms-25wh2 1/1 Running 0 19h 10.244.4.2 k8s-elastic-11480702-vmss000003 default omsagent-msoms-2tdlf 1/1 Running 2 19h 10.244.104.2 k8s-dynamic-11480702-vmss0000md default omsagent-msoms-645ks 1/1 Running 0 19h 10.244.6.2 k8s-elastic-11480702-vmss000002 default omsagent-msoms-7fsg5 1/1 Running 0 19h 10.244.20.2 k8s-static-11480702-vmss000005 default omsagent-msoms-7hfpd 1/1 Running 0 19h 10.244.1.2 k8s-elastic-11480702-vmss000000 default omsagent-msoms-ch2fn 1/1 Running 0 19h 10.244.5.4 k8s-static-11480702-vmss000004 default omsagent-msoms-d8mnv 1/1 Running 0 19h 10.244.7.2 k8s-static-11480702-vmss000002 default omsagent-msoms-df49h 1/1 Running 1 19h 10.244.22.3 k8s-dynamic-11480702-vmss000002 default omsagent-msoms-dtvlk 1/1 Running 0 19h 10.244.92.2 k8s-static-11480702-vmss000013 default omsagent-msoms-fsdfl 1/1 Running 0 19h 10.244.0.2 k8s-elastic-11480702-vmss000004 default omsagent-msoms-lc8wk 1/1 Running 0 151m 10.244.3.2 k8s-dynamic-11480702-vmss0000oh default omsagent-msoms-lnxxw 1/1 Running 1 19h 10.244.27.3 k8s-graph-11480702-vmss000000 default omsagent-msoms-mxdqw 1/1 Running 1 151m 10.244.23.2 k8s-dynamic-11480702-vmss0000of default omsagent-msoms-n65j2 1/1 Running 1 19h 10.244.10.2 k8s-dynamic-11480702-vmss0000ks default omsagent-msoms-nxjdn 1/1 Running 0 19h 10.244.9.2 k8s-static-11480702-vmss000000 default omsagent-msoms-qvvkv 1/1 Running 0 19h 10.244.17.2 k8s-static-11480702-vmss000001 default omsagent-msoms-rl48s 1/1 Running 1 150m 10.244.25.2 k8s-dynamic-11480702-vmss0000og default omsagent-msoms-sbptz 1/1 Running 0 143m 10.244.21.2 k8s-static-11480702-vmss000014 default omsagent-msoms-tjd86 1/1 Running 0 19h 10.244.14.2 k8s-dynamic-11480702-vmss000001 default omsagent-msoms-zgxpj 1/1 Running 0 143m 10.244.11.2 k8s-static-11480702-vmss000015 kube-system azure-ip-masq-agent-4z52z 1/1 Running 0 19h 10.239.0.27 k8s-static-11480702-vmss000005 kube-system azure-ip-masq-agent-5625l 1/1 Running 0 19h 10.239.0.5 k8s-elastic-11480702-vmss000000 kube-system azure-ip-masq-agent-5dg5v 1/1 Running 0 19h 10.239.0.17 k8s-elastic-11480702-vmss000004 kube-system azure-ip-masq-agent-5nxnq 1/1 Running 0 151m 10.239.0.10 k8s-dynamic-11480702-vmss0000og kube-system azure-ip-masq-agent-bdg5h 1/1 Running 0 19h 10.239.0.8 k8s-dynamic-11480702-vmss000001 kube-system azure-ip-masq-agent-brpqr 1/1 Running 0 145m 10.239.0.29 k8s-static-11480702-vmss000015 kube-system azure-ip-masq-agent-bsmww 1/1 Running 0 19h 10.239.255.11 k8s-master-11480702-1 kube-system azure-ip-masq-agent-bvr9j 1/1 Running 0 19h 10.239.0.26 k8s-static-11480702-vmss000004 kube-system azure-ip-masq-agent-c56xv 1/1 Running 0 151m 10.239.0.14 k8s-dynamic-11480702-vmss0000oh kube-system azure-ip-masq-agent-ckxgd 1/1 Running 0 19h 10.239.0.16 k8s-elastic-11480702-vmss000003 kube-system azure-ip-masq-agent-djs8p 1/1 Running 0 19h 10.239.0.9 k8s-dynamic-11480702-vmss000002 kube-system azure-ip-masq-agent-dp79d 1/1 Running 0 19h 10.239.0.6 k8s-elastic-11480702-vmss000001 kube-system azure-ip-masq-agent-g2t4r 1/1 Running 0 19h 10.239.255.12 k8s-master-11480702-2 kube-system azure-ip-masq-agent-gf8c4 1/1 Running 0 19h 10.239.0.12 k8s-static-11480702-vmss000001 kube-system azure-ip-masq-agent-gqflb 1/1 Running 0 19h 10.239.0.13 k8s-dynamic-11480702-vmss0000ks kube-system azure-ip-masq-agent-jjhfd 1/1 Running 0 19h 10.239.0.20 
k8s-dynamic-11480702-vmss0000md kube-system azure-ip-masq-agent-llktn 1/1 Running 0 19h 10.239.0.4 k8s-graph-11480702-vmss000000 kube-system azure-ip-masq-agent-lmf95 1/1 Running 0 151m 10.239.0.7 k8s-dynamic-11480702-vmss0000of kube-system azure-ip-masq-agent-p7sww 1/1 Running 0 145m 10.239.0.24 k8s-static-11480702-vmss000014 kube-system azure-ip-masq-agent-qzz2d 1/1 Running 0 19h 10.239.0.15 k8s-elastic-11480702-vmss000002 kube-system azure-ip-masq-agent-sxtfx 1/1 Running 0 19h 10.239.0.23 k8s-static-11480702-vmss000002 kube-system azure-ip-masq-agent-t45d2 1/1 Running 0 19h 10.239.0.18 k8s-static-11480702-vmss000013 kube-system azure-ip-masq-agent-xv6dd 1/1 Running 0 19h 10.239.0.11 k8s-static-11480702-vmss000000 kube-system azure-ip-masq-agent-znnd8 1/1 Running 0 4h46m 10.239.255.10 k8s-master-11480702-0 kube-system calico-node-4pl2p 1/1 Running 0 19h 10.239.0.12 k8s-static-11480702-vmss000001 kube-system calico-node-4qk46 1/1 Running 0 19h 10.239.0.27 k8s-static-11480702-vmss000005 kube-system calico-node-6fkgv 1/1 Running 0 19h 10.239.0.8 k8s-dynamic-11480702-vmss000001 kube-system calico-node-9p6b8 1/1 Running 0 151m 10.239.0.14 k8s-dynamic-11480702-vmss0000oh kube-system calico-node-9wqqj 1/1 Running 0 151m 10.239.0.7 k8s-dynamic-11480702-vmss0000of kube-system calico-node-bf76m 0/1 Running 0 19h 10.239.0.4 k8s-graph-11480702-vmss000000 kube-system calico-node-dff49 1/1 Running 0 19h 10.239.0.16 k8s-elastic-11480702-vmss000003 kube-system calico-node-dwlgt 1/1 Running 0 151m 10.239.0.10 k8s-dynamic-11480702-vmss0000og kube-system calico-node-f7hnw 1/1 Running 0 4h46m 10.239.255.10 k8s-master-11480702-0 kube-system calico-node-h6cwh 0/1 Running 0 19h 10.239.0.5 k8s-elastic-11480702-vmss000000 kube-system calico-node-j8dpz 1/1 Running 0 19h 10.239.0.6 k8s-elastic-11480702-vmss000001 kube-system calico-node-jkwsz 1/1 Running 0 19h 10.239.0.11 k8s-static-11480702-vmss000000 kube-system calico-node-kd7qs 1/1 Running 0 146m 10.239.0.24 k8s-static-11480702-vmss000014 kube-system calico-node-krz9p 1/1 Running 0 19h 10.239.0.23 k8s-static-11480702-vmss000002 kube-system calico-node-ltvp7 1/1 Running 0 19h 10.239.0.13 k8s-dynamic-11480702-vmss0000ks kube-system calico-node-ms454 1/1 Running 0 19h 10.239.0.17 k8s-elastic-11480702-vmss000004 kube-system calico-node-ntsrt 1/1 Running 0 19h 10.239.0.26 k8s-static-11480702-vmss000004 kube-system calico-node-pkps4 1/1 Running 0 19h 10.239.0.20 k8s-dynamic-11480702-vmss0000md kube-system calico-node-q8jsj 1/1 Running 0 19h 10.239.255.11 k8s-master-11480702-1 kube-system calico-node-vs2vn 1/1 Running 0 146m 10.239.0.29 k8s-static-11480702-vmss000015 kube-system calico-node-xfwbv 1/1 Running 0 19h 10.239.0.18 k8s-static-11480702-vmss000013 kube-system calico-node-xz6p9 1/1 Running 0 19h 10.239.255.12 k8s-master-11480702-2 kube-system calico-node-zgdkv 1/1 Running 0 19h 10.239.0.9 k8s-dynamic-11480702-vmss000002 kube-system calico-node-zp885 1/1 Running 0 19h 10.239.0.15 k8s-elastic-11480702-vmss000002 kube-system calico-typha-5bd7cfc5fb-9nf47 1/1 Running 0 150m 10.239.0.26 k8s-static-11480702-vmss000004 kube-system calico-typha-5bd7cfc5fb-t5mgt 1/1 Running 0 19h 10.239.0.13 k8s-dynamic-11480702-vmss0000ks kube-system calico-typha-horizontal-autoscaler-b4f9b8fdc-7hx2p 1/1 Running 0 19h 10.244.5.2 k8s-static-11480702-vmss000004 kube-system coredns-7bc69b7975-wrrkr 1/1 Running 0 19h 10.244.14.3 k8s-dynamic-11480702-vmss000001 kube-system heapster-66bb49cb7b-vxmsx 2/2 Running 1 150m 10.244.0.3 k8s-elastic-11480702-vmss000004 kube-system 
kube-addon-manager-k8s-master-11480702-0 1/1 Running 0 4h45m 10.239.255.10 k8s-master-11480702-0 kube-system kube-addon-manager-k8s-master-11480702-1 1/1 Running 0 23h 10.239.255.11 k8s-master-11480702-1 kube-system kube-addon-manager-k8s-master-11480702-2 1/1 Running 0 23h 10.239.255.12 k8s-master-11480702-2 kube-system kube-apiserver-k8s-master-11480702-0 1/1 Running 0 4h45m 10.239.255.10 k8s-master-11480702-0 kube-system kube-apiserver-k8s-master-11480702-1 1/1 Running 0 23h 10.239.255.11 k8s-master-11480702-1 kube-system kube-apiserver-k8s-master-11480702-2 1/1 Running 0 23h 10.239.255.12 k8s-master-11480702-2 kube-system kube-controller-manager-k8s-master-11480702-0 1/1 Running 0 4h45m 10.239.255.10 k8s-master-11480702-0 kube-system kube-controller-manager-k8s-master-11480702-1 1/1 Running 0 23h 10.239.255.11 k8s-master-11480702-1 kube-system kube-controller-manager-k8s-master-11480702-2 1/1 Running 0 23h 10.239.255.12 k8s-master-11480702-2 kube-system kube-proxy-2xkd7 1/1 Running 0 19h 10.239.0.5 k8s-elastic-11480702-vmss000000 kube-system kube-proxy-4w4qj 1/1 Running 0 19h 10.239.0.23 k8s-static-11480702-vmss000002 kube-system kube-proxy-59jmc 1/1 Running 0 19h 10.239.0.18 k8s-static-11480702-vmss000013 kube-system kube-proxy-62rdx 1/1 Running 0 151m 10.239.0.7 k8s-dynamic-11480702-vmss0000of kube-system kube-proxy-68zrs 1/1 Running 0 19h 10.239.0.6 k8s-elastic-11480702-vmss000001 kube-system kube-proxy-dbvhg 1/1 Running 0 19h 10.239.0.16 k8s-elastic-11480702-vmss000003 kube-system kube-proxy-dccdg 1/1 Running 0 151m 10.239.0.10 k8s-dynamic-11480702-vmss0000og kube-system kube-proxy-dh8ms 1/1 Running 0 19h 10.239.0.4 k8s-graph-11480702-vmss000000 kube-system kube-proxy-f7jwn 1/1 Running 0 19h 10.239.255.12 k8s-master-11480702-2 kube-system kube-proxy-fxt4b 1/1 Running 0 19h 10.239.0.17 k8s-elastic-11480702-vmss000004 kube-system kube-proxy-k6mpx 1/1 Running 0 151m 10.239.0.14 k8s-dynamic-11480702-vmss0000oh kube-system kube-proxy-kxrd5 1/1 Running 0 19h 10.239.0.20 k8s-dynamic-11480702-vmss0000md kube-system kube-proxy-ljbhj 1/1 Running 0 19h 10.239.0.26 k8s-static-11480702-vmss000004 kube-system kube-proxy-nmftd 1/1 Running 0 19h 10.239.0.9 k8s-dynamic-11480702-vmss000002 kube-system kube-proxy-p4d8j 1/1 Running 0 19h 10.239.0.12 k8s-static-11480702-vmss000001 kube-system kube-proxy-p729w 1/1 Running 0 19h 10.239.0.27 k8s-static-11480702-vmss000005 kube-system kube-proxy-qrnvw 1/1 Running 0 19h 10.239.0.8 k8s-dynamic-11480702-vmss000001 kube-system kube-proxy-rsbhc 1/1 Running 0 145m 10.239.0.24 k8s-static-11480702-vmss000014 kube-system kube-proxy-s9jps 1/1 Running 0 19h 10.239.0.11 k8s-static-11480702-vmss000000 kube-system kube-proxy-sb4bn 1/1 Running 0 19h 10.239.0.15 k8s-elastic-11480702-vmss000002 kube-system kube-proxy-szvgn 1/1 Running 0 4h46m 10.239.255.10 k8s-master-11480702-0 kube-system kube-proxy-wzwn2 1/1 Running 0 19h 10.239.0.13 k8s-dynamic-11480702-vmss0000ks kube-system kube-proxy-x98lv 1/1 Running 0 145m 10.239.0.29 k8s-static-11480702-vmss000015 kube-system kube-proxy-xnxqz 1/1 Running 0 19h 10.239.255.11 k8s-master-11480702-1 kube-system kube-scheduler-k8s-master-11480702-0 1/1 Running 0 4h45m 10.239.255.10 k8s-master-11480702-0 kube-system kube-scheduler-k8s-master-11480702-1 1/1 Running 0 23h 10.239.255.11 k8s-master-11480702-1 kube-system kube-scheduler-k8s-master-11480702-2 1/1 Running 0 23h 10.239.255.12 k8s-master-11480702-2 kube-system kubernetes-dashboard-7b5859758b-k788b 1/1 Running 0 19h 10.244.27.2 k8s-graph-11480702-vmss000000 kube-system 
metrics-server-5fdc668b9b-vx8pl 1/1 Running 0 19h 10.244.22.2 k8s-dynamic-11480702-vmss000002 kube-system tiller-deploy-88c69b9b-x5k6p 1/1 Running 0 150m 10.244.27.4 k8s-graph-11480702-vmss000000 qa prometheus-kube-state-metrics-54bd47b45f-8b726 1/1 Running 0 19h 10.244.92.3 k8s-static-11480702-vmss000013 qa prometheus-node-exporter-2tddg 1/1 Running 0 19h 10.239.0.6 k8s-elastic-11480702-vmss000001 qa prometheus-node-exporter-58dct 1/1 Running 0 19h 10.239.0.8 k8s-dynamic-11480702-vmss000001 qa prometheus-node-exporter-5969s 1/1 Running 0 19h 10.239.0.12 k8s-static-11480702-vmss000001 qa prometheus-node-exporter-6fwlr 1/1 Running 0 19h 10.239.0.27 k8s-static-11480702-vmss000005 qa prometheus-node-exporter-9lpt5 1/1 Running 0 19h 10.239.0.4 k8s-graph-11480702-vmss000000 qa prometheus-node-exporter-b5w9m 1/1 Running 0 19h 10.239.0.13 k8s-dynamic-11480702-vmss0000ks qa prometheus-node-exporter-c929k 1/1 Running 0 143m 10.239.0.29 k8s-static-11480702-vmss000015 qa prometheus-node-exporter-cb2gp 1/1 Running 0 143m 10.239.0.24 k8s-static-11480702-vmss000014 qa prometheus-node-exporter-ch6tc 1/1 Running 0 150m 10.239.0.10 k8s-dynamic-11480702-vmss0000og qa prometheus-node-exporter-hmndz 1/1 Running 0 19h 10.239.0.16 k8s-elastic-11480702-vmss000003 qa prometheus-node-exporter-j9hbm 1/1 Running 0 151m 10.239.0.7 k8s-dynamic-11480702-vmss0000of qa prometheus-node-exporter-jqqth 1/1 Running 0 19h 10.239.0.5 k8s-elastic-11480702-vmss000000 qa prometheus-node-exporter-k47b9 1/1 Running 0 19h 10.239.0.20 k8s-dynamic-11480702-vmss0000md qa prometheus-node-exporter-ktxpw 1/1 Running 0 19h 10.239.0.18 k8s-static-11480702-vmss000013 qa prometheus-node-exporter-m6l9n 1/1 Running 0 19h 10.239.0.15 k8s-elastic-11480702-vmss000002 qa prometheus-node-exporter-nsgcf 1/1 Running 0 19h 10.239.0.9 k8s-dynamic-11480702-vmss000002 qa prometheus-node-exporter-qfwb2 1/1 Running 0 151m 10.239.0.14 k8s-dynamic-11480702-vmss0000oh qa prometheus-node-exporter-sxctq 1/1 Running 0 19h 10.239.0.11 k8s-static-11480702-vmss000000 qa prometheus-node-exporter-t2dpb 1/1 Running 0 19h 10.239.0.17 k8s-elastic-11480702-vmss000004 qa prometheus-node-exporter-wljxr 1/1 Running 0 19h 10.239.0.26 k8s-static-11480702-vmss000004 qa prometheus-node-exporter-xrr5q 1/1 Running 0 19h 10.239.0.23 k8s-static-11480702-vmss000002 qa qa-ingress-nginx-ingress-controller-c48c4ccb-lmk2l 1/1 Running 4 19h 10.244.5.3 k8s-static-11480702-vmss000004 qa qknows-default-backend-7d68787864-zdtm4 1/1 Running 0 150m 10.244.104.3 k8s-dynamic-11480702-vmss0000md qa rabbitmq-rabbitmq-ha-0 1/2 RunContainerError 7 3m55s 10.244.17.3 k8s-static-11480702-vmss000001 ```

jackfrancis commented 5 years ago

@chreichert yes, thank you, it would make more sense that these failures were reproducible on new clusters as well, and not isolated to clusters having traversed the upgrade flow.

(Though that's arguably a worse error, it's at least easier to reason/isolate as being something in 1.14)

chreichert commented 5 years ago

@jackfrancis I was able to recover my QA cluster by force-downgrading it to 1.13.5 with aks-engine 0.34.3 and a little help from directly deploying manually patched templates with az group deployments.

(The problem was that aks-engine upgrade fails after every master deployment in our Azure AD enabled cluster, which does not go well with --force. In a multi-master setup this only downgrades the first master and the agents. The remaining masters must be done with az deployments, which I finally managed to get done.)

First tests are promising: everything is running again like before. Will do some more tests tomorrow.

So it looks like it really might be something in 1.14...

tomgallard commented 5 years ago

Just a note that we are seeing exactly this issue on our AKS QA cluster after upgrading to 1.14.0. It affects the Nginx Ingress pods only, as far as I can see.

Have raised a support ticket as well.

tomgallard commented 5 years ago

Worth noting for anyone else experiencing this issue: we resolved it for now by adding explicit CPU and memory limits to the nginx-ingress deployment. Not sure why this fixed things, but it did.
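
For anyone who wants to try the same mitigation, a sketch of adding such limits with kubectl is shown below; the namespace, deployment name, and sizes are placeholders rather than values confirmed in this thread.

```
# Sketch only: give the ingress controller explicit requests/limits.
# Namespace, deployment name, and sizes are placeholders.
kubectl -n kube-system set resources deployment nginx-ingress-controller \
  --requests=cpu=100m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi
```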

chreichert commented 5 years ago

Unfortunately that does not help in our case. We already had CPU and memory limits defined, and the RabbitMQ pods still crash after the 1.14.1 upgrade.

jackfrancis commented 5 years ago

@chreichert to confirm: have we repro'd this on a newly built 1.14.n cluster as well?

chreichert commented 5 years ago

@jackfrancis I will try to repro this with our apimodel on a newly built 1.14.1 cluster tomorrow. I will get back after doing this.

I could reproduce the issue by setting up a new cluster and upgrading it step by step to 1.14.1 (following our original path for our production environment):

My steps to reproduce:

  1. Setup a fresh 1.10.7 cluster using ACS-Engine 0.21.2

    api-model

    ``` { "apiVersion": "vlabs", "properties": { "orchestratorProfile": { "orchestratorType": "Kubernetes", "orchestratorRelease": "1.10", "kubernetesConfig": { "addons": [ { "name": "blobfuse-flexvolume", "enabled": false }, { "name": "smb-flexvolume", "enabled": false }, { "name": "keyvault-flexvolume", "enabled": false }, { "name": "cluster-autoscaler", "enabled": false, "containers": [ { "name": "cluster-autoscaler", "cpuRequests": "100m", "memoryRequests": "300Mi", "cpuLimits": "100m", "memoryLimits": "300Mi" } ], "config": { "maxNodes": "5", "minNodes": "1" } } ], "enableRbac": true, "privateCluster": { "enabled": true }, "networkPlugin": "kubenet", "networkPolicy": "calico", "cloudProviderBackoff": true, "cloudProviderBackoffRetries": 6, "cloudProviderBackoffJitter": 1, "cloudProviderBackoffDuration": 5, "cloudProviderBackoffExponent": 1.5, "cloudProviderRateLimit": false, "cloudProviderRateLimitQPS": 3, "cloudProviderRateLimitBucket": 10 } }, "aadProfile": { "serverAppID": "***", "clientAppID": "***", "tenantID": "***" }, "masterProfile": { "count": 3, "dnsPrefix": "***", "vmSize": "Standard_D2s_v3", "OSDiskSizeGB": 128, "vnetSubnetId": "/subscriptions/***/resourceGroups/***/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet", "firstConsecutiveStaticIP": "10.239.255.10", "vnetCidr": "10.239.0.0/16" }, "agentPoolProfiles": [ { "name": "dynamic", "count": 4, "vmSize": "Standard_D16s_v3", "OSDiskSizeGB": 128, "storageProfile": "ManagedDisks", "availabilityProfile": "VirtualMachineScaleSets", "vnetSubnetId": "/subscriptions/***/resourceGroups/***/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet" }, { "name": "graph", "count": 1, "vmSize": "Standard_E32s_v3", "OSDiskSizeGB": 128, "storageProfile": "ManagedDisks", "availabilityProfile": "VirtualMachineScaleSets", "vnetSubnetId": "/subscriptions/***/resourceGroups/***/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet" }, { "name": "static", "count": 8, "vmSize": "Standard_D16s_v3", "OSDiskSizeGB": 128, "storageProfile": "ManagedDisks", "availabilityProfile": "VirtualMachineScaleSets", "vnetSubnetId": "/subscriptions/***/resourceGroups/***/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet" }, { "name": "elastic", "count": 5, "vmSize": "Standard_D32s_v3", "OSDiskSizeGB": 128, "storageProfile": "ManagedDisks", "availabilityProfile": "VirtualMachineScaleSets", "vnetSubnetId": "/subscriptions/***/resourceGroups/***/providers/Microsoft.Network/virtualNetworks/kubernetes-vnet/subnets/kubernetes-subnet" } ], "linuxProfile": { "adminUsername": "azureuser", "ssh": { "publicKeys": [ { "keyData": "***" } ] } }, "servicePrincipalProfile": { "clientId": "***", "secret": "***" } } } ```

  2. Upgrade cluster to 1.11.5 using ACS-Engine 0.26.2 -> working fine

  3. Upgrade cluster to 1.11.6 using AKS-Engine 0.29.1 -> working fine (representing current PROD state)

  4. Upgrade cluster to 1.12.8 using AKS-Engine 0.36.4 -> some manual tweaking was necessary due to the Calico upgrade, but it finally worked fine (the first master became ready after adding tolerations to the calico-node pod; agent nodes needed re-imaging to become ready)

  5. Upgrade cluster to 1.13.5 using AKS-Engine 0.36.4 -> working fine

  6. Upgrade cluster to 1.14.1 using AKS-Engine 0.36.4 -> rabbitmq and other pods crash with the pthread error

chreichert commented 5 years ago

@jackfrancis Today I did a direct setup of a fresh 1.14.1 cluster using AKS-Engine 0.36.4 and our apimodel as posted above. I deployed our application components. Everything is running fine so far. So I would pin down the issue to be something in the upgrade to 1.14.n!

chreichert commented 5 years ago

@jackfrancis Today I found the cause of the issue. I compared the generated apimodel.json from the newly set-up (working) 1.14.1 cluster with the one from the upgraded (not working) cluster. One of the differences was the "--pod-max-pids" parameter in the kubeletConfig of kubernetesConfig, masterProfile, and all the agentPoolProfiles: in the newly set-up cluster it was set to "-1", while in the upgraded cluster it was set to "100".

So I did another test: before upgrading my (working) 1.13.5 cluster, I patched the apimodel.json and set all occurrences of "--pod-max-pids" to "-1". After that I did a normal upgrade using AKS-Engine 0.36.4, which completed without any errors. I deployed our application components and could not find any errors anymore. Everything is running fine on the upgraded cluster now.

So my conclusion is that the upgrade does not patch the default of "--pod-max-pids", which I believe must be set to "-1" in a "standard" 1.14.n cluster.

The workaround for upgrading clusters initially set up with ACS-Engine, or with AKS-Engine before 0.34.3 (where the change of this parameter was introduced, PR #1126), is to manually patch apimodel.json as described above before upgrading from 1.13.n to 1.14.n.
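
A sketch of what that manual patch can look like with jq; the paths follow the fields named in this thread and are not confirmed against every aks-engine version, so verify them against your own generated apimodel.json first.

```
# Set every "--pod-max-pids" in the generated apimodel.json to "-1" before
# running `aks-engine upgrade`. Paths are assumptions based on this thread.
jq '( .properties.orchestratorProfile.kubernetesConfig.kubeletConfig["--pod-max-pids"]
    , .properties.masterProfile.kubernetesConfig.kubeletConfig["--pod-max-pids"]
    , .properties.agentPoolProfiles[].kubernetesConfig.kubeletConfig["--pod-max-pids"]
    ) = "-1"' apimodel.json > apimodel.patched.json

# Quick check that no profile still pins the limit to 100:
grep -n 'pod-max-pids' apimodel.patched.json
```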

jackfrancis commented 5 years ago

@chreichert Thanks for the update! I think what's actually happening is this:

Hope that makes sense. We're saying the same thing functionally. I have a PR which should fix upgrade scenarios:

https://github.com/Azure/aks-engine/pull/1508

chreichert commented 5 years ago

/reopen

@jackfrancis Sorry, but the issue is still occurring after upgrade from 1.13.7 to 1.14.3 with AKS-Engine 0.37.3.

The PR only fixes one occurrence of "--pod-max-pids": properties.orchestratorProfile.kubernetesConfig.kubeletConfig.

The other occurrences in properties.masterProfile.kubernetesConfig.kubeletConfig and properties.agentPoolProfiles[*].kubernetesConfig still read "--pod-max-pids=100" after upgrade.

Workload pods still crash after upgrading with 0.37.3.

acs-bot commented 5 years ago

@chreichert: Reopened this issue.

In response to [this](https://github.com/Azure/aks-engine/issues/1270#issuecomment-505779484):

> /reopen
>
> @jackfrancis Sorry, but the issue is still occurring after upgrade from 1.13.7 to 1.14.3 with AKS-Engine 0.37.3
>
> The PR only fixes one occurrence of "--pod-max-pids": properties.orchestratorProfile.kubernetesConfig.kubeletConfig.
>
> The other occurrences in properties.masterProfile.kubernetesConfig.kubeletConfig and properties.agentPoolProfiles[*].kubernetesConfig still read "--pod-max-pids=100" after upgrade.
>
> Workload pods still crash after upgrading with 0.37.3.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
chreichert commented 5 years ago

@jackfrancis Can you please have a look again? The issue is still occurring after upgrade from 1.13.7 to 1.14.3 with AKS-Engine 0.37.3.

The PR only fixes one occurrence of "--pod-max-pids": properties.orchestratorProfile.kubernetesConfig.kubeletConfig.

The other occurrences in properties.masterProfile.kubernetesConfig.kubeletConfig and properties.agentPoolProfiles[*].kubernetesConfig still read "--pod-max-pids=100" after upgrade.

Workload pods still crash after upgrading with 0.37.3.
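
For anyone verifying this on their own cluster, here is a small sketch that lists every "--pod-max-pids" left in the post-upgrade apimodel.json together with the profile it belongs to. It is a generic jq query, not an aks-engine feature, and assumes the file layout described in this thread.

```
# List every "--pod-max-pids" in the generated apimodel.json together with its path,
# to see which profiles were (or were not) rewritten by the upgrade.
jq -r 'paths(scalars) as $p
       | select($p[-1] == "--pod-max-pids")
       | "\($p | join(".")): \(getpath($p))"' apimodel.json
```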

robinkb commented 5 years ago

I am seeing the same symptoms using AKS (non-engine) 1.14.3.