otaviosoares closed this issue 7 years ago
Hi Otavio, I'm not seeing any outages in Azure either. Can you please email me your Subscription ID directly so I can do some searching on my side? Also, are you still seeing this happen?
Hi anhowe, thanks for the reply.
It went back to working automatically, but since this caused issues in some pods, I'm investigating the cause.
I'm sending my subscription ID right now.
Closing due to no repro. Please re-open if you are still seeing the issue.
Hi. We are having a similar issue.
Our clusters keep going into FullDisruption almost every day, several times a day, for no apparent reason. Is there a way to find out what causes the FullDisruption?
Thanks in advance!
+1
@anhowe I would like to reopen this. I have had several clusters where this happens in northeurope, and one where it still happens (our production cluster). Most of the time nothing (really) bad actually happens: nodes are flagged as NodeNotReady but come back within a minute or two, so the scheduler doesn't start rescheduling pods just yet. Sometimes it does, though, which can be bad.
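In case it helps anyone else, a rough way to see why the controller flags a node is to look at the node's Ready condition and heartbeat timestamps right after a flap (the node name below is one of my agents; substitute your own):
$ kubectl describe node k8s-agent-99949627-1 | grep -A 8 'Conditions:'
$ kubectl get node k8s-agent-99949627-1 -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
The condition's reason/message usually tells you whether the kubelet stopped posting status or reported a problem itself.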
Example log from the node controller:
$ kubectl logs -n kube-system kube-controller-manager-k8s-master-99949627-0 | grep northeurope
...
I1020 20:27:11.684316 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1020 20:28:14.910194 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1020 20:30:26.545049 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1020 20:31:25.850936 1 nodecontroller.go:793] NodeController detected that zone northeurope::1 is now in state FullDisruption.
I1020 20:31:30.919402 1 nodecontroller.go:793] NodeController detected that zone northeurope::1 is now in state Normal.
I1020 20:31:49.351263 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1020 20:34:04.081193 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1020 20:34:47.381595 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1020 20:36:49.869167 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1020 20:37:23.262599 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1020 20:38:50.914547 1 nodecontroller.go:793] NodeController detected that zone northeurope::1 is now in state FullDisruption.
I1020 20:39:01.291642 1 nodecontroller.go:793] NodeController detected that zone northeurope::1 is now in state Normal.
I1020 20:39:50.810492 1 nodecontroller.go:793] NodeController detected that zone northeurope::1 is now in state FullDisruption.
I1020 20:39:56.568564 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1020 20:40:01.683276 1 nodecontroller.go:793] NodeController detected that zone northeurope::1 is now in state Normal.
I1020 20:40:23.707296 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1020 20:43:56.111213 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1020 20:44:23.643136 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1020 20:51:25.010778 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1020 20:52:34.817357 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1020 21:13:49.832162 1 nodecontroller.go:793] NodeController detected that zone northeurope::1 is now in state FullDisruption.
I1020 21:14:23.062717 1 nodecontroller.go:793] NodeController detected that zone northeurope::1 is now in state Normal.
I1020 21:18:01.085188 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1020 21:18:59.121567 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1020 21:27:23.162240 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1020 21:28:08.683617 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1020 22:05:28.701694 1 nodecontroller.go:793] NodeController detected that zone northeurope::1 is now in state FullDisruption.
I1020 22:05:49.462008 1 nodecontroller.go:793] NodeController detected that zone northeurope::1 is now in state Normal.
I1020 22:20:56.036951 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1020 22:21:22.227524 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1021 00:10:25.256671 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1021 00:10:46.123385 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1021 00:12:02.435937 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1021 00:12:07.436365 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1021 00:20:40.599278 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1021 00:21:58.464555 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1021 00:23:29.074803 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1021 00:24:13.776809 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1021 01:17:50.073963 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1021 01:18:10.197718 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
I1021 04:30:01.571340 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state FullDisruption.
I1021 04:30:21.691475 1 nodecontroller.go:793] NodeController detected that zone northeurope::2 is now in state Normal.
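As a rough way to quantify it, I just count the FullDisruption transitions per zone straight from the same controller-manager log:
$ kubectl logs -n kube-system kube-controller-manager-k8s-master-99949627-0 | grep 'is now in state FullDisruption' | grep -o 'zone [^ ]*' | sort | uniq -c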
We also see a lot of these, and they seem to be related to the FullDisruption:
I1021 04:30:00.998034 1 controller_utils.go:285] Recording status change NodeNotReady event message for node k8s-agent-99949627-1
I1021 04:30:00.998628 1 controller_utils.go:203] Update ready status of pods on node [k8s-agent-99949627-1]
I1021 04:30:00.999217 1 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"k8s-agent-99949627-1", UID:"ebb34dd1-b5b6-11e7-972b-000d3ab79d5a", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node k8s-agent-99949627-1 status is now: NodeNotReady
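To correlate those events with the FullDisruption lines, listing the recorded NodeNotReady events sorted by time also helps; note the default event TTL is only an hour, so this has to be run shortly after an episode:
$ kubectl get events --all-namespaces --field-selector reason=NodeNotReady --sort-by=.lastTimestamp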
I have already filed a support case for this as well.
I have also seen this recently in the westeurope region. Shortly after, one of the nodes was marked as unresponsive, resulting in a cascade of pod killing and reshuffling, which ended up being really bad for our services...
I0305 19:48:29.722056 1 node_controller.go:832] Controller detected that zone westeurope::1 is now in state FullDisruption.
I0305 19:48:37.341358 1 service_controller.go:633] Detected change in list of current cluster nodes. New node set: [k8s-premagents-12388381-1 k8s-premagents-12388381-4]
I0305 19:50:10.849706 1 azure_loadbalancer.go:273] ensure(default/demo-application): lb(cnnyn-dev-internal) finished
I0305 19:50:20.929191 1 azure_loadbalancer.go:273] ensure(kube-system/kibana-logging): lb(cnnyn-dev-internal) finished
I0305 19:52:02.220410 1 azure_loadbalancer.go:273] ensure(monitoring/grafana): lb(cnnyn-dev-internal) finished
I0305 19:52:09.378932 1 azure_loadbalancer.go:273] ensure(kube-system/nginx-ingress-controller): lb(cnnyn-dev) finished
I0305 19:52:09.378997 1 service_controller.go:641] Successfully updated 47 out of 47 load balancers to direct traffic to the updated set of nodes
I0305 19:52:21.434249 1 node_controller.go:683] Node is unresponsive. Adding Pods on Node k8s-premagents-12388381-0 to eviction queues: 2018-03-05 19:52:21.434226064 +0000 UTC is later than 2018-03-05 19:46:28.969335062 +0000 UTC + 4m20s
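If I read that last line correctly, the eviction decision is simply "now is later than the last time the node was seen healthy plus a timeout", and the 4m20s window appears to be the default --pod-eviction-timeout (5m) minus --node-monitor-grace-period (40s). You can check whether your controller-manager overrides either flag (pod name is a placeholder, use whatever yours is called):
$ kubectl -n kube-system get pod <kube-controller-manager-pod> -o yaml | grep -E 'pod-eviction-timeout|node-monitor-grace-period'
No output just means the defaults are in effect.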
I got the very same issue on AKS - FullDisruption taking nodes down:
I1015 05:34:52.627433 1 service_controller.go:636] Detected change in list of current cluster nodes. New node set: map[aks-agentpool-21298390-0:{} aks-agentpool-21298390-1:{}]
I1015 05:41:23.556268 1 node_lifecycle_controller.go:995] Controller detected that zone westeurope::1 is now in state FullDisruption.
I1015 05:41:33.245238 1 service_controller.go:636] Detected change in list of current cluster nodes. New node set: map[aks-agentpool-21298390-1:{}]
I1015 05:41:38.690048 1 node_lifecycle_controller.go:995] Controller detected that zone westeurope::1 is now in state Normal.
I1015 05:43:14.004137 1 service_controller.go:636] Detected change in list of current cluster nodes. New node set: map[aks-agentpool-21298390-0:{} aks-agentpool-21298390-1:{}]
I1015 05:44:01.650833 1 node_lifecycle_controller.go:995] Controller detected that zone westeurope::1 is now in state FullDisruption.
I1015 05:44:54.270200 1 service_controller.go:636] Detected change in list of current cluster nodes. New node set: map[aks-agentpool-21298390-1:{}]
I1015 05:46:33.870625 1 node_lifecycle_controller.go:945] Controller detected that all Nodes are not-Ready. Entering master disruption mode.
I1015 05:46:34.601704 1 service_controller.go:636] Detected change in list of current cluster nodes. New node set: map[]
I1015 05:47:44.118424 1 node_lifecycle_controller.go:972] Controller detected that some Nodes are Ready. Exiting master disruption mode.
I1015 05:47:54.192104 1 node_lifecycle_controller.go:534] Pods awaiting deletion due to Controller eviction
I1015 05:48:14.715905 1 service_controller.go:636] Detected change in list of current cluster nodes. New node set: map[aks-agentpool-21298390-0:{}]
I1015 06:23:24.400342 1 node_lifecycle_controller.go:995] Controller detected that zone westeurope::0 is now in state Normal.
I1015 06:23:43.691599 1 node_lifecycle_controller.go:995] Controller detected that zone westeurope::1 is now in state FullDisruption.
I1015 06:24:54.986163 1 service_controller.go:636] Detected change in list of current cluster nodes. New node set: map[aks-agentpool-21298390-0:{} aks-agentpool-21298390-2:{}]
I1015 06:24:58.807675 1 node_lifecycle_controller.go:995] Controller detected that zone westeurope::1 is now in state Normal.
I1015 06:40:44.334945 1 node_lifecycle_controller.go:995] Controller detected that zone westeurope::1 is now in state FullDisruption.
I1015 06:41:29.378454 1 node_lifecycle_controller.go:995] Controller detected that zone westeurope::1 is now in state Normal.
I am getting something very similar on Azure VMs with k8s version 1.12.2:
I1113 21:04:58.053316 1 node_lifecycle_controller.go:1065] Controller detected that zone southcentralus::0 is now in state Normal.
I1113 21:10:56.180321 1 service_controller.go:639] Detected change in list of current cluster nodes. New node set: map[worke423e000000:{} worke423e000002:{}]
I1113 21:10:56.180463 1 service_controller.go:647] Successfully updated 2 out of 2 load balancers to direct traffic to the updated set of nodes
E1113 21:11:51.303267 1 node_lifecycle_controller.go:766] Error determining if node worke423e000001 shutdown in cloud: failed to get power status for node "worke423e000001"
E1113 21:11:51.351013 1 node_lifecycle_controller.go:959] Error updating node worke423e000000: Operation cannot be fulfilled on nodes "worke423e000000": the object has been modified; please apply your changes to the latest version and try again
E1113 21:11:51.377746 1 node_lifecycle_controller.go:959] Error updating node cpsoufa8d000000: Operation cannot be fulfilled on nodes "cpsoufa8d000000": the object has been modified; please apply your changes to the latest version and try again
E1113 21:11:51.404737 1 node_lifecycle_controller.go:959] Error updating node cpsoufa8d000001: Operation cannot be fulfilled on nodes "cpsoufa8d000001": the object has been modified; please apply your changes to the latest version and try again
I1113 21:12:36.180669 1 service_controller.go:639] Detected change in list of current cluster nodes. New node set: map[worke423e000002:{} worke423e000001:{} worke423e000000:{}]
I1113 21:12:36.180706 1 service_controller.go:647] Successfully updated 2 out of 2 load balancers to direct traffic to the updated set of nodes
I1113 21:22:36.182110 1 service_controller.go:639] Detected change in list of current cluster nodes. New node set: map[worke423e000001:{} worke423e000000:{}]
I1113 21:22:36.182151 1 service_controller.go:647] Successfully updated 2 out of 2 load balancers to direct traffic to the updated set of nodes
I1113 21:24:16.182360 1 service_controller.go:639] Detected change in list of current cluster nodes. New node set: map[worke423e000000:{}]
I1113 21:24:16.182413 1 service_controller.go:647] Successfully updated 2 out of 2 load balancers to direct traffic to the updated set of nodes
W1113 21:25:05.881949 1 reflector.go:270] k8s.io/client-go/informers/factory.go:131: watch of *v1beta1.Event ended with: The resourceVersion for the provided watch is too old.
W1113 21:33:51.891877 1 reflector.go:270] k8s.io/client-go/informers/factory.go:131: watch of *v1beta1.Event ended with: The resourceVersion for the provided watch is too old.
I1113 21:37:36.183972 1 service_controller.go:639] Detected change in list of current cluster nodes. New node set: map[worke423e000002:{} worke423e000000:{}]
I1113 21:37:36.184031 1 service_controller.go:647] Successfully updated 2 out of 2 load balancers to direct traffic to the updated set of nodes
E1113 21:38:03.160321 1 node_lifecycle_controller.go:959] Error updating node worke423e000000: Operation cannot be fulfilled on nodes "worke423e000000": the object has been modified; please apply your changes to the latest version and try again
I1113 21:39:16.184195 1 service_controller.go:639] Detected change in list of current cluster nodes. New node set: map[worke423e000002:{} worke423e000001:{} worke423e000000:{}]
I1113 21:39:16.184244 1 service_controller.go:647] Successfully updated 2 out of 2 load balancers to direct traffic to the updated set of nodes
I1113 21:45:56.184971 1 service_controller.go:639] Detected change in list of current cluster nodes. New node set: map[worke423e000001:{} worke423e000000:{}]
I1113 21:45:56.185014 1 service_controller.go:647] Successfully updated 2 out of 2 load balancers to direct traffic to the updated set of nodes
W1113 21:47:40.952056 1 reflector.go:270] k8s.io/client-go/informers/factory.go:131: watch of *v1beta1.Event ended with: The resourceVersion for the provided watch is too old.
I1113 21:49:47.294238 1 event.go:221] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"kube-system", Name:"calico-kube-controllers-777f9c586b", UID:"e542c162-e729-11e8-9cc0-000d3a733ae4", APIVersion:"apps/v1", ResourceVersion:"71668", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: calico-kube-controllers-777f9c586b-gbf2p
I1113 21:49:47.299842 1 event.go:221] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"kube-system", Name:"coredns-c464566bb", UID:"d9570578-e729-11e8-9cc0-000d3a733ae4", APIVersion:"apps/v1", ResourceVersion:"71665", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: coredns-c464566bb-btx9d
I1113 21:49:47.300846 1 node_lifecycle_controller.go:604] Pods awaiting deletion due to Controller eviction
I1113 21:55:13.896645 1 node_lifecycle_controller.go:1065] Controller detected that zone southcentralus::2 is now in state FullDisruption.
W1113 22:01:56.061059 1 reflector.go:270] k8s.io/client-go/informers/factory.go:131: watch of *v1beta1.Event ended with: The resourceVersion for the provided watch is too old.
E1113 22:11:15.760617 1 node_lifecycle_controller.go:959] Error updating node worke423e000001: Operation cannot be fulfilled on nodes "worke423e000001": the object has been modified; please apply your changes to the latest version and try again
I1113 22:12:01.745778 1 node_lifecycle_controller.go:1065] Controller detected that zone southcentralus::0 is now in state FullDisruption.
I1113 22:12:36.195702 1 service_controller.go:639] Detected change in list of current cluster nodes. New node set: map[]
I1113 22:12:36.195753 1 service_controller.go:647] Successfully updated 2 out of 2 load balancers to direct traffic to the updated set of nodes
I1113 22:13:17.989597 1 node_lifecycle_controller.go:1065] Controller detected that zone southcentralus::0 is now in state Normal.
I1113 22:14:16.195947 1 service_controller.go:639] Detected change in list of current cluster nodes. New node set: map[worke423e000001:{} worke423e000000:{}]
I1113 22:14:16.195988 1 service_controller.go:647] Successfully updated 2 out of 2 load balancers to direct traffic to the updated set of nodes
I1113 22:15:08.487190 1 node_lifecycle_controller.go:1065] Controller detected that zone southcentralus::2 is now in state Normal.
I1113 22:15:56.196207 1 service_controller.go:639] Detected change in list of current cluster nodes. New node set: map[worke423e000001:{} worke423e000000:{} worke423e000002:{}]
I1113 22:15:56.196255 1 service_controller.go:647] Successfully updated 2 out of 2 load balancers to direct traffic to the updated set of nodes
I1113 22:16:24.747534 1 node_lifecycle_controller.go:604] Pods awaiting deletion due to Controller eviction
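The "failed to get power status for node" errors above suggest the Azure cloud-provider calls from the controller-manager were failing (or being throttled) around the same time as the disruption transitions. Grepping the controller-manager for Azure API errors might narrow this down (pod name is a placeholder for your controller-manager pod):
$ kubectl -n kube-system logs <kube-controller-manager-pod> | grep -iE 'azure.*error|throttl' | tail -n 50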
I got FullDisruption three times today in westeurope, and the pods on one of the worker nodes got evicted, which left memory on the other worker node over-utilised.
I0502 20:22:12.083946 1 node_lifecycle_controller.go:1122] Controller detected that zone westeurope::0 is now in state FullDisruption.
I0502 22:08:19.752793 1 node_lifecycle_controller.go:1122] Controller detected that zone westeurope::2 is now in state FullDisruption.
I0502 22:22:15.109095 1 node_lifecycle_controller.go:1122] Controller detected that zone westeurope::0 is now in state FullDisruption.
The serial console of the affected worker node was repeatedly generating logs like these:
[4238989.760650] [<ffffffffafb5d44e>] __alloc_pages_slowpath+0x6b6/0x724
[4238989.760652] [<ffffffffaf5c1105>] __alloc_pages_nodemask+0x405/0x420
[4238989.760654] [<ffffffffaf5c1478>] page_frag_alloc+0x158/0x170
[4238989.760658] [<ffffffffafa27e1d>] __napi_alloc_skb+0x8d/0xe0
[4238989.760677] [<ffffffffc00fe040>] netvsc_recv_callback+0x80/0x190 [hv_netvsc]
[4238989.760682] [<ffffffffc01012f3>] rndis_filter_receive+0x273/0x570 [hv_netvsc]
[4238989.760685] [<ffffffffaf4cf7ca>] ? update_rq_clock.part.78+0x1a/0x150
[4238989.760690] [<ffffffffc00fe6e9>] netvsc_poll+0x1a9/0x720 [hv_netvsc]
[4238989.760693] [<ffffffffafa39f1f>] net_rx_action+0x26f/0x390
[4238989.760696] [<ffffffffaf4a0f05>] __do_softirq+0xf5/0x280
[4238989.760698] [<ffffffffafb7832c>] call_softirq+0x1c/0x30
[4238989.760700] [<ffffffffaf42e675>] do_softirq+0x65/0xa0
[4238989.760702] [<ffffffffaf4a1285>] irq_exit+0x105/0x110
[4238989.760705] [<ffffffffaf455244>] hyperv_vector_handler+0x64/0x70
[4238989.760706] [<ffffffffafb78722>] hyperv_callback_vector+0x162/0x170
[4238989.760710] <EOI> [<ffffffffafb69aa0>] ? __cpuidle_text_start+0x8/0x8
[4238989.760712] [<ffffffffafb69ca6>] ? native_safe_halt+0x6/0x10
[4238989.760713] [<ffffffffafb69abe>] default_idle+0x1e/0xc0
[4238989.760716] [<ffffffffaf4366f0>] arch_cpu_idle+0x20/0xc0
[4238989.760719] [<ffffffffaf4fc49a>] cpu_startup_entry+0x14a/0x1e0
[4238989.760721] [<ffffffffaf457db7>] start_secondary+0x1f7/0x270
[4238989.760724] [<ffffffffaf4000d5>] start_cpu+0x5/0x14
[4238989.480759] [<ffffffffc0655440>] ? cifs_handle_standard+0x1b0/0x1b0 [cifs]
[4238989.480759] [<ffffffffaf4c1c31>] kthread+0xd1/0xe0
[4238989.480759] [<ffffffffaf4c1b60>] ? insert_kthread_work+0x40/0x40
[4238989.480759] [<ffffffffafb74c37>] ret_from_fork_nospec_begin+0x21/0x21
[4238989.480759] [<ffffffffaf4c1b60>] ? insert_kthread_work+0x40/0x40
[4238989.480759] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[4238989.480759] cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
[4238989.480759] node 0: slabs: 5061, objs: 323904, free: 0
[4238991.894196] docker-containe: page allocation failure: order:0, mode:0x20
[4238991.895721] docker-containe: page allocation failure: order:0, mode:0x20
[4238991.895725] CPU: 4 PID: 20591 Comm: docker-containe Tainted: G ------------ T 3.10.0-957.1.3.el7.x86_64 #1
[4238991.895726] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
[4238991.895727] Call Trace:
[4238991.895738] <IRQ> [<ffffffffafb61e41>] dump_stack+0x19/0x1b
[4238991.895743] [<ffffffffaf5bcaa0>] warn_alloc_failed+0x110/0x180
[4238991.895748] [<ffffffffaf4e0056>] ? update_curr+0x86/0x1e0
[4238991.895751] [<ffffffffafb5d44e>] __alloc_pages_slowpath+0x6b6/0x724
[4238991.895754] [<ffffffffaf5c1105>] __alloc_pages_nodemask+0x405/0x420
[4238991.895757] [<ffffffffaf5c1478>] page_frag_alloc+0x158/0x170
[4238991.895761] [<ffffffffafa27e1d>] __napi_alloc_skb+0x8d/0xe0
[4238991.895788] [<ffffffffc00fe040>] netvsc_recv_callback+0x80/0x190 [hv_netvsc]
[4238991.895794] [<ffffffffc01012f3>] rndis_filter_receive+0x273/0x570 [hv_netvsc]
[4238991.895799] [<ffffffffaf4c48e1>] ? run_posix_cpu_timers+0x51/0x830
[4238991.895804] [<ffffffffc00fe6e9>] netvsc_poll+0x1a9/0x720 [hv_netvsc]
[4238991.895808] [<ffffffffafa39f1f>] net_rx_action+0x26f/0x390
[4238991.895812] [<ffffffffaf4a0f05>] __do_softirq+0xf5/0x280
[4238991.895815] [<ffffffffafb7832c>] call_softirq+0x1c/0x30
[4238991.895819] [<ffffffffaf42e675>] do_softirq+0x65/0xa0
[4238991.895821] [<ffffffffaf4a1285>] irq_exit+0x105/0x110
[4238991.895825] [<ffffffffaf455244>] hyperv_vector_handler+0x64/0x70
[4238991.895827] [<ffffffffafb78722>] hyperv_callback_vector+0x162/0x170
[4238991.895843] <EOI> [<ffffffffaf63143a>] ? mem_cgroup_iter+0x7a/0x2d0
[4238991.895846] [<ffffffffaf63162b>] ? mem_cgroup_iter+0x26b/0x2d0
[4238991.895850] [<ffffffffaf5cc8aa>] shrink_zone+0x9a/0x1a0
[4238991.895852] [<ffffffffaf5ccd90>] do_try_to_free_pages+0xf0/0x4e0
[4238991.895855] [<ffffffffaf5c942a>] ? throttle_direct_reclaim+0xaa/0x2b0
[4238991.895858] [<ffffffffaf5bf7c0>] ? drain_pages+0xb0/0xb0
[4238991.895860] [<ffffffffaf5cd27c>] try_to_free_pages+0xfc/0x180
[4238991.895863] [<ffffffffafb5d1ef>] __alloc_pages_slowpath+0x457/0x724
[4238991.895866] [<ffffffffaf5c1105>] __alloc_pages_nodemask+0x405/0x420
[4238991.895868] [<ffffffffaf60df68>] alloc_pages_current+0x98/0x110
[4238991.895871] [<ffffffffaf5b6347>] __page_cache_alloc+0x97/0xb0
[4238991.895873] [<ffffffffaf5b8fa8>] filemap_fault+0x298/0x490
[4238991.895903] [<ffffffffc01ccd0e>] __xfs_filemap_fault+0x7e/0x1d0 [xfs]
[4238991.895922] [<ffffffffc01ccf0c>] xfs_filemap_fault+0x2c/0x30 [xfs]
[4238991.895927] [<ffffffffaf5e444a>] __do_fault.isra.59+0x8a/0x100
[4238991.895931] [<ffffffffaf50c80c>] ? __unqueue_futex+0x2c/0x60
[4238991.895934] [<ffffffffaf5e49fc>] do_read_fault.isra.61+0x4c/0x1b0
[4238991.895936] [<ffffffffaf5e93a4>] handle_pte_fault+0x2f4/0xd10
[4238991.895939] [<ffffffffaf50cf14>] ? futex_wait_queue_me+0xa4/0x130
[4238991.895941] [<ffffffffaf5ebedd>] handle_mm_fault+0x39d/0x9b0
[4238991.895945] [<ffffffffafb6f5e3>] __do_page_fault+0x203/0x500
[4238991.895947] [<ffffffffafb6f915>] do_page_fault+0x35/0x90
[4238991.895950] [<ffffffffafb6b758>] page_fault+0x28/0x30
[4238992.127706] CPU: 7 PID: 47464 Comm: docker-containe Tainted: G ------------ T 3.10.0-957.1.3.el7.x86_64 #1
[4238992.135327] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
[4238992.142819] Call Trace:
[4238992.145728] <IRQ> [<ffffffffafb61e41>] dump_stack+0x19/0x1b
[4238992.151057] [<ffffffffaf5bcaa0>] warn_alloc_failed+0x110/0x180
[4238992.156225] [<ffffffffaf4e00b6>] ? update_curr+0xe6/0x1e0
[4238992.160988] [<ffffffffafb5d44e>] __alloc_pages_slowpath+0x6b6/0x724
[4238992.166423] [<ffffffffaf5c1105>] __alloc_pages_nodemask+0x405/0x420
[4238992.172061] [<ffffffffaf5c1478>] page_frag_alloc+0x158/0x170
[4238992.176908] [<ffffffffafa27e1d>] __napi_alloc_skb+0x8d/0xe0
[4238992.181321] [<ffffffffc00fe040>] netvsc_recv_callback+0x80/0x190 [hv_netvsc]
[4238992.186679] [<ffffffffc01012f3>] rndis_filter_receive+0x273/0x570 [hv_netvsc]
[4238992.192315] [<ffffffffaf4c48e1>] ? run_posix_cpu_timers+0x51/0x830
[4238992.203493] [<ffffffffc00fe6e9>] netvsc_poll+0x1a9/0x720 [hv_netvsc]
[4238992.209454] [<ffffffffafa39f1f>] net_rx_action+0x26f/0x390
[4238992.214187] [<ffffffffaf4a0f05>] __do_softirq+0xf5/0x280
[4238992.218862] [<ffffffffafb7832c>] call_softirq+0x1c/0x30
[4238992.223895] [<ffffffffaf42e675>] do_softirq+0x65/0xa0
[4238992.228517] [<ffffffffaf4a1285>] irq_exit+0x105/0x110
[4238992.232763] [<ffffffffaf455244>] hyperv_vector_handler+0x64/0x70
[4238992.237413] [<ffffffffafb78722>] hyperv_callback_vector+0x162/0x170
[4238992.242400] <EOI> [<ffffffffaf5cb8f5>] ? shrink_inactive_list+0x155/0x5d0
[4238992.247911] [<ffffffffaf5cb8d9>] ? shrink_inactive_list+0x139/0x5d0
[4238992.253018] [<ffffffffaf4d64f0>] ? try_to_wake_up+0x190/0x390
[4238992.257759] [<ffffffffaf5cc465>] shrink_lruvec+0x385/0x730
[4238992.262350] [<ffffffffafb6778f>] ? __schedule+0x3ff/0x890
[4238992.266959] [<ffffffffaf5cc886>] shrink_zone+0x76/0x1a0
[4238992.271328] [<ffffffffaf5ccd90>] do_try_to_free_pages+0xf0/0x4e0
[4238992.276462] [<ffffffffaf5cd27c>] try_to_free_pages+0xfc/0x180
[4238992.281559] [<ffffffffafb5d1ef>] __alloc_pages_slowpath+0x457/0x724
[4238992.286292] [<ffffffffaf5c1105>] __alloc_pages_nodemask+0x405/0x420
[4238992.291544] [<ffffffffaf60df68>] alloc_pages_current+0x98/0x110
[4238992.296475] [<ffffffffaf5b6347>] __page_cache_alloc+0x97/0xb0
[4238992.305947] [<ffffffffaf5b8fa8>] filemap_fault+0x298/0x490
[4238992.310959] [<ffffffffc01ccd0e>] __xfs_filemap_fault+0x7e/0x1d0 [xfs]
[4238992.316199] [<ffffffffc01ccf0c>] xfs_filemap_fault+0x2c/0x30 [xfs]
[4238992.321173] [<ffffffffaf5e444a>] __do_fault.isra.59+0x8a/0x100
[4238992.325674] [<ffffffffaf5e49fc>] do_read_fault.isra.61+0x4c/0x1b0
[4238992.330717] [<ffffffffaf5e93a4>] handle_pte_fault+0x2f4/0xd10
[4238992.335827] [<ffffffffaf50cf14>] ? futex_wait_queue_me+0xa4/0x130
[4238992.340935] [<ffffffffaf5ebedd>] handle_mm_fault+0x39d/0x9b0
[4238992.344947] [<ffffffffafb6f5e3>] __do_page_fault+0x203/0x500
[4238992.349817] [<ffffffffafb6f915>] do_page_fault+0x35/0x90
[4238992.353774] [<ffffffffafb6b758>] page_fault+0x28/0x30
[4238992.436217] swapper/6: page allocation failure: order:0, mode:0x20
[4238992.437209] CPU: 6 PID: 0 Comm: swapper/6 Tainted: G ------------ T 3.10.0-957.1.3.el7.x86_64 #1
[4238992.444673] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
[4238992.444673] Call Trace:
[4238992.444673] <IRQ> [<ffffffffafb61e41>] dump_stack+0x19/0x1b
[4238992.444673] [<ffffffffaf5bcaa0>] warn_alloc_failed+0x110/0x180
[4238992.444673] [<ffffffffaf4e0056>] ? update_curr+0x86/0x1e0
[4238992.444673] [<ffffffffafb5d44e>] __alloc_pages_slowpath+0x6b6/0x724
[4238992.444673] [<ffffffffaf5c1105>] __alloc_pages_nodemask+0x405/0x420
[4238992.444673] [<ffffffffaf5c1478>] page_frag_alloc+0x158/0x170
[4238992.453531] [<ffffffffafa27e1d>] __napi_alloc_skb+0x8d/0xe0
[4238992.496937] [<ffffffffc00fe040>] netvsc_recv_callback+0x80/0x190 [hv_netvsc]
[4238992.497906] [<ffffffffc01012f3>] rndis_filter_receive+0x273/0x570 [hv_netvsc]
[4238992.497906] [<ffffffffc00fe6e9>] netvsc_poll+0x1a9/0x720 [hv_netvsc]
[4238992.497906] [<ffffffffafa39f1f>] net_rx_action+0x26f/0x390
[4238992.497906] [<ffffffffaf4a0f05>] __do_softirq+0xf5/0x280
[4238992.497906] [<ffffffffafb7832c>] call_softirq+0x1c/0x30
[4238992.530782] [<ffffffffaf42e675>] do_softirq+0x65/0xa0
[4238992.530782] [<ffffffffaf4a1285>] irq_exit+0x105/0x110
[4238992.530782] [<ffffffffaf455244>] hyperv_vector_handler+0x64/0x70
[4238992.530782] [<ffffffffafb78722>] hyperv_callback_vector+0x162/0x170
[4238992.530782] <EOI> [<ffffffffafb69aa0>] ? __cpuidle_text_start+0x8/0x8
[4238992.530782] [<ffffffffafb69ca6>] ? native_safe_halt+0x6/0x10
[4238992.530782] [<ffffffffafb69abe>] default_idle+0x1e/0xc0
[4238992.530782] [<ffffffffaf4366f0>] arch_cpu_idle+0x20/0xc0
[4238992.530782] [<ffffffffaf4fc49a>] cpu_startup_entry+0x14a/0x1e0
[4238992.530782] [<ffffffffaf457db7>] start_secondary+0x1f7/0x270
[4238992.530782] [<ffffffffaf4000d5>] start_cpu+0x5/0x14
[4238992.964111] cifsd: page allocation failure: order:0, mode:0x200020
[4238992.965047] CPU: 2 PID: 43296 Comm: cifsd Tainted: G ------------ T 3.10.0-957.1.3.el7.x86_64 #1
[4238992.965047] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
[4238992.965047] Call Trace:
[4238992.965047] [<ffffffffafb61e41>] dump_stack+0x19/0x1b
[4238992.965047] [<ffffffffaf5bcaa0>] warn_alloc_failed+0x110/0x180
[4238992.996322] [<ffffffffaf881eea>] ? __mix_pool_bytes+0x3a/0x80
[4238992.996322] [<ffffffffafb5d44e>] __alloc_pages_slowpath+0x6b6/0x724
[4238992.996322] [<ffffffffaf5c1105>] __alloc_pages_nodemask+0x405/0x420
[4238992.996322] [<ffffffffaf60df68>] alloc_pages_current+0x98/0x110
[4238992.996322] [<ffffffffaf619033>] new_slab+0x383/0x390
[4238992.996322] [<ffffffffaf61a9ec>] ___slab_alloc+0x3ac/0x4f0
[4238992.996322] [<ffffffffafa8d031>] ? inet_bind_bucket_create+0x21/0x70
[4238992.996322] [<ffffffffafae44d3>] ? __xfrm_policy_lookup+0x53/0x60
[4238992.996322] [<ffffffffafa8d031>] ? inet_bind_bucket_create+0x21/0x70
[4238992.996322] [<ffffffffafb5e60c>] __slab_alloc+0x40/0x5c
[4238992.996322] [<ffffffffaf61c08b>] kmem_cache_alloc+0x19b/0x1f0
[4238992.996322] [<ffffffffafa57611>] ? flow_cache_lookup+0x251/0x430
[4238992.996322] [<ffffffffafa8d031>] inet_bind_bucket_create+0x21/0x70
[4238992.996322] [<ffffffffafa8d51d>] __inet_hash_connect+0x25d/0x3f0
[4238992.996322] [<ffffffffafa8ca60>] ? __inet_lookup_listener+0x310/0x310
[4238992.996322] [<ffffffffafa8cd30>] ? inet_hash+0x160/0x160
[4238993.100019] [<ffffffffafa8d6eb>] inet_hash_connect+0x3b/0x40
[4238993.103078] [<ffffffffafaa7b2e>] tcp_v4_connect+0x29e/0x4a0
[4238993.103078] [<ffffffffafabece3>] __inet_stream_connect+0xc3/0x350
[4238993.103078] [<ffffffffaf705348>] ? selinux_socket_post_create+0xc8/0x230
[4238993.103078] [<ffffffffafabefa8>] inet_stream_connect+0x38/0x50
[4238993.103078] [<ffffffffc0652bcc>] generic_ip_connect+0xdc/0x4d0 [cifs]
[4238993.103078] [<ffffffffc0654af3>] cifs_reconnect+0x2c3/0x610 [cifs]
[4238993.103078] [<ffffffffc06550ee>] cifs_readv_from_socket+0x2ae/0x310 [cifs]
[4238993.103078] [<ffffffffc065555c>] cifs_demultiplex_thread+0x11c/0xaa0 [cifs]
[4238993.103078] [<ffffffffc0655440>] ? cifs_handle_standard+0x1b0/0x1b0 [cifs]
[4238993.103078] [<ffffffffaf4c1c31>] kthread+0xd1/0xe0
[4238993.103078] [<ffffffffaf4c1b60>] ? insert_kthread_work+0x40/0x40
[4238993.103078] [<ffffffffafb74c37>] ret_from_fork_nospec_begin+0x21/0x21
[4238993.103078] [<ffffffffaf4c1b60>] ? insert_kthread_work+0x40/0x40
[4238993.103078] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[4238993.103078] cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
[4238993.103078] node 0: slabs: 5070, objs: 324480, free: 0
[4238993.894036] swapper/8: page allocation failure: order:0, mode:0x20
[4238993.896797] swapper/11: page allocation failure: order:0, mode:0x20
[4238993.896801] CPU: 11 PID: 0 Comm: swapper/11 Tainted: G ------------ T 3.10.0-957.1.3.el7.x86_64 #1
[4238993.896801] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
[4238993.896824] Call Trace:
[4238993.896833] <IRQ> [<ffffffffafb61e41>] dump_stack+0x19/0x1b
[4238993.896837] [<ffffffffaf5bcaa0>] warn_alloc_failed+0x110/0x180
[4238993.896841] [<ffffffffaf4e0056>] ? update_curr+0x86/0x1e0
[4238993.896844] [<ffffffffafb5d44e>] __alloc_pages_slowpath+0x6b6/0x724
[4238993.896846] [<ffffffffaf5c1105>] __alloc_pages_nodemask+0x405/0x420
[4238993.896848] [<ffffffffaf5c1478>] page_frag_alloc+0x158/0x170
[4238993.896851] [<ffffffffafa27e1d>] __napi_alloc_skb+0x8d/0xe0
[4238993.896880] [<ffffffffc00fe040>] netvsc_recv_callback+0x80/0x190 [hv_netvsc]
[4238993.896885] [<ffffffffc01012f3>] rndis_filter_receive+0x273/0x570 [hv_netvsc]
[4238993.896900] [<ffffffffaf4cf7ca>] ? update_rq_clock.part.78+0x1a/0x150
[4238993.896905] [<ffffffffc00fe6e9>] netvsc_poll+0x1a9/0x720 [hv_netvsc]
[4238993.896908] [<ffffffffafa39f1f>] net_rx_action+0x26f/0x390
[4238993.896911] [<ffffffffaf4a0f05>] __do_softirq+0xf5/0x280
[4238993.896912] [<ffffffffafb7832c>] call_softirq+0x1c/0x30
[4238993.896915] [<ffffffffaf42e675>] do_softirq+0x65/0xa0
[4238993.896917] [<ffffffffaf4a1285>] irq_exit+0x105/0x110
[4238993.896920] [<ffffffffaf455244>] hyperv_vector_handler+0x64/0x70
[4238993.896921] [<ffffffffafb78722>] hyperv_callback_vector+0x162/0x170
[4238993.896925] <EOI> [<ffffffffafb69aa0>] ? __cpuidle_text_start+0x8/0x8
[4238993.896927] [<ffffffffafb69ca6>] ? native_safe_halt+0x6/0x10
[4238993.896928] [<ffffffffafb69abe>] default_idle+0x1e/0xc0
[4238993.896930] [<ffffffffaf4366f0>] arch_cpu_idle+0x20/0xc0
[4238993.896933] [<ffffffffaf4fc49a>] cpu_startup_entry+0x14a/0x1e0
[4238993.896935] [<ffffffffaf457db7>] start_secondary+0x1f7/0x270
[4238993.896937] [<ffffffffaf4000d5>] start_cpu+0x5/0x14
[4238993.963108] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G ------------ T 3.10.0-957.1.3.el7.x86_64 #1
[4238993.963108] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
[4238993.963108] Call Trace:
[4238993.963108] <IRQ> [<ffffffffafb61e41>] dump_stack+0x19/0x1b
[4238993.963108] [<ffffffffaf5bcaa0>] warn_alloc_failed+0x110/0x180
[4238993.963108] [<ffffffffafb5d44e>] __alloc_pages_slowpath+0x6b6/0x724
[4238993.963108] [<ffffffffaf4b7224>] ? wake_up_worker+0x24/0x30
[4238993.963108] [<ffffffffaf5c1105>] __alloc_pages_nodemask+0x405/0x420
[4238993.963108] [<ffffffffaf5c1478>] page_frag_alloc+0x158/0x170
[4238994.113107] [<ffffffffafa27e1d>] __napi_alloc_skb+0x8d/0xe0
[4238994.113107] [<ffffffffc00fe040>] netvsc_recv_callback+0x80/0x190 [hv_netvsc]
[4238994.113107] [<ffffffffc01012f3>] rndis_filter_receive+0x273/0x570 [hv_netvsc]
[4238994.113107] [<ffffffffaf8a2e37>] ? put_device+0x17/0x20
[4238994.113107] [<ffffffffaf8da36c>] ? scsi_next_command+0x2c/0x40
[4238994.113107] [<ffffffffc00fe6e9>] netvsc_poll+0x1a9/0x720 [hv_netvsc]
[4238994.113107] [<ffffffffafa39f1f>] net_rx_action+0x26f/0x390
[4238994.113107] [<ffffffffaf4a0f05>] __do_softirq+0xf5/0x280
[4238994.113107] [<ffffffffafb7832c>] call_softirq+0x1c/0x30
[4238994.113107] [<ffffffffaf42e675>] do_softirq+0x65/0xa0
[4238994.113107] [<ffffffffaf4a1285>] irq_exit+0x105/0x110
[4238994.113107] [<ffffffffaf455244>] hyperv_vector_handler+0x64/0x70
[4238994.113107] [<ffffffffafb78722>] hyperv_callback_vector+0x162/0x170
[4238994.113107] <EOI> [<ffffffffafb69aa0>] ? __cpuidle_text_start+0x8/0x8
[4238994.113107] [<ffffffffafb69ca6>] ? native_safe_halt+0x6/0x10
[4238994.113107] [<ffffffffafb69abe>] default_idle+0x1e/0xc0
[4238994.113107] [<ffffffffaf4366f0>] arch_cpu_idle+0x20/0xc0
[4238994.113107] [<ffffffffaf4fc49a>] cpu_startup_entry+0x14a/0x1e0
[4238994.113107] [<ffffffffaf457db7>] start_secondary+0x1f7/0x270
[4238994.113107] [<ffffffffaf4000d5>] start_cpu+0x5/0x14
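Those traces are all order-0 atomic (mode 0x20) allocation failures in the hv_netvsc receive path, plus SLUB reporting zero free objects, which looks like the node simply ran out of free memory; that would also stall the kubelet heartbeat and flip the zone into FullDisruption. A quick way to see how much headroom the kubelet leaves on the node (node name is a placeholder):
$ kubectl describe node <worker-node> | grep -A 6 'Allocatable:'
$ kubectl top node <worker-node>    # needs heapster/metrics-server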
I noticed some of my pods were crashing last night. Checking the kube-controller-manager logs, I found the following:
I0126 01:03:24.264593 1 nodecontroller.go:608] NodeController detected that zone brazilsouth::2 is now in state FullDisruption.
After that, one node and the master went NotReady:
What does FullDisruption mean? Is it a failure on Azure itself or on the k8s side? I couldn't find any failure reported by Azure for that timestamp. I've also checked my VMs' hardware monitoring and nothing stands out.
Thanks