antrea-io / antrea

Kubernetes networking based on Open vSwitch
https://antrea.io
Apache License 2.0

On a scale setup with 100k Pods, 50k Services and 50k NetworkPolicies, accessing Service IPs from Pods fails with a timeout error. #475

Closed. rbhamre closed this issue 4 years ago

rbhamre commented 4 years ago

Describe the bug: On a scale setup with 100k Pods, 50k Services and 50k NetworkPolicies, accessing Service IPs from Pods fails with a timeout error.

NOTE: Because of the network-policy issue https://github.com/vmware-tanzu/antrea/issues/442, the latest Antrea build was used, and this issue was observed with it. Below are the details of the Antrea build used in the setup:

antctl version

{ "agentVersion": "v0.5.0-dev-bf8ed18", "antctlVersion": "v0.5.0-dev-bf8ed18" }

To Reproduce

Expected

Actual behavior

We can see that accessing the Service IPs failed:

root@oc-runner:~/nsx-qe/spark57745# kubectl exec -it -n scale-ns-1 coffee-rc-1-vqncb -- curl --connect-timeout 10 10.105.202.37
curl: (28) Connection timed out after 10001 milliseconds
command terminated with exit code 28

root@oc-runner:~/nsx-qe/spark57746# kubectl exec -it -n scale-ns-1 coffee-rc-1-vqncb -- curl --connect-timeout 10 10.105.202.37
curl: (28) Connection timed out after 10000 milliseconds
command terminated with exit code 28

Versions: Please provide the following information:

Antrea version (Docker image tag):

antctl version
{ "agentVersion": "v0.5.0-dev-bf8ed18", "antctlVersion": "v0.5.0-dev-bf8ed18" }

Kubernetes version (use kubectl version). If your Kubernetes components have different versions, please provide the version for all of them.

kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:36:53Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

Container runtime: which runtime are you using (e.g. containerd, cri-o, docker) and which version are you using?

Linux kernel version on the Kubernetes Nodes (uname -r):

~# uname -r
4.4.0-131-generic

If you chose to compile the Open vSwitch kernel module manually instead of using the kernel module built into the Linux kernel, which version of the OVS kernel module are you using? Include the output of modinfo openvswitch for the Kubernetes Nodes.

modinfo openvswitch

filename:       /lib/modules/4.4.0-131-generic/kernel/net/openvswitch/openvswitch.ko
license:        GPL
description:    Open vSwitch switching datapath
srcversion:     B5716AFB39C1B8C5FBC323B
depends:        nf_conntrack,libcrc32c,nf_defrag_ipv6
retpoline:      Y
intree:         Y
vermagic:       4.4.0-131-generic SMP mod_unload modversions retpoline

Additional context Add any other context about the problem here, such as Antrea logs, kubelet logs, etc. (Please consider pasting long output into a GitHub gist or any other pastebin.)

tnqn commented 4 years ago

When the access failed, the packets were NATed to a specific Pod. Packets captured on the network device of the target Pod:

06:05:23.537595 92:04:ab:f7:f1:eb > aa:bb:cc:dd:ee:ff, ethertype IPv4 (0x0800), length 74: 192.2.139.2.48996 > 192.2.81.2.80: Flags [S], seq 3954618874, win 28200, options [mss 1410,sackOK,TS val 126458400 ecr 0,nop,wscale 7], length 0

The packets still have the virtual MAC as the dst MAC, so the L3 forwarding flows (which should rewrite it to the Pod's MAC) are not working. Dumped the flows:

# ovs-ofctl dump-flows br-int
 cookie=0x0, duration=20432.065s, table=0, n_packets=68, n_bytes=5056, priority=0 actions=NORMAL

It should be caused by the flows not being replayed after OVS restarted, but it is not blocked by the same reason as #442.

I0306 00:38:53.025782       1 reconciler.go:172] Reconciling rule ca02c536ca18567d
time="2020-03-06T00:40:14Z" level=warning msg="InboundError EOF"
time="2020-03-06T00:40:14Z" level=info msg="Closing OpenFlow message stream."
time="2020-03-06T00:40:14Z" level=warning msg="Received ERROR message from switch 00:00:92:1a:1e:06:7a:4f. Err: EOF"
I0306 00:40:14.545995       1 ofctrl_bridge.go:198] OFSwitch is disconnected: 00:00:92:1a:1e:06:7a:4f
time="2020-03-06T00:40:14Z" level=info msg="Initialize connection or re-connect to /var/run/openvswitch/br-int.mgmt."
time="2020-03-06T00:40:14Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, re
try after 1 second."
time="2020-03-06T00:40:15Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, re
try after 1 second."
time="2020-03-06T00:40:16Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, re
try after 1 second."
time="2020-03-06T00:40:17Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, re
try after 1 second."
time="2020-03-06T00:40:18Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, re
try after 1 second."
I0306 00:54:57.404221       1 networkpolicy_controller.go:318] Stop watching NetworkPolicies, total 210 items received
I0306 00:54:57.404257       1 networkpolicy_controller.go:307] Start watching NetworkPolicies
I0306 00:57:23.205109       1 networkpolicy_controller.go:263] Stop watching AddressGroups, total 105 items received
I0306 00:57:23.205146       1 networkpolicy_controller.go:252] Start watching AddressGroups
I0306 00:57:23.309681       1 reconciler.go:172] Reconciling rule d928ef264294e102
I0306 00:57:23.309889       1 reconciler.go:172] Reconciling rule 337f102f3943493f
I0306 00:57:23.309934       1 reconciler.go:172] Reconciling rule 3b37f4307838643f
I0306 00:57:23.310005       1 reconciler.go:172] Reconciling rule e6db7bbf6015d09b
I0306 00:57:23.310029       1 reconciler.go:172] Reconciling rule 29efbff44cfb5086
I0306 00:57:23.310056       1 reconciler.go:172] Reconciling rule d2190cc96dbeed91
I0306 00:57:23.310107       1 reconciler.go:172] Reconciling rule 59e4201ca118f508

This time it didn't even start replaying flows; it seems the agent didn't think OVS was back.
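
For context, the flow replay is expected to be triggered by a "switch connected" notification from the OpenFlow client; if that event never arrives, nothing re-installs the flows and only the default NORMAL flow remains on the bridge. A minimal sketch of that pattern, using hypothetical names rather than the actual Antrea code:

```go
package main

import (
	"log"
	"time"
)

// ofClient is a stand-in for the agent's OpenFlow client.
type ofClient struct {
	connectCh chan struct{} // signaled each time the switch (re)connects
}

// ReplayFlows would re-install all flows kept in the client's in-memory cache.
func (c *ofClient) ReplayFlows() {
	log.Println("Replaying OF flows to OVS bridge")
	// ... write the cached flows back to the bridge here ...
	log.Println("Flow replay completed")
}

// watchConnection replays the flows every time a reconnect is observed. If the
// "switch connected" event is never delivered (as in the log above), this loop
// never fires and the bridge keeps only the default NORMAL flow.
func (c *ofClient) watchConnection() {
	for range c.connectCh {
		c.ReplayFlows()
	}
}

func main() {
	c := &ofClient{connectCh: make(chan struct{}, 1)}
	go c.watchConnection()

	// Simulate OVS coming back after a restart.
	c.connectCh <- struct{}{}
	time.Sleep(time.Second) // give the watcher a chance to run
}
```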

tnqn commented 4 years ago

The issue is not specific to the master branch; I can reproduce it with v0.4.0 too. I killed the antrea-ovs container, and if it took more than 5 seconds to restart OVS, the agent lost its connection to it and could not recover. Successful reconnecting case:

time="2020-03-06T06:44:02Z" level=warning msg="InboundError EOF"
time="2020-03-06T06:44:02Z" level=info msg="Closing OpenFlow message stream."
time="2020-03-06T06:44:02Z" level=warning msg="Received ERROR message from switch 00:00:82:e1:72:f3:f4:49. Err: EOF"
I0306 06:44:02.552976       1 ofctrl_bridge.go:198] OFSwitch is disconnected: 00:00:82:e1:72:f3:f4:49
time="2020-03-06T06:44:02Z" level=info msg="Initialize connection or re-connect to /var/run/openvswitch/br-int.mgmt."
time="2020-03-06T06:44:02Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, retry after 1 second."
time="2020-03-06T06:44:03Z" level=info msg="Connected to socket /var/run/openvswitch/br-int.mgmt"
time="2020-03-06T06:44:03Z" level=info msg="New connection.."
time="2020-03-06T06:44:03Z" level=info msg="Send hello with OF version: 4"
time="2020-03-06T06:44:03Z" level=info msg="Received Openflow 1.3 Hello message"
time="2020-03-06T06:44:03Z" level=info msg="Received ofp1.3 Switch feature response: {Header:{Version:4 Type:6 Length:32 Xid:182} DPID:00:00:82:e1:72:f3:f4:49 Buffers:0 NumTables:254 AuxilaryId:0 pad:[0 0] Capabilities:79 Actions:0 Ports:[]}"
time="2020-03-06T06:44:03Z" level=info msg="Openflow Connection for new switch: 00:00:82:e1:72:f3:f4:49"
I0306 06:44:03.554063       1 ofctrl_bridge.go:178] OFSwitch is connected: 00:00:82:e1:72:f3:f4:49
I0306 06:44:03.554093       1 agent.go:317] Replaying OF flows to OVS bridge
I0306 06:44:03.569873       1 agent.go:319] Flow replay completed

Failed reconnecting case:

time="2020-03-06T06:44:39Z" level=warning msg="InboundError EOF"
time="2020-03-06T06:44:39Z" level=info msg="Closing OpenFlow message stream."
time="2020-03-06T06:44:39Z" level=warning msg="Received ERROR message from switch 00:00:82:e1:72:f3:f4:49. Err: EOF"
I0306 06:44:39.580995       1 ofctrl_bridge.go:198] OFSwitch is disconnected: 00:00:82:e1:72:f3:f4:49
time="2020-03-06T06:44:39Z" level=info msg="Initialize connection or re-connect to /var/run/openvswitch/br-int.mgmt."
time="2020-03-06T06:44:39Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, retry after 1 second."
time="2020-03-06T06:44:40Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, retry after 1 second."
time="2020-03-06T06:44:41Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, retry after 1 second."
time="2020-03-06T06:44:42Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, retry after 1 second."
time="2020-03-06T06:44:43Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, retry after 1 second."
E0306 06:45:02.562728       1 ovs_client.go:665] Transaction failed: no connection
E0306 06:45:02.562751       1 querier.go:89] Failed to get OVS client version: no connection

Looks like a bug in the reconnecting mechanism of ovslib, but I didn't see the OpenFlow client recover either; are they related? @wenyingd @jianjuns
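
The failed case shows exactly five one-second retries and then silence, followed by "Transaction failed: no connection" errors. That is consistent with a bounded retry loop along the lines of the sketch below (hypothetical names and structure, not the actual library code):

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

const (
	maxRetries    = 5
	retryInterval = time.Second
)

// connectWithRetry dials the OVS bridge management socket, retrying a fixed
// number of times. Returning an error with no further retries (and no
// notification to the caller) is what would leave the agent thinking OVS
// never came back.
func connectWithRetry(sockPath string) (net.Conn, error) {
	for i := 0; i < maxRetries; i++ {
		conn, err := net.Dial("unix", sockPath)
		if err == nil {
			return conn, nil
		}
		fmt.Printf("Failed to connect to %s: %v, retry after 1 second.\n", sockPath, err)
		time.Sleep(retryInterval)
	}
	return nil, errors.New("no connection")
}

func main() {
	// If ovs-vswitchd needs more than maxRetries*retryInterval to restart,
	// every attempt hits "connection refused" and all later transactions
	// fail with "no connection", matching the log above.
	if _, err := connectWithRetry("/var/run/openvswitch/br-int.mgmt"); err != nil {
		fmt.Println("Transaction failed:", err)
	}
}
```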

tnqn commented 4 years ago

To clarify, the ovsdb connection is recovered; it's the OpenFlow client that is not recovered. @wenyingd has found where the problem is; she will update and provide a fix.
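
For illustration only, one possible shape for such a fix is to keep retrying indefinitely with a capped backoff and to notify the agent on success so it can replay its cached flows. This is a sketch under those assumptions, not necessarily the change that was actually made:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// connectForever keeps dialing the management socket until it succeeds,
// backing off up to maxBackoff between attempts, then hands the connection
// to the caller so the cached flows can be replayed.
func connectForever(sockPath string, connected chan<- net.Conn) {
	const maxBackoff = 8 * time.Second
	backoff := time.Second
	for {
		conn, err := net.Dial("unix", sockPath)
		if err == nil {
			connected <- conn
			return
		}
		fmt.Printf("Failed to connect to %s: %v, retry in %v.\n", sockPath, err, backoff)
		time.Sleep(backoff)
		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}

func main() {
	connected := make(chan net.Conn, 1)
	go connectForever("/var/run/openvswitch/br-int.mgmt", connected)

	conn := <-connected
	defer conn.Close()
	fmt.Println("Reconnected; the agent can now replay its flows.")
}
```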