antrea-io / antrea


After performing NetworkPolicy churn on a scale setup, all OpenFlow flows are missing from a few antrea-agent pods, which causes 100% traffic drop. #442

Closed: rbhamre closed this issue 4 years ago

rbhamre commented 4 years ago

Describe the bug

Below are a few antrea-agent pods on which flows are missing:

kubectl exec -it -n kube-system antrea-agent-fgmfv -- bash

root@prom-05056996f45:/# ovs-ofctl dump-flows br-int
 cookie=0x0, duration=55801.825s, table=0, n_packets=24821, n_bytes=2197706, priority=0 actions=NORMAL
root@prom-05056996f45:/#

kubectl exec -it -n kube-system antrea-agent-j7wkj -- bash

root@prom-0505699ee18:/# ovs-ofctl dump-flows br-int
 cookie=0x0, duration=46382.273s, table=0, n_packets=30903, n_bytes=2788534, priority=0 actions=NORMAL
root@prom-0505699ee18:/#

To Reproduce

Expected

Actual behavior

After a certain number of iterations, all OpenFlow flows are missing from a few antrea-agents, which causes a 100% traffic drop for the pods hosted on those worker nodes.

Versions: Please provide the following information:

Antrea version (Docker image tag).

antctl version

{
  "agentVersion": "v0.4.0",
  "antctlVersion": "v0.4.0"
}

Kubernetes version (use kubectl version). If your Kubernetes components have different versions, please provide the version for all of them.

kubectl version

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:36:53Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

Container runtime: which runtime are you using (e.g. containerd, cri-o, docker) and which version are you using?

Linux kernel version on the Kubernetes Nodes (uname -r).
~# uname -r
4.4.0-131-generic

If you chose to compile the Open vSwitch kernel module manually instead of using the kernel module built into the Linux kernel, which version of the OVS kernel module are you using? Include the output of modinfo openvswitch for the Kubernetes Nodes.

modinfo openvswitch

filename:       /lib/modules/4.4.0-131-generic/kernel/net/openvswitch/openvswitch.ko
license:        GPL
description:    Open vSwitch switching datapath
srcversion:     B5716AFB39C1B8C5FBC323B
depends:        nf_conntrack,libcrc32c,nf_defrag_ipv6
retpoline:      Y
intree:         Y
vermagic:       4.4.0-131-generic SMP mod_unload modversions retpoline

Additional context

Add any other context about the problem here, such as Antrea logs, kubelet logs, etc.

(Please consider pasting long output into a GitHub gist or any other pastebin.)

rbhamre commented 4 years ago

antrea-agent-1.tar.gz antrea-agent-2.tar.gz

tnqn commented 4 years ago

Thanks @rbhamre again for reporting this!

Since this only happened during the NetworkPolicy churn test, I suspect it's due to a deadlock or a hanging NetworkPolicy goroutine. Looking at the antrea-ovs container state, it was terminated at 15:44:30:

    State:          Running
      Started:      Thu, 27 Feb 2020 15:44:31 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 26 Feb 2020 04:09:47 +0000
      Finished:     Thu, 27 Feb 2020 15:44:30 +0000

antrea-agent logs:

I0227 15:44:29.589766       1 reconciler.go:420] Forgetting rule 91479e4303af1dce
I0227 15:44:29.589772       1 reconciler.go:420] Forgetting rule d13d47cf305af106
time="2020-02-27T15:44:30Z" level=warning msg="InboundError EOF"
time="2020-02-27T15:44:30Z" level=info msg="Closing OpenFlow message stream."
time="2020-02-27T15:44:30Z" level=warning msg="Received ERROR message from switch 00:00:f2:5b:7b:0c:97:41. Err: EOF"
I0227 15:44:30.694862       1 ofctrl_bridge.go:198] OFSwitch is disconnected: 00:00:f2:5b:7b:0c:97:41
time="2020-02-27T15:44:30Z" level=info msg="Initialize connection or re-connect to /var/run/openvswitch/br-int.mgmt."
time="2020-02-27T15:44:30Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, retry after 1 second."
I0227 15:44:30.747947       1 reconciler.go:174] Reconciling rule b39edc3b89589e0b
I0227 15:44:30.748012       1 reconciler.go:174] Reconciling rule ace1cc1d7971081e
I0227 15:44:31.031841       1 reconciler.go:174] Reconciling rule aae580584853353f
I0227 15:44:31.031889       1 reconciler.go:174] Reconciling rule 80adac5840bae983
time="2020-02-27T15:44:31Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, retry after 1 second."
I0227 15:44:31.807598       1 reconciler.go:420] Forgetting rule ace1cc1d7971081e
I0227 15:44:32.095370       1 reconciler.go:420] Forgetting rule 80adac5840bae983
time="2020-02-27T15:44:32Z" level=info msg="Connected to socket /var/run/openvswitch/br-int.mgmt"
time="2020-02-27T15:44:32Z" level=info msg="New connection.."
time="2020-02-27T15:44:32Z" level=info msg="Send hello with OF version: 4"
time="2020-02-27T15:44:32Z" level=info msg="Received Openflow 1.3 Hello message"
time="2020-02-27T15:44:32Z" level=info msg="Received ofp1.3 Switch feature response: {Header:{Version:4 Type:6 Length:32 Xid:834655} DPID:00:00:f2:5b:7b:0c:97:41 Buffers:0 NumTables:254 AuxilaryId:0 pad:[0 0] Capabilities:79 Actions:0 Ports:[]}"
time="2020-02-27T15:44:32Z" level=info msg="Openflow Connection for new switch: 00:00:f2:5b:7b:0c:97:41"
I0227 15:44:32.695797       1 ofctrl_bridge.go:178] OFSwitch is connected: 00:00:f2:5b:7b:0c:97:41
I0227 15:44:32.695824       1 agent.go:317] Replaying OF flows to OVS bridge
E0227 15:44:33.675645       1 ovs_client.go:665] Transaction failed: no connection
E0227 15:44:33.675670       1 querier.go:89] Failed to get OVS client version: no connection
I0227 15:44:33.722436       1 reconciler.go:174] Reconciling rule 6e87fff421a5ddda
I0227 15:44:33.722478       1 reconciler.go:174] Reconciling rule bf1005ec301a0f9c
I0227 15:44:34.090648       1 reconciler.go:174] Reconciling rule 1e858e7ee89a6bdb
I0227 15:55:18.861281       1 networkpolicy_controller.go:208] Stop watching AppliedToGroups, total 4901 items received
I0227 15:55:18.861312       1 networkpolicy_controller.go:197] Start watching AppliedToGroups

At 15:44:32.695824, it tried to replay OF flows after reconnecting to OVS, but the replay never finished (otherwise there would be a "Flow replay completed" log). ReplayFlows needs the replayMutex lock, which might have been acquired by the NetworkPolicy goroutines via InstallPolicyRuleFlows. I discussed with @wenyingd: there could be logic in the openflow client that is affected by OVS restarting. If a bundle has been sent but never gets a reply, the client might wait for it forever, so the function never returns and never releases the lock. @wenyingd is trying to reproduce it by building a similar scenario; she will confirm whether this is the root cause and provide a fix if it is.
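For illustration only, a minimal sketch in Go of the suspected deadlock pattern, assuming the bundle reply simply never arrives after the OVS restart (replayMutex, bundleReply and the goroutine structure below are simplified placeholders, not the actual Antrea/ofnet code):

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func main() {
        // A goroutine on the NetworkPolicy path takes the replay lock and then
        // waits for a bundle reply that never arrives because OVS restarted.
        var replayMutex sync.RWMutex
        bundleReply := make(chan struct{}) // no reply is ever sent after the restart

        go func() {
            replayMutex.RLock()
            defer replayMutex.RUnlock()
            <-bundleReply // blocks forever
        }()

        time.Sleep(100 * time.Millisecond) // let the goroutine grab the lock

        // The flow-replay path after reconnecting to OVS needs the write lock,
        // which can never be acquired while the reader above is stuck.
        replayed := make(chan struct{})
        go func() {
            replayMutex.Lock()
            defer replayMutex.Unlock()
            close(replayed)
        }()

        select {
        case <-replayed:
            fmt.Println("Flow replay completed")
        case <-time.After(2 * time.Second):
            fmt.Println("flow replay is stuck waiting for the replay lock")
        }
    }

Running this sketch hits the timeout branch, which matches the missing "Flow replay completed" log above.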

wenyingd commented 4 years ago

I added a test and found that the functions that send OpenFlow messages might not return if OVS is restarted at the same time. There are two possible reasons the function can block: 1) OFSwitch is writing a message to the outbound channel, but there is no consumer on the peer side because the connection to OVS is broken; 2) the Bundle implementation is waiting for the reply to the BundleControl message, but after OVS is restarted, no reply is ever sent by OVS. I will make a fix in ofnet to add a timeout both in OFSwitch.Send and in the Bundle implementation where it waits for the reply, and add an integration test in Antrea for this case.
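For illustration only, a minimal sketch in Go of how such a bounded wait could work, assuming the reply is delivered on a channel (waitWithTimeout and the channel type are illustrative placeholders, not the actual ofnet API):

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // errTimeout is returned when no reply arrives before the deadline, so the
    // caller can fail the operation and release any locks instead of blocking.
    var errTimeout = errors.New("timed out waiting for OpenFlow reply")

    // waitWithTimeout bounds the wait for a reply instead of blocking forever.
    func waitWithTimeout(reply <-chan struct{}, timeout time.Duration) error {
        select {
        case <-reply:
            return nil
        case <-time.After(timeout):
            return errTimeout
        }
    }

    func main() {
        reply := make(chan struct{}) // no reply will ever arrive after an OVS restart
        if err := waitWithTimeout(reply, 2*time.Second); err != nil {
            fmt.Println("bundle failed:", err) // caller can now return and retry
        }
    }

With a bounded wait, the caller gets an error after the deadline, can release the replay lock, and the agent can retry once OVS is back.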