Closed: @rbhamre closed this issue 4 years ago
Thanks @rbhamre again for reporting this!
Since this only happened during the NetworkPolicy churn test, I suspect it's due to a deadlock or a hanging goroutine in the NetworkPolicy code. From the antrea-ovs container state, it was terminated at 15:44:30:
```
State:          Running
  Started:      Thu, 27 Feb 2020 15:44:31 +0000
Last State:     Terminated
  Reason:       Error
  Exit Code:    137
  Started:      Wed, 26 Feb 2020 04:09:47 +0000
  Finished:     Thu, 27 Feb 2020 15:44:30 +0000
```
antrea-agent logs:
```
I0227 15:44:29.589766 1 reconciler.go:420] Forgetting rule 91479e4303af1dce
I0227 15:44:29.589772 1 reconciler.go:420] Forgetting rule d13d47cf305af106
time="2020-02-27T15:44:30Z" level=warning msg="InboundError EOF"
time="2020-02-27T15:44:30Z" level=info msg="Closing OpenFlow message stream."
time="2020-02-27T15:44:30Z" level=warning msg="Received ERROR message from switch 00:00:f2:5b:7b:0c:97:41. Err: EOF"
I0227 15:44:30.694862 1 ofctrl_bridge.go:198] OFSwitch is disconnected: 00:00:f2:5b:7b:0c:97:41
time="2020-02-27T15:44:30Z" level=info msg="Initialize connection or re-connect to /var/run/openvswitch/br-int.mgmt."
time="2020-02-27T15:44:30Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, retry after 1 second."
I0227 15:44:30.747947 1 reconciler.go:174] Reconciling rule b39edc3b89589e0b
I0227 15:44:30.748012 1 reconciler.go:174] Reconciling rule ace1cc1d7971081e
I0227 15:44:31.031841 1 reconciler.go:174] Reconciling rule aae580584853353f
I0227 15:44:31.031889 1 reconciler.go:174] Reconciling rule 80adac5840bae983
time="2020-02-27T15:44:31Z" level=error msg="Failed to connect to /var/run/openvswitch/br-int.mgmt: dial unix /var/run/openvswitch/br-int.mgmt: connect: connection refused, retry after 1 second."
I0227 15:44:31.807598 1 reconciler.go:420] Forgetting rule ace1cc1d7971081e
I0227 15:44:32.095370 1 reconciler.go:420] Forgetting rule 80adac5840bae983
time="2020-02-27T15:44:32Z" level=info msg="Connected to socket /var/run/openvswitch/br-int.mgmt"
time="2020-02-27T15:44:32Z" level=info msg="New connection.."
time="2020-02-27T15:44:32Z" level=info msg="Send hello with OF version: 4"
time="2020-02-27T15:44:32Z" level=info msg="Received Openflow 1.3 Hello message"
time="2020-02-27T15:44:32Z" level=info msg="Received ofp1.3 Switch feature response: {Header:{Version:4 Type:6 Length:32 Xid:834655} DPID:00:00:f2:5b:7b:0c:97:41 Buffers:0 NumTables:254 AuxilaryId:0 pad:[0 0] Capabilities:79 Actions:0 Ports:[]}"
time="2020-02-27T15:44:32Z" level=info msg="Openflow Connection for new switch: 00:00:f2:5b:7b:0c:97:41"
I0227 15:44:32.695797 1 ofctrl_bridge.go:178] OFSwitch is connected: 00:00:f2:5b:7b:0c:97:41
I0227 15:44:32.695824 1 agent.go:317] Replaying OF flows to OVS bridge
E0227 15:44:33.675645 1 ovs_client.go:665] Transaction failed: no connection
E0227 15:44:33.675670 1 querier.go:89] Failed to get OVS client version: no connection
I0227 15:44:33.722436 1 reconciler.go:174] Reconciling rule 6e87fff421a5ddda
I0227 15:44:33.722478 1 reconciler.go:174] Reconciling rule bf1005ec301a0f9c
I0227 15:44:34.090648 1 reconciler.go:174] Reconciling rule 1e858e7ee89a6bdb
I0227 15:55:18.861281 1 networkpolicy_controller.go:208] Stop watching AppliedToGroups, total 4901 items received
I0227 15:55:18.861312 1 networkpolicy_controller.go:197] Start watching AppliedToGroups
```
At 15:44:32.695824, it tried to replay OF flows after reconnecting to OVS but never finished (otherwise there should be a "Flow replay completed" log). `ReplayFlows` needs the `replayMutex` lock, which might have been acquired by the NetworkPolicy routines via `InstallPolicyRuleFlows`.
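The suspected lock interaction can be sketched as follows. This is a minimal, hypothetical Go reproduction (not the actual Antrea code; only the `replayMutex` name is taken from the discussion above): a NetworkPolicy routine holds a read lock while its send to OVS never returns, so the replay routine, which needs the write lock, blocks forever.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// agent is a hypothetical stand-in for the antrea-agent's flow client.
type agent struct {
	replayMutex sync.RWMutex
}

// installPolicyRuleFlows holds the read lock while "sending" flows to OVS.
// If the send never returns (e.g. OVS restarted mid-request), the read
// lock is never released.
func (a *agent) installPolicyRuleFlows(block chan struct{}) {
	a.replayMutex.RLock()
	defer a.replayMutex.RUnlock()
	<-block // simulates a send that never completes
}

// replayFlows needs the write lock, so it waits behind the stuck reader.
func (a *agent) replayFlows() {
	a.replayMutex.Lock()
	defer a.replayMutex.Unlock()
	fmt.Println("Flow replay completed")
}

func main() {
	a := &agent{}
	block := make(chan struct{})
	go a.installPolicyRuleFlows(block)
	time.Sleep(10 * time.Millisecond) // let the reader grab the lock

	done := make(chan struct{})
	go func() { a.replayFlows(); close(done) }()

	select {
	case <-done:
		fmt.Println("replay finished in time")
	case <-time.After(100 * time.Millisecond):
		fmt.Println("replay blocked: replayMutex held by policy routine")
	}
	close(block) // unblock the reader so the program can exit cleanly
	<-done
}
```

This matches the symptom in the logs: "Replaying OF flows to OVS bridge" appears, but "Flow replay completed" never does.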
Discussed with @wenyingd: there could be logic in the openflow client that is affected by OVS restarting. If a bundle has been sent but never gets a reply, the client might wait for it forever, so the function never returns and never releases the lock.
@wenyingd is trying to reproduce it by building a similar scenario; she will confirm whether this is the root cause and provide a fix if it is.
I added a test and found that the functions that send OpenFlow messages might not return if OVS is restarted at the same time. There are two possible reasons the function blocks: 1) OFSwitch is writing a message to the outbound channel, but there is no consumer at the peer because the connection to OVS is broken. 2) The bundle implementation is waiting for a reply to the BundleControl message, but after OVS is restarted, no reply is ever sent by OVS.
I will make a fix in ofnet to add a timeout both in `OFSwitch.Send` and in the bundle implementation where it waits for a reply, and add an integration test in Antrea for this case.
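The shape of that fix can be sketched as below. This is an illustrative Go sketch, not the actual ofnet change; the names `sendWithTimeout` and `waitForBundleReply` are hypothetical. Both blocking points, writing to the outbound channel and waiting for the BundleControl reply, are bounded with a `select` against a timer so they return an error instead of hanging when OVS restarts.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("timed out waiting for OVS")

// sendWithTimeout covers blocking reason 1: the outbound channel has no
// consumer because the connection to OVS is broken, so a plain channel
// send would block forever.
func sendWithTimeout(outbound chan<- []byte, msg []byte, d time.Duration) error {
	select {
	case outbound <- msg:
		return nil
	case <-time.After(d):
		return errTimeout
	}
}

// waitForBundleReply covers blocking reason 2: OVS restarted after the
// bundle was sent, so the BundleControl reply never arrives.
func waitForBundleReply(reply <-chan struct{}, d time.Duration) error {
	select {
	case <-reply:
		return nil
	case <-time.After(d):
		return errTimeout
	}
}

func main() {
	// No consumer on the outbound channel: simulates a dead connection.
	outbound := make(chan []byte)
	fmt.Println("send:", sendWithTimeout(outbound, []byte("flow-mod"), 50*time.Millisecond))

	// No reply is ever delivered: simulates OVS restarting mid-bundle.
	reply := make(chan struct{})
	fmt.Println("bundle:", waitForBundleReply(reply, 50*time.Millisecond))
}
```

With timeouts like these, the caller gets an error, releases `replayMutex`, and the flow replay can proceed after reconnection instead of deadlocking.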
**Describe the bug**

Below are a few antrea-agent pods on which flows are missing.

**To Reproduce**

**Expected**

**Actual behavior**

After a certain number of iterations, all OpenFlow flows are missing from a few antrea-agents, causing a 100% traffic drop for the pods hosted on these worker nodes.

**Versions:**
`antctl version`:

```
{ "agentVersion": "v0.4.0", "antctlVersion": "v0.4.0" }
```
`kubectl version`:

```
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:36:53Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
```
`modinfo openvswitch`:

```
filename:       /lib/modules/4.4.0-131-generic/kernel/net/openvswitch/openvswitch.ko
license:        GPL
description:    Open vSwitch switching datapath
srcversion:     B5716AFB39C1B8C5FBC323B
depends:        nf_conntrack,libcrc32c,nf_defrag_ipv6
retpoline:      Y
intree:         Y
vermagic:       4.4.0-131-generic SMP mod_unload modversions retpoline
```
**Additional context**