antrea-io / antrea

Kubernetes networking based on Open vSwitch
https://antrea.io
Apache License 2.0

Traceflow fails with "bundle reply is timeout" #937

Closed abhiraut closed 4 years ago

abhiraut commented 4 years ago

Describe the bug: Start a trace between two Pods using the Traceflow CRD. The Traceflow status fails with the following error:

status:
  dataplaneTag: 1
  phase: Failed
  reason: 'Node: tf1, error: bundle reply is timeout'

To Reproduce: The following YAML was used for the Traceflow:

apiVersion: ops.antrea.tanzu.vmware.com/v1alpha1
kind: Traceflow
metadata:
  name: tf1
spec:
  source:
    pod: appdns
    namespace: default
  destination:
    pod: appserver
    namespace: default
  packet:
    transportHeader:
      tcp:
        dstPort: 80
    ipHeader:
      protocol: 6

Expected: The trace should succeed.

Actual behavior: The trace failed with "bundle reply is timeout".

Versions: Please provide the following information:

modinfo openvswitch
filename:       /lib/modules/4.15.0-88-generic/kernel/net/openvswitch/openvswitch.ko
alias:          net-pf-16-proto-16-family-ovs_meter
alias:          net-pf-16-proto-16-family-ovs_packet
alias:          net-pf-16-proto-16-family-ovs_flow
alias:          net-pf-16-proto-16-family-ovs_vport
alias:          net-pf-16-proto-16-family-ovs_datapath
license:        GPL
description:    Open vSwitch switching datapath
srcversion:     304614E578D29023BC545F1
depends:        nf_conntrack,nf_nat,libcrc32c,nf_nat_ipv6,nf_nat_ipv4,nf_defrag_ipv6,nsh
retpoline:      Y
intree:         Y
name:           openvswitch
vermagic:       4.15.0-88-generic SMP mod_unload 
signat:         PKCS#7
signer:         
sig_key:        
sig_hashalgo:   md4

abhiraut commented 4 years ago

/cc @tnqn

gran-vmv commented 4 years ago

@abhiraut Could you share environment info with @wenyingd and me? We have hit this issue before, but cannot analyze the root cause without access to OVS.

gran-vmv commented 4 years ago

@abhiraut Wenying and I found the root cause. I'll improve the reliability of the code snippet below, from pkg/agent/openflow/packetin.go:

    wait.PollUntil(time.Second, func() (done bool, err error) {
        pktIn := <-ch
        for name, handler := range c.packetInHandlers {
            err = handler.HandlePacketIn(pktIn)
            if err != nil {
                klog.Errorf("PacketIn handler %s failed to process packet: %+v", name, err)
            }
        }
        return false, err
    }, stopCh)

You won't get this error once the review comment at https://github.com/vmware-tanzu/antrea/pull/918#pullrequestreview-446165465 is addressed. If you do hit this error again, please work around it by changing return false, err to return false, nil
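For illustration, the idea behind the workaround can be sketched with the standard library only. This is not the Antrea code: drain and failOn are hypothetical names, and a plain range loop stands in for the wait.PollUntil call above. The point is just that a handler error is logged and counted, and the loop keeps reading from the channel.

```go
package main

import (
	"errors"
	"fmt"
)

// failOn returns a hypothetical handler that fails only for the given message.
func failOn(bad string) func(string) error {
	return func(msg string) error {
		if msg == bad {
			return errors.New("cannot process " + msg)
		}
		return nil
	}
}

// drain runs every handler on every message. A handler error is logged and
// counted, but it never stops the loop, so the channel always has a consumer
// until it is closed.
func drain(ch <-chan string, handlers map[string]func(string) error) (processed, failures int) {
	for msg := range ch {
		for name, handle := range handlers {
			if err := handle(msg); err != nil {
				failures++
				fmt.Printf("PacketIn handler %s failed to process packet: %v\n", name, err)
			}
		}
		processed++
	}
	return processed, failures
}

func main() {
	ch := make(chan string, 3)
	ch <- "a"
	ch <- "bad"
	ch <- "c"
	close(ch)

	handlers := map[string]func(string) error{"traceflow": failOn("bad")}
	processed, failures := drain(ch, handlers)
	fmt.Printf("processed=%d failures=%d\n", processed, failures)
}
```

In this sketch the message after the failing one is still consumed, which is the behavior the return false, nil change is meant to restore.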

wenyingd commented 4 years ago

The root cause is that an error occurred in a PacketIn handler, which caused the consumer to exit its loop. There is a channel between the PacketIn handler and ofnet, and ofnet blocks while sending a new PacketIn message into that channel because no consumer is left on the other side. As a result, ofnet cannot handle the next inbound message. Its outbound channel still works, so we can keep sending Bundle control messages to OVS, but ofnet never receives the replies to them, and Antrea reports the timeout error.
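A minimal sketch of that blocking behavior, using only the standard library: consume and producerBlocks are hypothetical names, and the channel is assumed to be unbuffered for simplicity (the real channel's capacity is not stated here). Once the consumer returns on the first handler error, the producer's next send never completes.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// consume mimics a loop that stops at the first handler error and returns
// how many messages it read before exiting.
func consume(ch <-chan int, handle func(int) error) int {
	n := 0
	for msg := range ch {
		n++
		if err := handle(msg); err != nil {
			return n // consumer is gone; nobody reads ch afterwards
		}
	}
	return n
}

// producerBlocks reports whether a producer sending `total` messages is still
// stuck shortly after the consumer has returned. failAt is the message that
// makes the handler fail; 0 means the handler never fails.
func producerBlocks(total, failAt int) bool {
	ch := make(chan int) // assumption: unbuffered channel
	sent := make(chan struct{})

	go func() {
		for i := 1; i <= total; i++ {
			ch <- i // blocks forever once the consumer has exited
		}
		close(ch)
		close(sent)
	}()

	consume(ch, func(msg int) error {
		if msg == failAt {
			return errors.New("handler failed")
		}
		return nil
	})

	select {
	case <-sent:
		return false // producer delivered everything
	case <-time.After(200 * time.Millisecond):
		return true // producer is still blocked on a send
	}
}

func main() {
	fmt.Println("handler fails on message 2, producer blocked:", producerBlocks(5, 2))
	fmt.Println("handler never fails, producer blocked:", producerBlocks(5, 0))
}
```

The blocked producer here plays the role of ofnet's inbound path: while it is stuck on the send, it cannot move on to later inbound messages such as the Bundle reply.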