Traceflow failed when using tcp protocol

gujun4990 commented 1 year ago

Describe the bug: I have two Pods, antrea-octant and nginx, and I would like to run a Traceflow from antrea-octant to nginx using the TCP protocol.

root@slave1:antrea# kubectl get nodes
NAME     STATUS   ROLES           AGE    VERSION
master   Ready    control-plane   4d1h   v1.27.1
slave1   Ready    <none>          4d1h   v1.27.1

root@slave1:antrea# kubectl get po -A -owide|grep -Ei "nginx|octant"
kube-system   antrea-octant-d446dfb7f-sxs69      1/1     Running   0               58m    10.244.0.9       master   <none>           <none>
ns-test       nginx-deployment-f6dc544c7-szxv7   1/1     Running   0               29m    10.244.1.30      slave1   <none>           <none>

To Reproduce

create traceflow crd yaml

apiVersion: crd.antrea.io/v1alpha1
kind: Traceflow
metadata:
  name: tcp-tf1
spec:
  #liveTraffic: true
  destination:
    namespace: ns-test
    pod: nginx-deployment-f6dc544c7-szxv7
  source:
    namespace: kube-system
    pod: antrea-octant-d446dfb7f-sxs69
  # destination can also be an IP address ('ip' field) or a Service name ('service' field); the 3 choices are mutually exclusive.
  packet:
    ipHeader: # If ipHeader/ipv6Header is not set, the default value is IPv4+ICMP.
      protocol: 6 # Protocol here can be 6 (TCP), 17 (UDP) or 1 (ICMP), default value is 1 (ICMP)
    transportHeader:
      tcp:
        srcPort: 50012 # Source port needs to be set when Protocol is TCP/UDP.
        dstPort: 80 # Destination port needs to be set when Protocol is TCP/UDP.

Check the result


```
root@master:antrea_yaml# kubectl get traceflow  -oyaml
apiVersion: v1
items:
- apiVersion: crd.antrea.io/v1alpha1
  kind: Traceflow
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"crd.antrea.io/v1alpha1","kind":"Traceflow","metadata":{"annotations":{},"name":"tcp-tf1"},"spec":{"destination":{"namespace":"ns-test","pod":"nginx-deployment-f6dc544c7-szxv7"},"packet":{"ipHeader":{"protocol":6},"transportHeader":{"tcp":{"dstPort":80,"srcPort":50012}}},"source":{"namespace":"kube-system","pod":"antrea-octant-d446dfb7f-sxs69"}}}
    creationTimestamp: "2023-04-25T09:25:25Z"
    generation: 1
    name: tcp-tf1
    resourceVersion: "510069"
    uid: b2968d1a-bc83-4c65-9be3-1d8488ee5ebc
  spec:
    destination:
      namespace: ns-test
      pod: nginx-deployment-f6dc544c7-szxv7
    packet:
      ipHeader:
        protocol: 6
      transportHeader:
        tcp:
          dstPort: 80
          srcPort: 50012
    source:
      namespace: kube-system
      pod: antrea-octant-d446dfb7f-sxs69
  status:
    phase: Failed
    reason: Traceflow timeout
    startTime: "2023-04-25T09:25:25Z"
kind: List
metadata:
  resourceVersion: ""
```
The Traceflow fails.
Running the `antctl` command gives the same result.
```
root@master:/# antctl traceflow -S kube-system/antrea-octant-d446dfb7f-sxs69 -D ns-test/nginx-deployment-f6dc544c7-szxv7 -f tcp,tcp_dst=80
name: kube-system-antrea-octant-d446dfb7f-sxs69-to-ns-test-nginx-deployment-f6dc544c7-szxv7-7c6lfq22
phase: Failed
reason: Traceflow timeout
source: kube-system/antrea-octant-d446dfb7f-sxs69
destination: ns-test/nginx-deployment-f6dc544c7-szxv7
```

Expected: Traceflow succeeds when using the TCP protocol from one Pod to another.

Actual behavior: Traceflow fails.

Versions:

kubernetes version

root@master:openvswitch# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.1", GitCommit:"4c9411232e10168d7b050c49a1b59f6df9d7ea4b", GitTreeState:"clean", BuildDate:"2023-04-14T13:21:19Z", GoVersion:"go1.20.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.1", GitCommit:"4c9411232e10168d7b050c49a1b59f6df9d7ea4b", GitTreeState:"clean", BuildDate:"2023-04-14T13:14:42Z", GoVersion:"go1.20.3", Compiler:"gc", Platform:"linux/amd64"}

antrea version

root@master:~# kubectl exec -itn kube-system   antrea-agent-4p2pq    bash -- antctl version
agentVersion: 1.12.0-dev-1a4e331.clean
antctlVersion: v1.12.0-dev-1a4e331

openvswitch version

root@master:~# kubectl exec -itn kube-system   antrea-agent-4p2pq -c antrea-ovs    -- ovs-vsctl --version
ovs-vsctl (Open vSwitch) 2.17.5
DB Schema 8.3.0

openvswitch debug information:

create openflow

2023-04-25T09:25:25.174Z|745011|vconn|DBG|unix#0: received: OFPT_BUNDLE_CONTROL (OF1.5) (xid=0x12249):
bundle_id=0xf7 type=OPEN_REQUEST flags=atomic
2023-04-25T09:25:25.174Z|745012|vconn|DBG|unix#0: sent (Success): OFPT_BUNDLE_CONTROL (OF1.5) (xid=0x12249):
bundle_id=0xf7 type=OPEN_REPLY flags=0
2023-04-25T09:25:25.174Z|745013|poll_loop|DBG|wakeup due to [POLLIN] on fd 53 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (0% CPU usage)
2023-04-25T09:25:25.175Z|745014|vconn|DBG|unix#0: received: OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x1224c):
bundle_id=0xf7 flags=atomic
OFPT_FLOW_MOD (OF1.5) (xid=0x1224c): ADD table:7 priority=191,ip,nw_tos=28 cookie:0x2070000000000 hard:20 actions=goto_table:8
2023-04-25T09:25:25.175Z|745015|vconn|DBG|unix#0: received: OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x1224e):
bundle_id=0xf7 flags=atomic
OFPT_FLOW_MOD (OF1.5) (xid=0x1224e): ADD table:7 priority=192,ct_state=+rpl+trk,ip,nw_tos=28 cookie:0x2070000000000 hard:20 actions=drop
2023-04-25T09:25:25.175Z|745016|vconn|DBG|unix#0: received: OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x12250):
bundle_id=0xf7 flags=atomic
OFPT_FLOW_MOD (OF1.5) (xid=0x12250): ADD table:28 priority=203,ip,reg0=0x100/0x100,reg1=0x1,nw_tos=28 cookie:0x2070000000000 hard:20 actions=output:NXM_NX_REG1[],meter:2,controller(max_len=128,id=8339)
2023-04-25T09:25:25.175Z|745017|vconn|DBG|unix#0: received: OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x12252):
bundle_id=0xf7 flags=atomic
OFPT_FLOW_MOD (OF1.5) (xid=0x12252): ADD table:28 priority=202,ip,reg0=0x100/0x100,reg1=0x2,nw_tos=28 cookie:0x2070000000000 hard:20 actions=meter:2,controller(max_len=128,id=8339)
2023-04-25T09:25:25.175Z|745018|vconn|DBG|unix#0: received: OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x12254):
bundle_id=0xf7 flags=atomic
OFPT_FLOW_MOD (OF1.5) (xid=0x12254): ADD table:28 priority=203,ip,reg0=0x100/0x100,reg1=0x2,nw_dst=10.244.0.1,nw_tos=28 cookie:0x2070000000000 hard:20 actions=meter:2,controller(max_len=128,id=8339)
2023-04-25T09:25:25.175Z|745019|vconn|DBG|unix#0: received: OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x12256):
bundle_id=0xf7 flags=atomic
OFPT_FLOW_MOD (OF1.5) (xid=0x12256): ADD table:28 priority=202,ip,reg0=0x100/0x100,nw_tos=28 cookie:0x2070000000000 hard:20 actions=meter:2,controller(max_len=128,id=8339)
2023-04-25T09:25:25.175Z|745020|vconn|DBG|unix#0: received: OFPT_BUNDLE_ADD_MESSAGE (OF1.5) (xid=0x12258):
bundle_id=0xf7 flags=atomic
OFPT_FLOW_MOD (OF1.5) (xid=0x12258): ADD table:28 priority=212,ct_mark=0x40/0x40,ip,nw_tos=28 cookie:0x2070000000000 hard:20 actions=meter:2,controller(max_len=128,id=8339)
2023-04-25T09:25:25.175Z|745021|vconn|DBG|unix#0: received: OFPT_BUNDLE_CONTROL (OF1.5) (xid=0x12259):
bundle_id=0xf7 type=CLOSE_REQUEST flags=atomic
2023-04-25T09:25:25.175Z|745022|vconn|DBG|unix#0: sent (Success): OFPT_BUNDLE_CONTROL (OF1.5) (xid=0x12259):
bundle_id=0xf7 type=CLOSE_REPLY flags=0
2023-04-25T09:25:25.175Z|745023|poll_loop|DBG|wakeup due to [POLLIN] on fd 53 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (0% CPU usage)
2023-04-25T09:25:25.175Z|745024|vconn|DBG|unix#0: received: OFPT_BUNDLE_CONTROL (OF1.5) (xid=0x1225b):
bundle_id=0xf7 type=COMMIT_REQUEST flags=atomic
2023-04-25T09:25:25.175Z|745025|vconn|DBG|unix#0: sent (Success): OFPT_BUNDLE_CONTROL (OF1.5) (xid=0x1225b):
bundle_id=0xf7 type=COMMIT_REPLY flags=0

packet_out, miss upcall

2023-04-25T09:25:27.176Z|745040|vconn|DBG|unix#0: received: OFPT_PACKET_OUT (OF1.5) (xid=0x1225d): in_port=4 actions=set_field:28/0xfc->nw_tos,TABLE data_len=54
tcp,vlan_tci=0x0000,dl_src=b2:1c:bd:da:c4:15,dl_dst=f6:29:3c:59:1d:6f,nw_src=10.244.0.9,nw_dst=10.244.1.30,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=50012,tp_dst=80,tcp_flags=0 tcp_csum:a691
2023-04-25T09:25:27.177Z|745041|dpif|DBG|Dropped 5 log messages in last 0 seconds (most recently, 0 seconds ago) due to excessive rate
2023-04-25T09:25:27.177Z|13211|poll_loop(handler2)|DBG|wakeup due to [POLLIN] on fd 28 (unknown anon_inode:[eventpoll]) at lib/dpif-netlink.c:3183
2023-04-25T09:25:27.177Z|11008|poll_loop(handler6)|DBG|wakeup due to [POLLIN] on fd 31 (unknown anon_inode:[eventpoll]) at lib/dpif-netlink.c:3183
2023-04-25T09:25:27.177Z|11567|poll_loop(handler4)|DBG|wakeup due to [POLLIN] on fd 30 (unknown anon_inode:[eventpoll]) at lib/dpif-netlink.c:3183 (0% CPU usage)
2023-04-25T09:25:27.177Z|14340|poll_loop(handler1)|DBG|wakeup due to [POLLIN] on fd 27 (unknown anon_inode:[eventpoll]) at lib/dpif-netlink.c:3183
2023-04-25T09:25:27.177Z|12390|poll_loop(handler3)|DBG|wakeup due to [POLLIN] on fd 29 (unknown anon_inode:[eventpoll]) at lib/dpif-netlink.c:3183
2023-04-25T09:25:27.177Z|745042|dpif|DBG|system@ovs-system: execute set(ipv4(tos=0x1c/0xfc)),ct(zone=65520,nat),recirc(0x8b) on packet tcp,vlan_tci=0x0000,dl_src=b2:1c:bd:da:c4:15,dl_dst=f6:29:3c:59:1d:6f,nw_src=10.244.0.9,nw_dst=10.244.1.30,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=50012,tp_dst=80,tcp_flags=0 tcp_csum:a691
with metadata skb_priority(0),skb_mark(0),in_port(4) mtu 0
2023-04-25T09:25:27.177Z|13212|dpif(handler2)|DBG|system@ovs-system: miss upcall:
recirc_id(0x8b),dp_hash(0),skb_priority(0),in_port(4),skb_mark(0),ct_state(0x30),ct_zone(0xfff0),ct_mark(0),ct_label(0),eth(src=b2:1c:bd:da:c4:15,dst=f6:29:3c:59:1d:6f),eth_type(0x0800),ipv4(src=10.244.0.9,dst=10.244.1.30,proto=6,tos=0x1c,ttl=64,frag=no),tcp(src=50012,dst=80),tcp_flags(0)
tcp,vlan_tci=0x0000,dl_src=b2:1c:bd:da:c4:15,dl_dst=f6:29:3c:59:1d:6f,nw_src=10.244.0.9,nw_dst=10.244.1.30,nw_tos=28,nw_ecn=0,nw_ttl=64,nw_frag=no,tp_src=50012,tp_dst=80,tcp_flags=0 tcp_csum:a691
2023-04-25T09:25:27.177Z|13213|ofproto_dpif_upcall(handler2)|DBG|upcall: no reference for recirc flow

I found no packet_in message from Open vSwitch to the antrea-agent controller.

tnqn commented 1 year ago

@gujun4990 thanks for the report. It's probably because the TCP flags were not set, causing the packet to be dropped by the connection state check (a TCP packet with none of the SYN, ACK, ... flags set is considered invalid). You may add "tcp_flags=2" to construct a SYN packet.
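The diagnosis above can be checked against the debug log in the report. The following is a minimal POSIX shell sketch, not part of Antrea or Open vSwitch: `decode_bits`, `tcp_flag_names` and `ct_state_names` are hypothetical helper names. The TCP flag bits follow the standard TCP header layout; the ct_state bit values are an assumption based on the Open vSwitch datapath definitions (new=0x01, est=0x02, rel=0x04, rpl=0x08, inv=0x10, trk=0x20, snat=0x40, dnat=0x80).

```shell
# decode_bits VALUE BIT:NAME... prints the names of the bits set in VALUE,
# or "none" when no listed bit is set.
decode_bits() {
  value=$(( $1 ))
  shift
  names=""
  for pair in "$@"; do
    bit=${pair%%:*}      # part before the colon: the bit value
    label=${pair#*:}     # part after the colon: the flag name
    if [ $(( value & bit )) -ne 0 ]; then
      names="${names:+$names,}$label"
    fi
  done
  echo "${names:-none}"
}

# TCP header flag bits (low six bits of the flags field).
tcp_flag_names() {
  decode_bits "$1" 1:FIN 2:SYN 4:RST 8:PSH 16:ACK 32:URG
}

# Conntrack state bits as reported by the datapath (assumed values, see above).
ct_state_names() {
  decode_bits "$1" 1:new 2:est 4:rel 8:rpl 16:inv 32:trk 64:snat 128:dnat
}

tcp_flag_names 0      # injected packet in the log (tcp_flags=0): prints "none"
tcp_flag_names 2      # suggested value: prints "SYN"
ct_state_names 0x30   # ct_state from the miss upcall: prints "inv,trk"
```

Under these assumptions, `ct_state(0x30)` in the miss upcall reads as tracked and invalid, which is consistent with the flagless packet being rejected by the connection state check.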

That being said, I'm not sure why the example in antctl traceflow doesn't set tcp_flags. Perhaps default flags were once set on the server side but removed some time ago, or the example never worked. Regardless, this is indeed an issue that should be fixed, as the other two protocols, UDP and ICMP, handle the defaulting correctly.
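For reference, the command from the report with the suggested flag added might look like the sketch below. It only builds and prints the command line, so it runs without a cluster; `build_flow` is a hypothetical helper, and the Pod names are the ones from the report.

```shell
SRC="kube-system/antrea-octant-d446dfb7f-sxs69"
DST="ns-test/nginx-deployment-f6dc544c7-szxv7"

# build_flow PORT FLAGS prints a flow filter for the -f option of
# "antctl traceflow", using the same fields as the report plus tcp_flags.
build_flow() {
  printf 'tcp,tcp_dst=%s,tcp_flags=%s' "$1" "$2"
}

FLOW=$(build_flow 80 2)   # 2 = SYN
CMD="antctl traceflow -S $SRC -D $DST -f $FLOW"

# Print the command; run it wherever antctl can reach the cluster.
echo "$CMD"
```

The only difference from the failing invocation is the trailing `tcp_flags=2` in the flow filter.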

gujun4990 commented 1 year ago

Thanks for your reply. I added the "tcp_flags=2" option and the Traceflow succeeded. So this isn't a bug, but the Traceflow documentation could perhaps be improved. BTW, I only see a request packet from the source to the destination, not a reply packet. If the reply packet were dropped, connections between the Pods should fail, yet the Traceflow succeeds. So I wonder whether the OpenFlow tables do something that I'm missing.
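The same fix can presumably be expressed in the Traceflow manifest from the report. The sketch below writes a variant of that manifest to a local file. It assumes the TCP header in the Traceflow CRD accepts a `flags` field, which is not shown anywhere in this thread; the resource name `tcp-tf1-syn` and the file name are hypothetical.

```shell
# Write a variant of the manifest from the report with the SYN flag set.
cat > tcp-tf1-syn.yaml <<'EOF'
apiVersion: crd.antrea.io/v1alpha1
kind: Traceflow
metadata:
  name: tcp-tf1-syn   # hypothetical name
spec:
  source:
    namespace: kube-system
    pod: antrea-octant-d446dfb7f-sxs69
  destination:
    namespace: ns-test
    pod: nginx-deployment-f6dc544c7-szxv7
  packet:
    ipHeader:
      protocol: 6
    transportHeader:
      tcp:
        srcPort: 50012
        dstPort: 80
        flags: 2   # SYN; field name is an assumption, verify against the CRD
EOF

# With cluster access, apply it and read back the phase:
#   kubectl apply -f tcp-tf1-syn.yaml
#   kubectl get traceflow tcp-tf1-syn -o jsonpath='{.status.phase}'
```

If the field is rejected by the API server, the antctl flow filter with `tcp_flags=2` remains the form confirmed in this thread.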

tnqn commented 1 year ago

> Thanks for your reply. I added the "tcp_flags=2" option and the Traceflow succeeded. So this isn't a bug, but the Traceflow documentation could perhaps be improved. BTW, I only see a request packet from the source to the destination, not a reply packet. If the reply packet were dropped, connections between the Pods should fail, yet the Traceflow succeeds. So I wonder whether the OpenFlow tables do something that I'm missing.

The injected Traceflow packets are intentionally discarded (even if the trace result is ALLOW) before being forwarded to the Pod interface, to avoid affecting applications. That is why no reply packet is seen.

antrea-io / antrea

Traceflow failed when using tcp protocol #4897