antrea-io / antrea

Kubernetes networking based on Open vSwitch
https://antrea.io
Apache License 2.0

Antrea does not run on Photon OS 3 #591

Closed antoninbas closed 4 years ago

antoninbas commented 4 years ago

Describe the bug

When creating a single-node cluster with kubeadm on a Photon OS VM, Pod networking does not work. For example, pinging the local gw0 from any Pod fails. The Antrea agent logs show the following:

time="2020-04-03T01:13:07Z" level=info msg="Openflow Connection for new switch: 00:00:0a:a0:6f:8d:a6:4c"
I0403 01:13:07.114199       1 ofctrl_bridge.go:178] OFSwitch is connected: 00:00:0a:a0:6f:8d:a6:4c
time="2020-04-03T01:13:07Z" level=error msg="Received bundle error msg: [4 4 0 120 0 0 0 51 79 78 70 0 0 0 8 253 0 0 0 4 0 0 0 1 4 14 0 96 0 0 0 51 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30 0 0 0 0 0 0 200 255 255 255 255 255 255 255 255 255 255 255 255 0 0 0 0 0 1 0 10 128 0 10 2 8 0 0 0 0 0 0 0 0 4 0 32 0 0 0 0 255 255 0 24 0 0 35 32 0 35 0 0 0 0 0 0 255 240 31 0 0 0 0 0]"
time="2020-04-03T01:13:07Z" level=error msg="Received bundle error msg: [4 4 0 128 0 0 0 59 79 78 70 0 0 0 8 253 0 0 0 4 0 0 0 1 4 14 0 104 0 0 0 59 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 105 0 0 0 0 0 0 190 255 255 255 255 255 255 255 255 255 255 255 255 0 0 0 0 0 1 0 22 128 0 10 2 8 0 0 1 211 8 0 0 0 33 0 0 0 33 0 0 0 4 0 32 0 0 0 0 255 255 0 24 0 0 35 32 0 35 0 1 0 0 0 0 255 240 110 0 0 0 0 0]"
time="2020-04-03T01:13:07Z" level=error msg="Received bundle error msg: [4 4 0 168 0 0 0 57 79 78 70 0 0 0 8 253 0 0 0 4 0 0 0 1 4 14 0 144 0 0 0 57 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 105 0 0 0 0 0 0 200 255 255 255 255 255 255 255 255 255 255 255 255 0 0 0 0 0 1 0 34 128 0 10 2 8 0 0 1 211 8 0 0 0 33 0 0 0 33 0 1 1 8 0 0 0 1 0 0 255 255 0 0 0 0 0 0 0 4 0 56 0 0 0 0 255 255 0 48 0 0 35 32 0 35 0 1 0 0 0 0 255 240 110 0 0 0 0 0 255 255 0 24 0 0 35 32 0 7 0 31 0 1 214 4 0 0 0 0 0 0 0 32]"

BTW, @wenyingd do you think these log messages can be displayed in a more user-friendly format :) ?
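As a rough idea of what a friendlier rendering could look like, here is a minimal Python sketch that decodes the leading fields of one of these byte lists. It assumes the bytes are an OpenFlow 1.3 experimenter message carrying an ONF bundle-add header that wraps the rejected flow-mod; that layout is inferred from the bytes themselves, not taken from the Antrea or ofnet code. The function names are hypothetical.

```python
import struct

# Assumed wire layout (inferred from the logged bytes, not confirmed by the
# source): OpenFlow 1.3 header, ONF experimenter "bundle add" header, then
# the flow-mod message that the switch rejected.
OF_HEADER = struct.Struct("!BBHI")         # version, type, length, xid
BUNDLE_ADD = struct.Struct("!IIIHH")       # experimenter, exp_type, bundle_id, pad, flags
FLOW_MOD_HEAD = struct.Struct("!QQBBHHH")  # cookie, cookie_mask, table, command, idle, hard, priority


def parse_logged_bytes(text):
    """Convert the '[4 4 0 120 ...]' text from the log line into bytes."""
    return bytes(int(tok) for tok in text.strip("[] \n").split())


def describe_bundle_error(raw):
    """Decode the outer header and the start of the embedded flow-mod."""
    version, msg_type, length, xid = OF_HEADER.unpack_from(raw, 0)
    experimenter, exp_type, bundle_id, _pad, _flags = BUNDLE_ADD.unpack_from(
        raw, OF_HEADER.size)
    inner = OF_HEADER.size + BUNDLE_ADD.size
    _ver, inner_type, _len, _xid = OF_HEADER.unpack_from(raw, inner)
    cookie, _mask, table, command, _idle, _hard, priority = (
        FLOW_MOD_HEAD.unpack_from(raw, inner + OF_HEADER.size))
    return {
        "version": version, "type": msg_type, "length": length, "xid": xid,
        "experimenter": experimenter, "exp_type": exp_type,
        "bundle_id": bundle_id, "inner_type": inner_type,
        "cookie": cookie, "table": table, "command": command,
        "priority": priority,
    }


# First 56 bytes of the first "Received bundle error msg" line above.
SAMPLE = ("[4 4 0 120 0 0 0 51 79 78 70 0 0 0 8 253 0 0 0 4 0 0 0 1 "
          "4 14 0 96 0 0 0 51 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 "
          "30 0 0 0 0 0 0 200]")

if __name__ == "__main__":
    info = describe_bundle_error(parse_logged_bytes(SAMPLE))
    print("bundle error: xid={xid} bundle_id={bundle_id} "
          "table={table} priority={priority} cookie={cookie:#x}".format(**info))
```

Under these assumptions, the first logged message decodes to a flow-mod for table 30 with priority 200.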

If I dump the flows, I can see that table 30 is empty, and this flow is therefore missing:

table=30, priority=200,ip actions=ct(table=31,zone=65520)

Trying to add the flow manually gives the following error:

root@photon-machine:/# ovs-ofctl add-flow br-int 'table=30,priority=200,ip,actions=ct(table=31,zone=65520)'
OFPT_ERROR (xid=0x8): NXBAC_CT_DATAPATH_SUPPORT
OFPT_FLOW_MOD (xid=0x8): ADD table:30 priority=200,ip actions=ct(table=31,zone=65520)

To Reproduce

Versions: Antrea: v0.5.1

root@photon-machine [ ~ ]# modinfo openvswitch
filename:       /lib/modules/4.19.15-1.ph3-esx/kernel/net/openvswitch/openvswitch.ko.xz
alias:          net-pf-16-proto-16-family-ovs_ct_limit
alias:          net-pf-16-proto-16-family-ovs_meter
alias:          net-pf-16-proto-16-family-ovs_packet
alias:          net-pf-16-proto-16-family-ovs_flow
alias:          net-pf-16-proto-16-family-ovs_vport
alias:          net-pf-16-proto-16-family-ovs_datapath
license:        GPL
description:    Open vSwitch switching datapath
depends:        nf_conntrack,nf_nat,nf_conncount,nf_nat_ipv6,nf_nat_ipv4,nf_defrag_ipv6,nsh
intree:         Y
name:           openvswitch
vermagic:       4.19.15-1.ph3-esx SMP mod_unload

antoninbas commented 4 years ago

@jianjuns @abhiraut FYI

antoninbas commented 4 years ago

@wenyingd let me know if you need more information. I know we don't explicitly document that we support Photon OS, but the kernel looks recent to me and so I'm surprised that we see this error. If you want me to try to install something on my Photon OS VM, please let me know. Unfortunately I cannot give you SSH access since the VM is running locally on my laptop...

wenyingd commented 4 years ago

@antoninbas It looks like Photon doesn't support the "ct" feature in OVS. Could you check the OVS kernel module version on the test VM? As far as I remember, the OVS kernel module version should be higher than 2.6.

tnqn commented 4 years ago

I remember @edwardbadboy found an issue where Photon OS didn't compile multiple conntrack zone support by default. This looks similar.

antoninbas commented 4 years ago

I just saw this: https://github.com/vmware/photon/blob/master/SPECS/linux/linux-esx.spec#L322

Maybe a slightly more recent version of Photon OS will work?

tnqn commented 4 years ago

Yes, I guess so.

edwardbadboy commented 4 years ago

Hi Antonin,

Would you check the following command output?

grep CONFIG_NF_CONNTRACK_ZONES /boot/config-$(uname -r)

See if it's CONFIG_NF_CONNTRACK_ZONES=y. If not, it's the cause of the failure.
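The same check can be scripted. This is a minimal Python sketch, assuming the kernel config is available at /boot/config-<release> as in the grep command above; the helper names are hypothetical. An option that is disabled appears in the config as a comment (`# CONFIG_... is not set`), so only an uncommented `=y` line counts as enabled.

```python
import os
import re


def kernel_option(config_text, name):
    """Return the value assigned to `name` in kernel config text, or None."""
    pattern = re.compile(r"^%s=(.*)$" % re.escape(name), re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else None


def conntrack_zones_enabled(config_text):
    """True only if CONFIG_NF_CONNTRACK_ZONES is built in (=y)."""
    return kernel_option(config_text, "CONFIG_NF_CONNTRACK_ZONES") == "y"


if __name__ == "__main__":
    # Assumed location, mirroring /boot/config-$(uname -r).
    path = "/boot/config-" + os.uname().release
    try:
        with open(path) as f:
            text = f.read()
    except OSError as err:
        print("could not read %s: %s" % (path, err))
    else:
        state = "enabled" if conntrack_zones_enabled(text) else "NOT enabled"
        print("CONFIG_NF_CONNTRACK_ZONES is " + state)
```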

It could be that conntrack zone support was not enabled when they compiled the kernel. Previously, when I tried Antrea on Photon OS, I recompiled the Photon kernel with that flag set to "y" ( https://github.com/edwardbadboy/photon/commit/a6c3c108302e1d682180c2499ccada14dd16e39f )

I thought last time Jianjun said Photon developers agreed to turn on the switch by default. Let me check if the upstream Photon has that change. If not, I can submit the pull request to Photon upstream.

edwardbadboy commented 4 years ago

> I just saw this: https://github.com/vmware/photon/blob/master/SPECS/linux/linux-esx.spec#L322
>
> Maybe a slightly more recent version of Photon OS will work?

Seems they already made the change. Let's use a more recent Photon OS version then.

antoninbas commented 4 years ago

I ran tdnf upgrade linux-esx and it fixed that specific issue. Pod networking is still not working, so I'm looking into it.

antoninbas commented 4 years ago

Alright this was a combination of multiple things, but I managed to make it work:

Maybe these things are worth documenting somewhere? @jianjuns

tnqn commented 4 years ago

Perhaps we could solve the 3rd with antrea-agent if it's common for other CNIs to add such rules for their traffic. Right now we only add rules to the FORWARD chain.

jianjuns commented 4 years ago

@antoninbas agreed we should document CONFIG_NF_CONNTRACK_ZONES and firewall rules. CONFIG_NF_CONNTRACK_ZONES is a known issue for Photon OS, and last time we pushed a change to enable it for the vSphere build.

abhiraut commented 4 years ago

> Perhaps we could solve the 3rd with antrea-agent if it's common for other CNIs to add such rules for their traffic. Right now we only add rules to the FORWARD chain.

Maybe check whether the INPUT chain policy is DROP, and only then apply the rule?
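A minimal sketch of that policy check, written in Python for illustration only (antrea-agent is written in Go, and this is not its code). It assumes the text comes from `iptables -S INPUT`, whose output includes a `-P INPUT <policy>` line; the function names are hypothetical.

```python
def chain_policy(rules_text, chain="INPUT"):
    """Return the default policy of `chain` from `iptables -S` output, or None."""
    for line in rules_text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0] == "-P" and fields[1] == chain:
            return fields[2]
    return None


def needs_input_rule(rules_text):
    """Only add an explicit ACCEPT rule when INPUT drops by default."""
    return chain_policy(rules_text, "INPUT") == "DROP"


if __name__ == "__main__":
    # Example text in the shape printed by `iptables -S INPUT` (illustrative).
    example = "-P INPUT DROP\n-A INPUT -i lo -j ACCEPT\n"
    print("add INPUT rule:", needs_input_rule(example))
```

Checking only the default policy keeps the agent from touching hosts whose INPUT chain already accepts traffic, but it would miss hosts that keep an ACCEPT policy and drop via an explicit trailing rule.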