Closed: Anandkumar26 closed this issue 10 months ago.
@wenyingd @ceclinux are you looking into this?
It was observed previously that the agent fails to connect to kube-apiserver because DNS is not working on the OVS internal interface after a restart. That is because the netplan configuration requires the DHCP client to run on the interface matching driver type "hv_netvsc" and MAC address "60:45:bd:04:30:0e", but the OVS internal interface does not match the configured driver type (see the illustrative config below).
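For illustration only, the stock cloud-init netplan on an Azure Ubuntu VM typically looks roughly like the following; the driver match is what the OVS internal interface cannot satisfy (the MAC value is just the one quoted above, and the exact file contents may differ per image):
network:
    version: 2
    ethernets:
        eth0:
            dhcp4: true
            match:
                driver: hv_netvsc
                macaddress: 60:45:bd:04:30:0e
            set-name: eth0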
Today I got a new observation: VM connectivity may be lost after an agent restart. Via the web console, I found that packets entering the VM from the uplink are not forwarded to the internal interface correctly, even after we manually added OpenFlow entries. The setup runs the OVS userspace processes inside containers, and a restart of the antrea-agent service also restarts the antrea-ovs container, so the flows added in the last round are lost; instead, a NORMAL flow is installed by default by OVS userspace. The strange finding is that no packets hit the OpenFlow entries. Personally, I suspect this connectivity loss is related to running the OVS processes inside containers.
A workaround I have in mind is to restore the configuration when the agent is stopping: remove the uplink from OVS, delete the internal interface, rename the uplink back, and configure the IP/routes back onto the uplink interface. In this way, everything is recovered while the agent is not running, and the agent will start on a fresh environment when it comes up again. This is similar to what the agent does in the container scenario with FlexibleIPAM enabled. A rough sketch of the steps is shown below.
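For illustration only, a minimal sketch of those cleanup steps as shell commands, assuming the uplink was renamed to eth0~, the OVS internal port took the original name eth0, the bridge is br-int, and the IP/route values are placeholders for whatever the agent recorded:
ovs-vsctl del-port br-int eth0                  # delete the OVS internal interface first
ovs-vsctl del-port br-int eth0~                 # detach the uplink from OVS
ip link set eth0~ down
ip link set eth0~ name eth0                     # rename the uplink back to its original name
ip link set eth0 up
ip addr add 10.0.0.4/24 dev eth0                # restore the original IP (placeholder value)
ip route replace default via 10.0.0.1 dev eth0  # restore the default route (placeholder value)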
@Anandkumar26 @reachjainrahul @tnqn @antoninbas Any ideas about this?
Another tested solution is modifying the netplan file for the candidate OVS interface under /etc/netplan, e.g.
# cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        eth0:
            dhcp4: true
            dhcp4-overrides:
                route-metric: 100
            dhcp6: false
            match:
                macaddress: 60:45:bd:07:a1:c2
            set-name: eth0
    version: 2
Then we need to apply the netplan configuration and reload it with networkctl:
# netplan apply
# networkctl reload
After this, a new netplan udev rules file is added:
# cat /var/run/udev/rules.d/99-netplan-eth0.rules
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="60:45:bd:07:a1:c2", NAME="eth0"
Then we can start the antrea-agent service on this VM. In this way, networkctl will manage the OVS internal interface created by antrea-agent. In my test, it resolves the issue.
# networkctl list
IDX LINK TYPE OPERATIONAL SETUP
1 lo loopback carrier unmanaged
2 eth0~ ether carrier unmanaged
3 docker0 bridge no-carrier unmanaged
14 ovs-system ether off unmanaged
20 eth0 ether routable configured
root@ub2004:/home/nsxadmin# systemd-resolve --status
Global
LLMNR setting: no
MulticastDNS setting: no
DNSOverTLS setting: no
DNSSEC setting: no
DNSSEC supported: no
DNSSEC NTA: 10.in-addr.arpa
16.172.in-addr.arpa
168.192.in-addr.arpa
17.172.in-addr.arpa
18.172.in-addr.arpa
19.172.in-addr.arpa
20.172.in-addr.arpa
21.172.in-addr.arpa
22.172.in-addr.arpa
23.172.in-addr.arpa
24.172.in-addr.arpa
25.172.in-addr.arpa
26.172.in-addr.arpa
27.172.in-addr.arpa
28.172.in-addr.arpa
29.172.in-addr.arpa
30.172.in-addr.arpa
31.172.in-addr.arpa
corp
d.f.ip6.arpa
home
internal
intranet
lan
local
private
test
Link 20 (eth0)
Current Scopes: DNS
DefaultRoute setting: yes
LLMNR setting: yes
MulticastDNS setting: no
DNSOverTLS setting: no
DNSSEC setting: no
DNSSEC supported: no
Current DNS Server: 168.63.129.16
DNS Servers: 168.63.129.16
DNS Domain: rrdlljco2h1ejhrnvbxaefqbzc.dx.internal.cloudapp.net
Link 14 (ovs-system)
...
Link 3 (docker0)
...
Link 2 (eth0~)
Current Scopes: none
DefaultRoute setting: no
LLMNR setting: yes
MulticastDNS setting: no
DNSOverTLS setting: no
DNSSEC setting: no
DNSSEC supported: no
It sounds like the issue is caused specifically by the antrea-agent restarting?
A workaround I have in mind is to restore the configuration when the agent is stopping: remove the uplink from OVS, delete the internal interface, rename the uplink back, and configure the IP/routes back onto the uplink interface.
But we can't assume that the agent will always exit gracefully? It could be killed and not get a chance to do the clean-up?
It sounds like both solutions may not be mutually exclusive. We could implement agent cleanup, while at the same time providing netplan configuration(s) for specific cloud providers / distributions?
It sounds like the issue is caused specifically by the antrea-agent restarting?
I think the root cause is that the OVS internal interface created by antrea-agent is not managed by networkctl (because of the driver type limitation on Azure). Although the IP/routes are migrated statically, the DNS configuration (managed by networkctl) is lost. Since antrea-agent stays connected to the apiserver/antrea-controller at runtime, this does not block the agent while it is running, but after a restart the agent fails to resolve the domain name without the DNS configuration.
My intent in restoring the configuration after the agent is stopped is to ensure DNS works when the agent is restarted. So the best solution is to make networkctl able to manage the interface created by antrea-agent (one way to check the management state is shown below).
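For reference, a quick way to inspect whether systemd-networkd manages the interface (assuming it is still named eth0):
# networkctl status eth0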
Update for the part about providing a rule to match the openvswitch driver: we don't need to modify the existing netplan configuration. Instead, we can provide a systemd-networkd .network file and force networkctl to reload it, so that networkd can identify the eth0 created by antrea-agent, like this:
# cat /home/nsxadmin/11-antrea-eth0.network
[Match]
MACAddress=60:45:bd:07:a1:c2
Name=eth0
[Network]
DHCP=ipv4
LinkLocalAddressing=ipv6
[DHCP]
RouteMetric=100
UseMTU=true
# cp /home/nsxadmin/11-antrea-eth0.network /run/systemd/network/
# networkctl reload
We can provide such a template for Ubuntu when running on Azure, and the bootstrap script can replace the MAC address in the template with the real value from the VM, copy it to the correct path, and then force networkctl to reload the configuration (a rough sketch is shown below). In this way, the OVS-type interface can be identified and managed by networkd after antrea-agent renames the uplink and moves it to OVS.
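For illustration only, a minimal sketch of such a bootstrap step; the template path, file name, and interface name reuse the example above and are assumptions:
#!/bin/sh
# Fill the uplink's real MAC address into the .network template and let
# systemd-networkd pick it up. Paths and names are placeholders.
IFACE=eth0
MAC=$(cat /sys/class/net/${IFACE}/address)
sed "s/^MACAddress=.*/MACAddress=${MAC}/" /home/nsxadmin/11-antrea-eth0.network \
    > /run/systemd/network/11-antrea-eth0.network
networkctl reload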
We could implement agent cleanup, while at the same time providing netplan configuration(s) for specific cloud providers / distributions?
@antoninbas about the cleanup, I think you mean a separate script or binary outside of antrea-agent. The latest solution for the VM agent supports running OVS userspace inside containers, so there is a risk that the OVS commands cannot be accessed by the cleanup module if the OVS processes are not running. Would that introduce failures or leave unexpected legacy configurations on the VMs when the cleanup script runs?
@wenyingd I was just replying to your earlier comment and suggestion:
A workaround I have in mind is to restore the configuration when the agent is stopping: remove the uplink from OVS, delete the internal interface, rename the uplink back, and configure the IP/routes back onto the uplink interface. In this way, everything is recovered while the agent is not running, and the agent will start on a fresh environment when it comes up again. This is similar to what the agent does in the container scenario with FlexibleIPAM enabled.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days
Describe the bug: Upon restarting antrea-agent on an Ubuntu VM, antrea-agent is unable to receive any events on the ExternalNode.
Versions: Antrea v1.11.0
Additional context
Antrea-agent logs before restart
Antrea-agent logs after restart
OVS configuration after restart
Netplan configuration file