antrea-io / antrea

Kubernetes networking based on Open vSwitch
https://antrea.io
Apache License 2.0

On Azure Ubuntu vm, upon antrea-agent restart agent cannot reach the API server #5221

Closed Anandkumar26 closed 10 months ago

Anandkumar26 commented 1 year ago

Describe the bug

Upon restarting antrea-agent on an Ubuntu VM, antrea-agent is unable to receive any events on the ExternalNode.

To Reproduce

Expected

Actual behavior

Versions: Antrea version v1.11.0

Additional context

Antrea-agent logs before restart

...
I0707 13:50:04.162357       1 agent.go:1297] "Initializing VM config" ExternalNode="ub2004-2"
I0707 13:50:04.169972       1 discoverer.go:80] Starting ServiceCIDRDiscoverer
I0707 13:51:24.199358       1 agent.go:1312] "Finished VM config initialization" ExternalNode="ub2004-2"
....
I0707 13:51:25.520278       1 external_node_controller.go:261] "Adding ExternalNode" ExternalNode="vm-ns/ub2004-2"
....
I0707 13:52:38.598065       1 requestheader_controller.go:183] Shutting down RequestHeaderAuthRequestController
I0707 13:52:38.598164       1 configmap_cafile_content.go:223] "Shutting down controller" name="antrea-ca::kube-system::antrea-ca::ca.crt"

Antrea-agent logs after restart

I0707 13:52:49.335933       1 agent.go:98] Starting Antrea agent (version v1.12.0)
W0707 13:52:49.343389       1 env.go:88] Environment variable POD_NAMESPACE not found
W0707 13:52:49.343394       1 env.go:126] Failed to get Pod Namespace from environment. Using "kube-system" as the Antrea Service Namespace
I0707 13:52:49.343496       1 ovs_client.go:71] Connecting to OVSDB at address /var/run/openvswitch/db.sock
I0707 13:52:50.348420       1 ovs_client.go:90] Not connected yet, will try again in 2s
I0707 13:52:50.348867       1 agent.go:397] Setting up node network
I0707 13:52:50.348879       1 agent.go:1297] "Initializing VM config" ExternalNode="ub2004-2"
I0707 13:52:50.350312       1 discoverer.go:80] Starting ServiceCIDRDiscoverer

OVS configuration after restart

$docker ps
CONTAINER ID   IMAGE                          COMMAND                  CREATED         STATUS          PORTS     NAMES
1cfbaa93356f   antrea/antrea-ubuntu:v1.12.0   "antrea-agent --conf…"   5 minutes ago   Up 13 seconds             antrea-agent
5962b3040d38   antrea/antrea-ubuntu:v1.12.0   "start_ovs"              5 minutes ago   Up 13 seconds             antrea-ovs

$docker exec -it 1cfbaa93356f ovs-vsctl show
64ff2c6c-0204-4577-a86a-5a800d26ad81
    Bridge br-int
        datapath_type: system
        Port "eth0~"
            Interface "eth0~"
        Port eth0
            Interface eth0
                type: internal
    ovs_version: "2.17.6"

Netplan configuration file

cat 50-cloud-init.yaml 
network:
    ethernets:
        eth0:
            dhcp4: true
            dhcp4-overrides:
                route-metric: 100
            dhcp6: false
            match:
                driver: hv_netvsc
                macaddress: 60:45:bd:04:30:0e
            set-name: eth0
    version: 2
antoninbas commented 1 year ago

@wenyingd @ceclinux are you looking into this?

wenyingd commented 1 year ago

It was observed previously that the agent fails to connect to the kube-apiserver because DNS is not working on the OVS internal interface after a restart. This is because the netplan configuration requires the DHCP client to work on the interface with driver type "hv_netvsc" and MAC address "60:45:bd:04:30:0e", but the OVS internal interface does not match the configured driver type.

Today I made a new observation: VM connectivity may be lost after an agent restart. Via the web console, I found that packets entering the VM from the uplink are not forwarded to the internal interface correctly, even after we manually added OpenFlow entries. The setup runs the OVS userspace processes inside containers, and restarting the antrea-agent service also restarts the antrea-ovs container, so the flows added in the previous round are lost; instead, OVS userspace installs a default normal flow. The strange finding is that no packets hit the OpenFlow entries. Personally, I suspect this connectivity loss is related to running the OVS processes inside containers.

A workaround I have in mind is to restore the configuration when the agent is stopping: remove the uplink from OVS, delete the internal interface, rename the uplink back, and configure the IP/routes back on the uplink interface. In this way, everything is recovered while the agent is not running, and the agent starts in a fresh environment when it comes up again. This is similar to what the agent does in the container scenario with FlexibleIPAM enabled.
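A minimal sketch of such a cleanup sequence, using the interface names from the `ovs-vsctl show` output in this issue (`br-int`, internal interface `eth0`, renamed uplink `eth0~`); the `run`/`restore_uplink` helpers and the dry-run switch are illustrative, not part of Antrea:

```shell
#!/bin/sh
# Hypothetical cleanup run when antrea-agent stops: undo the uplink/internal
# interface swap so the VM network is restored while the agent is down.
# With DRY_RUN=1 the commands are only printed, not executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restore_uplink() {
    bridge="$1"      # OVS bridge, e.g. br-int
    internal="$2"    # OVS internal interface, e.g. eth0
    uplink="$3"      # renamed uplink, e.g. eth0~

    # Remove both ports from the bridge; deleting the internal port also
    # removes its netdev, freeing the original interface name.
    run ovs-vsctl --if-exists del-port "$bridge" "$internal"
    run ovs-vsctl --if-exists del-port "$bridge" "$uplink"

    # Rename the uplink back and bring it up; re-applying netplan then
    # restores IP, routes and DNS on the original interface name.
    run ip link set "$uplink" down
    run ip link set "$uplink" name "$internal"
    run ip link set "$internal" up
    run netplan apply
}

# Dry run: prints the ovs-vsctl / ip / netplan commands it would execute.
DRY_RUN=1 restore_uplink br-int eth0 'eth0~'
```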

@Anandkumar26 @reachjainrahul @tnqn @antoninbas Any ideas about this?

wenyingd commented 1 year ago

Another tested solution is to modify the netplan file for the candidate OVS interface under /etc/netplan, e.g.

# cat /etc/netplan/50-cloud-init.yaml 
# This file is generated from information provided by the datasource.  Changes
# to it will not persist across an instance reboot.  To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        eth0:
            dhcp4: true
            dhcp4-overrides:
                route-metric: 100
            dhcp6: false
            match:
                macaddress: 60:45:bd:07:a1:c2
            set-name: eth0
    version: 2

Then we need to apply the netplan configuration and reload it with networkctl:

# netplan apply
# networkctl reload

After this, a new netplan udev rules file is added:

# cat /var/run/udev/rules.d/99-netplan-eth0.rules 
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="60:45:bd:07:a1:c2", NAME="eth0"

Then we can start the antrea-agent service on this VM. In this way, networkctl manages the OVS internal interface created by antrea-agent. In my test, this resolves the issue.

# networkctl list
IDX LINK       TYPE     OPERATIONAL SETUP     
  1 lo         loopback carrier     unmanaged 
  2 eth0~      ether    carrier     unmanaged 
  3 docker0    bridge   no-carrier  unmanaged 
 14 ovs-system ether    off         unmanaged 
 20 eth0       ether    routable    configured

root@ub2004:/home/nsxadmin# systemd-resolve --status
Global
       LLMNR setting: no                  
MulticastDNS setting: no                  
  DNSOverTLS setting: no                  
      DNSSEC setting: no                  
    DNSSEC supported: no                  
          DNSSEC NTA: 10.in-addr.arpa     
                      16.172.in-addr.arpa 
                      168.192.in-addr.arpa
                      17.172.in-addr.arpa 
                      18.172.in-addr.arpa 
                      19.172.in-addr.arpa 
                      20.172.in-addr.arpa 
                      21.172.in-addr.arpa 
                      22.172.in-addr.arpa 
                      23.172.in-addr.arpa 
                      24.172.in-addr.arpa 
                      25.172.in-addr.arpa 
                      26.172.in-addr.arpa 
                      27.172.in-addr.arpa 
                      28.172.in-addr.arpa 
                      29.172.in-addr.arpa 
                      30.172.in-addr.arpa 
                      31.172.in-addr.arpa 
                      corp                
                      d.f.ip6.arpa        
                      home                
                      internal            
                      intranet            
                      lan                 
                      local               
                      private             
                      test                

Link 20 (eth0)
      Current Scopes: DNS                                                
DefaultRoute setting: yes                                                
       LLMNR setting: yes                                                
MulticastDNS setting: no                                                 
  DNSOverTLS setting: no                                                 
      DNSSEC setting: no                                                 
    DNSSEC supported: no                                                 
  Current DNS Server: 168.63.129.16                                      
         DNS Servers: 168.63.129.16                                      
          DNS Domain: rrdlljco2h1ejhrnvbxaefqbzc.dx.internal.cloudapp.net

Link 14 (ovs-system)
      ...

Link 3 (docker0)
      ...

Link 2 (eth0~)
      Current Scopes: none
DefaultRoute setting: no  
       LLMNR setting: yes 
MulticastDNS setting: no  
  DNSOverTLS setting: no  
      DNSSEC setting: no  
    DNSSEC supported: no 
antoninbas commented 1 year ago

It sounds like the issue is caused specifically by the antrea-agent restarting?

A workaround in my thought is to restore the configurations when agent is stopping, including removing the uplink from OVS, deleting internal interface, renaming the uplink back, and configuring IP/routes back to the uplink interface.

But we can't assume that the agent will always exit gracefully? It could be killed and not get a chance to do clean-up?

It sounds like the two solutions are not mutually exclusive. We could implement agent cleanup while also providing netplan configuration(s) for specific cloud providers / distributions?

wenyingd commented 1 year ago

It sounds like the issue is caused specifically by the antrea-agent restarting?

I think the root cause is that the OVS internal interface created by antrea-agent is not managed by networkctl (because of the driver type restriction on Azure). Although IP/routes are migrated statically, the DNS configuration (managed by networkctl) is lost. Since antrea-agent is already connected to the apiserver/antrea-controller at runtime, this does not block the agent's working processes. But after a restart, the agent fails to resolve the domain name without the DNS configuration.

My idea of restoring the configuration after the agent stops is to ensure DNS works when the agent is restarted. So the best solution is to make networkctl able to manage the interface created by antrea-agent.

wenyingd commented 1 year ago

An update on the part about providing a rule to match the openvswitch driver: we don't need to modify the existing netplan configuration. Instead, we can provide a systemd-networkd .network file and have networkctl reload it, so that networkd can identify the eth0 created by antrea-agent, like this:

# cat /home/nsxadmin/11-antrea-eth0.network 
[Match]
MACAddress=60:45:bd:07:a1:c2
Name=eth0

[Network]
DHCP=ipv4
LinkLocalAddressing=ipv6

[DHCP]
RouteMetric=100
UseMTU=true

# cp /home/nsxadmin/11-antrea-eth0.network /run/systemd/network/

# networkctl reload

We can provide such a template for Ubuntu when running on Azure, and the bootstrap script can replace the MAC address in the template with the VM's real value, copy it to the correct path, and then have networkctl reload the configuration. In this way, the OVS internal interface can be identified and managed by networkd after antrea-agent renames the uplink and moves it to OVS.
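The bootstrap step described above could be sketched as follows. The target path `/run/systemd/network/11-antrea-eth0.network` and the `networkctl reload` step come from this thread; the helper names, the template path, and the `__MAC__` placeholder convention are assumptions for illustration:

```shell
#!/bin/sh
# Hypothetical bootstrap helper: render the systemd-networkd template with
# the VM's real MAC address and install it, so networkd keeps managing the
# interface after antrea-agent recreates it as an OVS internal port.

render_network_file() {
    mac="$1"        # uplink MAC address, e.g. 60:45:bd:07:a1:c2
    template="$2"   # template file containing a __MAC__ placeholder
    # Substitute the placeholder; the MAC contains no '/' so sed is safe.
    sed "s/__MAC__/$mac/" "$template"
}

install_network_file() {
    iface="$1"
    # Read the interface's current MAC address from sysfs.
    mac=$(cat "/sys/class/net/$iface/address")
    render_network_file "$mac" /etc/antrea/11-antrea-eth0.network.tmpl \
        > /run/systemd/network/11-antrea-eth0.network
    # Ask networkd to pick up the new per-link configuration.
    networkctl reload
}
```

A template for `render_network_file` would mirror the `.network` file shown in the comment above, with `MACAddress=__MAC__` in its `[Match]` section.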

wenyingd commented 1 year ago

We could implement agent cleanup, while at the same time providing netplan configuration(s) for specific cloud providers / distributions?

@antoninbas about the cleanup, I think you mean a separate script or binary outside of antrea-agent. The latest VM agent solution supports running OVS userspace inside containers, so there is a risk that the OVS commands cannot be accessed by the cleanup module if the OVS processes are not running. Would that introduce failures or leave unexpected legacy configurations on the VMs with the cleanup script?

antoninbas commented 1 year ago

@wenyingd I was just replying to your earlier comment and suggestion:

A workaround I have in mind is to restore the configuration when the agent is stopping: remove the uplink from OVS, delete the internal interface, rename the uplink back, and configure the IP/routes back on the uplink interface. In this way, everything is recovered while the agent is not running, and the agent starts in a fresh environment when it comes up again. This is similar to what the agent does in the container scenario with FlexibleIPAM enabled.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days