Open meibensteiner opened 3 weeks ago
Deleting the antrea-agent pod makes the service accessible again for a while.
This is probably about external IP traffic forwarding from host interface to the primary OVS bridge when the interface is moved to the secondary bridge. @tnqn and @xliuxu should have some ideas.
This is probably because the ARP responder, which is in charge of responding to ARP requests for Service LoadBalancerIPs, is initialized before the secondary bridge is created. When the ARP responder is initialized, it "resolves" the transport interface by name (enp0s1), and uses the interface to listen for ARP requests and send ARP replies. The interface handle used for this is never updated after that. However, once the secondary bridge is created and the host interface (transport interface) is assigned to the bridge, the interface handle held by the ARP responder is no longer "valid" (it now points to enp0s1~ instead of the new enp0s1 host interface).
Because secondary bridge initialization can be delayed by up to ~10s after the agent starts (see https://github.com/antrea-io/antrea/pull/6504), and probably because we send a gratuitous ARP for the Service LoadBalancerIP when we first process it, you would probably have connectivity to the Service until the ARP entry expires in the client / router, which seems to be what you are observing.
If we want to support this case (ServiceExternalIP + transport interface assigned to SecondaryNetwork bridge), we can do one of the following:
- periodically resolve the interface by name in the ARP responder
- introduce a communication channel so that the ARP responder can resolve the interface by name again after the secondary bridge is initialized
- order things so that the ARP responder does not resolve the interface by name until the secondary bridge has been initialized
The first option may be the simplest and the least risky. It would also be a good way to confirm that my analysis is correct.
@meibensteiner If you can capture traffic, you can also confirm that ARP requests are not answered correctly, causing the connectivity issue.
Can confirm. Unanswered ARP requests.
10:57:01.819248 ARP, Request who-has 192.168.66.251 tell node2, length 28
10:57:02.842307 ARP, Request who-has 192.168.66.251 tell node2, length 28
10:57:06.365280 ARP, Request who-has 192.168.66.251 tell node2, length 28
10:57:06.465441 IP _gateway.57621 > 192.168.66.255.57621: UDP, length 44
10:57:07.383679 ARP, Request who-has 192.168.66.251 tell node2, length 28
10:57:08.408941 ARP, Request who-has 192.168.66.251 tell node2, length 28
10:57:18.832761 ARP, Request who-has 192.168.66.251 tell _gateway, length 28
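For longer captures, the same check can be scripted. The sketch below is only an illustration: `unansweredARP` is a hypothetical helper, and it assumes tcpdump's usual text output, where replies look like `ARP, Reply <ip> is-at <mac>`. It counts who-has requests per target IP and drops any IP that received at least one reply in the capture.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

var (
	// Patterns assume tcpdump's default text format for ARP packets.
	requestRe = regexp.MustCompile(`ARP, Request who-has (\S+)`)
	replyRe   = regexp.MustCompile(`ARP, Reply (\S+) is-at`)
)

// unansweredARP returns the number of who-has requests per target IP,
// restricted to IPs for which no Reply appears anywhere in the capture.
func unansweredARP(capture string) map[string]int {
	requests := map[string]int{}
	replied := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(capture))
	for sc.Scan() {
		line := sc.Text()
		if m := requestRe.FindStringSubmatch(line); m != nil {
			requests[m[1]]++
		} else if m := replyRe.FindStringSubmatch(line); m != nil {
			replied[m[1]] = true
		}
	}
	for ip := range replied {
		delete(requests, ip)
	}
	return requests
}

func main() {
	// The capture posted above, unchanged.
	capture := `10:57:01.819248 ARP, Request who-has 192.168.66.251 tell node2, length 28
10:57:02.842307 ARP, Request who-has 192.168.66.251 tell node2, length 28
10:57:06.365280 ARP, Request who-has 192.168.66.251 tell node2, length 28
10:57:06.465441 IP _gateway.57621 > 192.168.66.255.57621: UDP, length 44
10:57:07.383679 ARP, Request who-has 192.168.66.251 tell node2, length 28
10:57:08.408941 ARP, Request who-has 192.168.66.251 tell node2, length 28
10:57:18.832761 ARP, Request who-has 192.168.66.251 tell _gateway, length 28`
	fmt.Println(unansweredARP(capture)) // prints map[192.168.66.251:6]
}
```

A reply anywhere in the capture clears the IP, so this only flags addresses that were never answered; it does not pair individual requests with replies.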
Option 1 should be ok to resolve the issue as the transport interface will not be changed frequently. I can work on a fix for this issue.
Happy to test it if you provide me with an image! 😬😄
Will this make it into the 2.2 release?
@meibensteiner We should be able to ship the fix with 2.2. Btw, could you confirm whether manually restarting the agent works around the issue?
It fixes it only for a few seconds.
Ah, I see. This is expected because antrea-agent will revert the bridging of host interfaces upon exit. I am currently testing the fix and need more tests for the IPv6 implementation.
Describe the bug
Using both the ServiceExternalIP and SecondaryNetwork features with a single host interface breaks the ServiceExternalIP feature. Services of type LoadBalancer are no longer accessible.
To Reproduce
Helm chart values:
Expected
- The host interface should be moved to br-ext
- Service should still be accessible
Actual behavior
- The host interface is moved to br-ext
- Service is no longer accessible outside the cluster after interface was attached to br-ext and after waiting a few minutes
- Initially it weirdly works, but after a few minutes the service becomes inaccessible
It seems like the outage starts after those log lines. I have no way of correlating those logs to the problem though.
Versions:
Additional context
Since the node is accessible this time, I can actually create a supportbundle :)
support-bundles_20240821T105205+0200.zip