Nordix / Meridio

Facilitator of attraction and distribution of external traffic within Kubernetes via secondary networks
https://meridio.nordix.org
Apache License 2.0

TAPA unreachable for ~10 sec when its NSM interface is replaced #548

Open zolug opened 1 month ago

zolug commented 1 month ago

Describe the bug

When an NSM interface in a TAPA is replaced during NSM heal (the old connection is closed, and as part of that the old interface is removed), the new interface will most probably end up with a different MAC address. Yet the IP address(es) assigned by the proxy component will most likely stay the same.

During such an NSM heal event, the LBs are currently not informed about the temporary unavailability of the TAPA/Target. However, the Linux neighbor cache in an LB might still contain a neighbor entry for the Target with the old, now invalid MAC. Renewal of that entry is delayed by delay_first_probe_time seconds (default: 5), after which ucast_solicit (default: 3) unicast probes are first sent to the invalid MAC in the cache (one per second with the default retrans_time_ms of 1000). So even if NSM heal replaced the NSM interface in the TAPA instantly, there would be a delay of at least 8 seconds before the LBs could learn the new MAC address.
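The 8 second figure can be reproduced with a small calculation. The following is only a sketch: the helper names are hypothetical, and the one second probe interval (retrans_time_ms) is an assumption based on the kernel default rather than something stated in this issue. Values are read from /proc/sys when available and fall back to the defaults otherwise.

```python
from pathlib import Path

# Kernel defaults. retrans_time_ms is an assumption added for this sketch;
# the issue text only names the first two parameters.
DEFAULTS = {
    "delay_first_probe_time": 5.0,  # seconds spent waiting before probing
    "ucast_solicit": 3.0,           # number of unicast probes
    "retrans_time_ms": 1000.0,      # milliseconds between probes
}


def read_neigh_param(name, ifname="default", family="ipv4", root="/proc/sys"):
    """Read one neighbor sysctl, falling back to the default if unreadable."""
    path = Path(root) / "net" / family / "neigh" / ifname / name
    try:
        return float(path.read_text().strip())
    except (OSError, ValueError):
        return DEFAULTS[name]


def min_relearn_delay(delay_first_probe_time, ucast_solicit, retrans_time_ms):
    """Lower bound (seconds) before the stale MAC is given up on."""
    return delay_first_probe_time + ucast_solicit * retrans_time_ms / 1000.0


if __name__ == "__main__":
    params = {name: read_neigh_param(name) for name in DEFAULTS}
    print(f"minimum delay: {min_relearn_delay(**params):.1f} s")
```

Passing a specific interface name instead of "default" to read_neigh_param would show the values that apply to a given NSM interface in the LB.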

Context

zolug commented 1 month ago

Unfortunately, enabling arp_notify in the TAPA won't resolve the problem. NSM has a chained architecture: the interface is first created and its state is set to UP, and the addresses are only configured afterwards. For IPv4, Linux won't send a gARP in that case: https://github.com/torvalds/linux/blob/v6.8/net/ipv4/devinet.c#L1606

For IPv6, ndisc_notify does work though: https://github.com/torvalds/linux/blob/v6.8/net/ipv6/addrconf.c#L4292

There's also the issue of how to set any sysctl in the TAPA, i.e. the application POD. A privileged init container could be used for this purpose. If that's not feasible, a multus network-attachment-definition relying on the tuning CNI could be used to set ndisc_notify for the default interface (so that NSM interfaces created later inherit the default value). Alternatively, the NSM feature set could be extended to include setting the arp_notify/ndisc_notify sysctls for a particular interface.
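As an illustration of the privileged init container variant, the sketch below writes the two notify sysctls for the "default" pseudo-interface. The function name and the idea of running it as the init container's entrypoint are assumptions made for this example, not part of Meridio or NSM; it also assumes /proc/sys is writable in that container.

```python
from pathlib import Path

# Sysctls discussed above. Setting them on "default" is meant to cover
# interfaces that show up later in the same network namespace.
NOTIFY_SYSCTLS = (
    "net/ipv4/conf/default/arp_notify",
    "net/ipv6/conf/default/ndisc_notify",
)


def enable_notify(root="/proc/sys", keys=NOTIFY_SYSCTLS):
    """Write "1" to each sysctl; return the keys that could not be set."""
    failed = []
    for key in keys:
        try:
            (Path(root) / key).write_text("1\n")
        except OSError:
            # Typically a read-only /proc/sys or a missing address family.
            failed.append(key)
    return failed


if __name__ == "__main__":
    for key in enable_notify():
        print(f"could not set {key}")
```

The root parameter exists only so the function can be exercised against a temporary directory; a real init container would use the default.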

zolug commented 1 month ago

Ideas to resolve:

1., Lowering the probe delay by tweaking sysctl params in the LB, primarily the ucast_solicit and delay_first_probe_time values: Unfortunately, for the 'default' interface these two can only be set in the init Linux network namespace, so neither a privileged init container nor a tuning network-attachment-definition can be used. To make this work, the values would instead have to be set per NSM interface by some privileged entity.

2., Use the arping program in the TAPA (NSC) to send out a gARP: This would require at least the CAP_NET_RAW privilege, and any privilege requirement towards TAPA users must be avoided. It could be outsourced to NSM.

3., Change the NSM Target registration protocol, e.g. report when the associated NSM connection's state becomes DELETE, upon which the LBs could remove any cached neighbor entries. This would be sensitive to NSP unavailability.

4., Have the proxy send a promisc gARP instead of the TAPA when a new NSM connection gets established between a TAPA and said proxy: The proxy would need to learn the MAC of the NSM interface in the TAPA, for which it could use a ping or something similar. Once the MAC is learned, it could send a gARP on behalf of the TAPA using the arping program in promisc mode. Due to the promisc mode, it would most probably require more privileges than just CAP_NET_RAW (which is needed to create the RAW socket used to send out the gARP).

5., Using some custom communication channel, the proxy could signal the Target IPs to the LBs when a new TAPA->proxy connection gets established, so that the LBs could remove any cached neighbor entries for them.

6., LBs could rely on NSM's Connection Monitor feature to learn when a TAPA->Proxy NSM connection's state changes to DELETE, and extract the associated IPs from the connection to issue neighbor delete operations. Unfortunately, MonitorScopeSelector in NSM currently only supports PathSegment related filters such as path name, id and token, which are not known by the LB and can change dynamically over time. On the other hand, without filters there's a risk of receiving far too many events, depending on the deployments using NSM. At the moment I don't see any obstacle or drawback to adding a new filter option based on network service name.
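
For ideas 2 and 4, it may help to see what sending a gARP involves and why CAP_NET_RAW comes up. The sketch below builds a gratuitous ARP request by hand and sends it over a raw AF_PACKET socket. It is not how arping, NSM or Meridio do it; the function names are hypothetical, and whether a receiving LB updates its neighbor entry on such a frame depends on its kernel settings.

```python
import socket
import struct

ETH_P_ARP = 0x0806
BROADCAST = b"\xff" * 6


def build_garp(mac: str, ip: str) -> bytes:
    """Build a gratuitous ARP request frame announcing `ip` at `mac`."""
    hw = bytes.fromhex(mac.replace(":", ""))
    addr = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our MAC as source, ARP type.
    eth = BROADCAST + hw + struct.pack("!H", ETH_P_ARP)
    # ARP header: Ethernet, IPv4, 6 byte MACs, 4 byte IPs, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += hw + addr            # sender MAC and sender IP
    arp += b"\x00" * 6 + addr   # target MAC unset, target IP equals sender IP
    return eth + arp


def send_garp(ifname: str, mac: str, ip: str) -> None:
    """Send the frame on `ifname`. Linux only; needs CAP_NET_RAW."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW,
                       socket.htons(ETH_P_ARP)) as sock:
        sock.bind((ifname, 0))
        sock.send(build_garp(mac, ip))
```

Creating the AF_PACKET socket in send_garp is the step that fails without CAP_NET_RAW, which is the privilege concern raised in idea 2. In idea 4 the proxy would call something like this with the TAPA's MAC and IP rather than its own.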