Closed ionb42 closed 1 year ago
I can confirm that restarting frr on 10.158.1.42, without changing anything else, resolved the issue.
I suspect I have an explanation for the above, after having thought about it some more. It revolves around net.ipv4.tcp_l3mdev_accept = 1, which is set on the machine running frr as per the recommendation from an older frr manual (v7.1, I believe).
Anyway, the theory is as follows: this particular frr instance does everything inside a non-default VRF, but at some point somebody accidentally configured (and then removed) an MSDP peer in the default VRF. That created an MSDP socket in the default VRF, which likely wasn't closed when the MSDP peer was removed. With tcp_l3mdev_accept=1, both MSDP sockets (the one in the default VRF and the one in the non-default VRF) could accept an incoming MSDP connection, but if the one in the default VRF got it first, it would then reject it because the default VRF had no MSDP peers configured.
I verified this on a test instance. After adding an MSDP peer in listening mode (source IP greater than peer IP) in the non-default VRF, a listening MSDP socket showed up in 'ss -tnap' in that VRF:
LISTEN 0 3 *%mconvrf:639 *:* users:(("pimd",pid=15250,fd=40))
The socket did not go away when I removed that listening peer. Then I added a listening MSDP peer in the default VRF and, sure enough, a second listening MSDP socket showed up, in the default VRF:
LISTEN 0 3 *:639 *:* users:(("pimd",pid=15250,fd=41))
LISTEN 0 3 *%mconvrf:639 *:* users:(("pimd",pid=15250,fd=40))
Again, the listening socket did not go away after removing the listening peer.
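For anyone wanting to check their own box for the same condition, here is a rough sketch that scans `ss -tnlp`-style output and reports which VRFs hold a listening socket on each port. The function name `listeners_by_port` and the regex are my own, not anything from frr, and it assumes the local address is the fourth whitespace-separated column, as in the lines above.

```python
import re
from collections import defaultdict

# Local-address column of ss output, e.g. "*:639" or "*%mconvrf:639".
# The optional "%name" part is the device/VRF the socket is bound to.
LOCAL_RE = re.compile(r"^(?P<addr>[^%\s]+)(?:%(?P<dev>[^:\s]+))?:(?P<port>\d+)$")


def listeners_by_port(ss_output):
    """Map port -> sorted VRF/device names with a LISTEN socket.

    Sockets not bound to any device are reported as "default".
    """
    result = defaultdict(set)
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        match = LOCAL_RE.match(fields[3])
        if match is None:
            continue
        result[int(match.group("port"))].add(match.group("dev") or "default")
    return {port: sorted(devs) for port, devs in result.items()}


# The two lines observed on the test instance.
SAMPLE = """\
LISTEN 0 3 *:639 *:* users:(("pimd",pid=15250,fd=41))
LISTEN 0 3 *%mconvrf:639 *:* users:(("pimd",pid=15250,fd=40))
"""

print(listeners_by_port(SAMPLE))  # {639: ['default', 'mconvrf']}
```

More than one entry for port 639 means a listener exists in more than one VRF, which is the situation described above.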
Arguably tcp_l3mdev_accept shouldn't be set anyway, so that's what we're going to do on our side. Whether the MSDP socket should linger around when the last listening peer is removed is a question I leave to people who know the frr code better. :-)
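A small sketch for checking whether the setting is currently on, reading the usual procfs location for this sysctl. The helper name `l3mdev_accept_enabled` is hypothetical, and the file is only present on kernels built with L3 master device (VRF) support.

```python
from pathlib import Path

# procfs location of net.ipv4.tcp_l3mdev_accept
SYSCTL_PATH = "/proc/sys/net/ipv4/tcp_l3mdev_accept"


def l3mdev_accept_enabled(path=SYSCTL_PATH):
    """Return True if the sysctl is non-zero, False if it is 0,
    or None if the file cannot be read (e.g. no VRF support)."""
    try:
        return Path(path).read_text().strip() != "0"
    except OSError:
        return None


print(l3mdev_accept_enabled())
```

If this prints True, `sysctl -w net.ipv4.tcp_l3mdev_accept=0` turns it off at runtime; whatever sysctl config file set it in the first place needs the same change for it to stay off after a reboot.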
This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.
This issue will be automatically closed in the specified period unless there is further activity.
I ran into a weird issue with MSDP where frr is dropping an incoming connection from a configured MSDP peer, claiming it's not actually configured. This is likely difficult to reproduce and will probably go away on an frr restart (planned for this evening), but I figured people more familiar with the code might be able to spot something in the msdp code anyway. Frr is version 8.1, but a quick perusal of the git commit log shows nothing potentially relevant to this issue in newer releases.
How it ended up in this state: I'm not totally sure. This particular peer was added and removed a few times, that's just about the only thing I can think of. Initially it worked, but then the connection stopped getting established. It's possible that at some point the peer was added outside the vrf block, by mistake, but it was removed afterwards. When I enabled msdp events debugging, this is what I got in the log, repeated every 30 seconds.
Removing and re-adding the peer shows up fine in the log, but frr continues to reject the incoming connection from it:
This is the state of the msdp peer as shown by frr:
There is nothing in the default vrf, either in the running config (see below) or in frr output:
The other 3 MSDP peers are working fine.
The relevant frr config fragment is:
There are BGP peerings with the same peers, all of which are working:
The local 10.158.1.42 interface is properly in the mconvrf vrf:
This is what I see in tcpdump:
So the connection is established and then dropped right away by 10.158.1.42, before any data is exchanged.
[X] Did you check if this is a duplicate issue?
[ ] Did you test it on the latest FRRouting/frr master branch?
The OS is Linux (CentOS7) with a local custom built kernel (vanilla 5.4.49).