Exa-Networks / exabgp

The BGP swiss army knife of networking
Other
2.06k stars 441 forks source link

'adj-rib out' is incorrect after reloading configuration with an offline router #1126

Open longmalx opened 1 year ago

longmalx commented 1 year ago

Bug Description

When a neighbor is unavailable during configuration changes, we have noticed that the adj-rib may contain routes which have been removed from the configuration. These remain even when the neighbor becomes available again.

All route changes are performed by updating exabgp.conf and issuing a reload.

To Reproduce

I have 2 configuration files (contained in exabgp.conf.zip):

I run commands to set the configuration to empty, restart, add the routes, reload, remove routes, reload

[root@edx-cms exabgp]# cp exabgp.conf.empty /etc/exabgp/exabgp.conf[root@edx-cms exabgp]# systemctl restart exabgp
[root@edx-cms exabgp]# cp exabgp.conf.routes /etc/exabgp/exabgp.conf
[root@edx-cms exabgp]# systemctl reload exabgp
[root@edx-cms exabgp]# exabgpcli show adj-rib out extensive;
neighbor 192.168.200.254 local-ip 0.0.0.0 local-as 65000 peer-as 65000 router-id 10.10.243.170 family-allowed in-open ipv4 flow flow
destination-ipv4 172.16.1.1/32 extended-community redirect:666:666
neighbor 192.168.200.253 local-ip 0.0.0.0 local-as 65000 peer-as 65000 router-id 10.10.243.170 family-allowed in-open ipv4 flow flow
destination-ipv4 172.16.1.1/32 extended-community redirect:666:666
[root@edx-cms exabgp]# cp exabgp.conf.empty /etc/exabgp/exabgp.conf
[root@edx-cms exabgp]# systemctl reload exabgp
[root@edx-cms exabgp]# exabgpcli show adj-rib out extensive;
neighbor 192.168.200.253 local-ip 0.0.0.0 local-as 65000 peer-as 65000 router-id 10.10.243.170 family-allowed in-open ipv4 flow flow
destination-ipv4 172.16.1.1/32 extended-community redirect:666:666
[root@edx-cms exabgp]# exabgpcli show neighbor summary
Peer            AS        up/down state       |     #sent     #recvd
192.168.200.254 65000     0:01:33 established           6          4

It can be seen that the neighbor which is not available has an incorrect entry in the RIB. If I repeat the test with both neighbors connected, then there is no issue.

Expected behavior

After removing all routes I would expect there to be no entries when I run 'adj-rib out' command.

Environment (please complete the following information):

Additional context

Here are the logs from a reproduction - exabgp.log:

[root@edx-cms exabgp]# date
Tue Nov 15 06:29:26 EST 2022
[root@edx-cms exabgp]# cp exabgp.conf.empty /etc/exabgp/exabgp.conf
[root@edx-cms exabgp]# systemctl restart exabgp
[root@edx-cms exabgp]# cp exabgp.conf.routes /etc/exabgp/exabgp.conf
[root@edx-cms exabgp]# systemctl reload exabgp
[root@edx-cms exabgp]# exabgpcli show adj-rib out extensive;
neighbor 192.168.200.254 local-ip 0.0.0.0 local-as 65000 peer-as 65000 router-id 10.10.243.170 family-allowed in-open ipv4 flow flow destination-ipv4 172.16.1.1/32 extended-community redirect:666:666
neighbor 192.168.200.253 local-ip 0.0.0.0 local-as 65000 peer-as 65000 router-id 10.10.243.170 family-allowed in-open ipv4 flow flow destination-ipv4 172.16.1.1/32 extended-community redirect:666:666
[root@edx-cms exabgp]# cp exabgp.conf.empty /etc/exabgp/exabgp.conf
[root@edx-cms exabgp]# systemctl reload exabgp
[root@edx-cms exabgp]# exabgpcli show adj-rib out extensive;
neighbor 192.168.200.253 local-ip 0.0.0.0 local-as 65000 peer-as 65000 router-id 10.10.243.170 family-allowed in-open ipv4 flow flow destination-ipv4 172.16.1.1/32 extended-community redirect:666:666
[root@edx-cms exabgp]# systemctl stop exabgp
thomas-mangin commented 1 year ago

This behaviour is indeed unintuitive and a side effect of the code design (I would have to check to be 100% sure), but there is a reason for it, and as changing it would change current users' expectations. I will look at possibly changing it in master (or documenting it) but probably not in the 4.x releases.

Currently, routes can be accepted via the API or the command line. We are unfortunately not "tagging" the source. If the route is from the API, it should remain announced, if it came from the configuration file, I agree with you position that it should be removed, the code clearly does not do this correctly.

SIGNAL control of ExaBGP was designed before we decided that we should really control the program via the API and it has seen no work since, as otherwise other daemons, such as BIRD, are way more capable. So the focus is on programmatic change.

longmalx commented 1 year ago

Thanks. I guess what I'm not following is why the router being offline would affect the behavior if this is caused by the intrinsic handling of API vs configuration routes. I will have a look into this in more details later this week to see if I can work out any further hints.

thomas-mangin commented 1 year ago

Sorry, I misread the information when first replying, and agree that it is not right. Please discard my answer above.

longmalx commented 1 year ago

I think the problem is that this code in peer.py is never run to action the configuration changes if the neighbor is offline:

                    # we are here following a configuration change
                    if self._neighbor:
                        # see what changed in the configuration
                        self.neighbor.rib.outgoing.replace(self._neighbor.backup_changes, self._neighbor.changes)
                        # do not keep the previous routes in memory as they are not useful anymore
                        self._neighbor.backup_changes = []

I thought I'd tried reconnecting the neighbor and this didn't resolve the RIB, but the code looks like it should.

thomas-mangin commented 1 year ago

Porting the patch to the main branch does cause a regression with the test suite there, so I can not close this issue or release a new 4.2 version until it can be investigated.