Exa-Networks / exabgp

The BGP swiss army knife of networking
Other
2.05k stars 440 forks source link

Too frustrating: converting from ExaBGP 3.4 to 4.2.21 #1179

Open he32 opened 9 months ago

he32 commented 9 months ago
[exabgp.daemon]
daemonize = true

[exabgp.log]
destination = syslog
short = true
level = WARNING

and adding the new

[exabgp.api]
ack = false

because I use the healthcheck module and I don't think it consumes any input, I earlier got only

Sep 18 19:09:54 oliven -: healthcheck-resolver[17649]: send announces for EXIT state to ExaBGP

in the log. Nothing else.

Commenting out "daemonize" and "destination" and running with -d gets more detailed information: it turns out that /var/run/exabgp/exabgp.in and exabgp.out were missing, they apparently need to exist as named pipes and be writable by nobody.

announce route a.b.c.d/32 next-hop self med 100
announce route 2001:x:0:y:z::1/128 next-hop self med 101

This used to work nicely with ExaBGP 3.4, but apparently will not with ExaBGP 4.2.21. Running ExaBGP in the foreground with debug reveals

command from process recursor : announce route a.b.c.d/32 next-hop self med 100 
async | recursor | announce route a.b.c.d/32 next-hop self med 100
no neighbor matching the command : announce route a.b.c.d/32 next-hop self med 100

So... ExaBGP is now bone-headed enough to insist on an explicit neighbor specification from the healthcheck script, even if there is only a single address-family-matching BGP session? That makes it impossible to share the healthcheck config file between installations, because the neighbor address (which is "of course" site-specific) creeps over from the ExaBGP config file itself into the healthcheck configuration file, which is unfortunate.

  --neighbor NEIGHBOR   advertise the route to the selected neigbors

However, how NEIGHBOR goes from singular to plural (neighbors) is not explained. What is the syntax for supplying multiple NEIGHBORs?

[exabgp.api]
ack = false

That has the unfortunate side-effect that exabgpcli gets partially stuck after executing a command, as has been noted already in #1164.

So, sorry, more of a rant to vent my frustration of the process, which is so far not finished.

I've already read https://github.com/Exa-Networks/exabgp/wiki/Migration-from-3.4-to-4.0

I also tried (and failed) to make sense of https://github.com/Exa-Networks/exabgp/wiki/Configuration-:-Process

because this is, IMHO, to be charitable "light on explanation of actual semantics / behaviour modifications". I think I figured out that ExaBGP in 4.x needs "encoder text" added to the process section if using the "healthcheck" module, but that is by now more than a guess than anything else.

Answers or hints would be greatly appreciated.

thomas-mangin commented 9 months ago

You have quite eloquently expressed how bad I am at being responsible for the maintenance of this code and I will gladly accept some help: I gave you write access to the repository years ago.

neighbour * should only require a few lines of changes if it does not already work.

Now if you wish to know why these decisions were made, even bone-headed ones, happy to explain.

Many came from trying to be nice to users, a mistake I do not do as much anymore, and trying to not add too much complexity to a code base which was authored in 2009, before async, and is way from ideal and would require weeks of work to bring to modern python standards and even more to the level of quality of rustyBGP.

Some people find useful and I am doing my best to keep it working.

I worked on many prototypes over the years in one in Go - with some brain dead ideas, one V - which has decoding and the start of a Yang based cli but V is not suitable to make this serious, and even within this repo for my first attempt to Yang support. But I would rather rewrite the whole code base in Odin.. or perhaps wait to see Jai released and if I do.

I am also not as young as I used to be. I have a business to run which 50 people rely on for their living. I also need time to relax to remain healthy and time for my family so it is a case of https://xkcd.com/2347/

still open from you: https://github.com/Exa-Networks/exabgp/issues/1121

and some other motivational issue: https://github.com/Exa-Networks/exabgp/issues/1038

he32 commented 9 months ago

Well, now, I realize that was deserved, and I'm sorry if my rant rubbed you the wrong way.

neighbor * would be useful, I think, and would resolve at least three of the points I mentioned.

Getting write access to e.g. keep the exabgp.conf man page updated is one thing, decoding the actual supported grammar from the code is quite another, and me being a python newbie doesn't exactly help in that endavour. I'll see what I can come up with on that front.

thomas-mangin commented 9 months ago

I have removed some of the "marketing" blurb on the README. I will look at neighbor *

thomas-mangin commented 9 months ago

Support for neighbor *

thomas-mangin commented 9 months ago

I appreciate and appreciated your help over the years. I currently have code in 4.2.22 which I believe is not correct for all users (following the acceptance of a patch from a user) and can not release 4.2.22 without a rollback of his contribution or figuring out why my own testing led me to believe there is an issue but the main branch should be fine.

thomas-mangin commented 9 months ago

I am surprised to hear that normal announce did not work for you, I have a test to make sure it did:

❯ ./qa/bin/functional encoding 2     
 ✖ ✖ ✓ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖ ✖

❯ ./qa/bin/functional encoding --list

The available tests are:

 0  api-add-remove            L  api-no-respawn            g  conf-ipv46routes4family  
 1  api-announce-star         M  api-notification          h  conf-ipv46routes6family  
 2  api-announce              N  api-open                  i  conf-ipv6grouping        
 3  api-announcement          O  api-reload                j  conf-l2vpn               
 4  api-api                   P  api-rib                   k  conf-largecommunity      
 5  api-attributes-path       Q  api-rr-rib                l  conf-name                
 6  api-attributes-vpn        R  api-teardown              m  conf-new-v4              
 7  api-attributes            S  api-vpls                  n  conf-new-v6              
 8  api-broken-flow           T  api-vpnv4                 o  conf-no-asn4             
 9  api-check                 U  conf-addpath              p  conf-parity              
 A  api-eor                   V  conf-aggregator           q  conf-path-information    
 B  api-fast                  W  conf-attributes           r  conf-prefix-sid          
 C  api-flow                  X  conf-ebgp                 s  conf-split               
 D  api-ipv4                  Y  conf-extended-attributes  t  conf-srv6-mup            
 E  api-ipv6                  Z  conf-flow-redirect        u  conf-template            
 F  api-manual-eor            a  conf-flow                 v  conf-unknowncap          
 G  api-multi-neighbor        b  conf-generic-attribute    w  conf-vpn                 
 H  api-multiple-api          c  conf-group-limit          x  conf-watchdog            
 J  api-nexthop-self          e  conf-ipself4             
thomas-mangin commented 9 months ago

ah! you have two sessions, it may be why ...

he32 commented 9 months ago

Do you want to try this totally untested patch:

Will do first thing after tomorrows morning meeting.

Thanks!

he32 commented 9 months ago

Hm, the first obstacle I butted into was that healthcheck.py didn't want to accept neighbor *. I changed the argument type from ip_address to str to work around that. The second obstacle is that the neighbor spec is still not matching; exabgp (after an initial tweak to the debug output) says "no matching neighbors".

I'm wondering if limit.py should instead do:

def match_neighbor(description, name):
    for string in description:
        if string.strip() == 'neighbor *':

However, with that I still get no matching neighbors (with tweaked logging):

command from process recursor : neighbor * announce route 3.4.5.6/32 next-hop self med 100 
async | recursor | neighbor * announce route 3.4.5.6/32 next-hop self med 100
no neighbor matching the description(s) [['neighbor *']] for command : announce route 3.4.5.6/32 next-hop self med 100

So I expanded logging in announce.py to say

            descriptions, command = extract_neighbors(line)
            peers = match_neighbors(reactor.peers(service), descriptions)
            if not peers:
                self.log_failure('no neighbor matching the description(s) %s (neighbors %s, service %s) for command : %s' % (repr(descriptions), repr(reactor.peers(service)), repr(service), command))

and with that I get logged

command from process recursor : neighbor * announce route 3.4.5.6/32 next-hop self med 100 
async | recursor | neighbor * announce route 3.4.5.6/32 next-hop self med 100
no neighbor matching the description(s) [['neighbor *']] (neighbors [], service 'recursor') for command : announce route 3.4.5.6/32 next-hop self med 100

So... reactor.peers(service) doesn't appear to actually present the list of configured neighbors. Any idea what should be used instead? Or is this just a mis-configuration on my part? The service is the name of the process from exabgp.conf which injected this command. What do I use to couple the configured neighbors with the configured process(es)? The output from exabgp started with --validate does not provide any hints in that direction that I can see. Description? Sounds unlikely.

This, unfortunately, comes back to the documentation of exabgp.conf, or points to the massive confusion on my part.

longmalx commented 9 months ago

I appreciate and appreciated your help over the years. I currently have code in 4.2.22 which I believe is not correct for all users (following the acceptance of a patch from a user) and can not release 4.2.22 without a rollback of his contribution or figuring out why my own testing led me to believe there is an issue but the main branch should be fine.

Does this refer to this change (https://github.com/Exa-Networks/exabgp/pull/1128) ?

We have been running this for quite a while without issue. If you can add a ticket with the issues you see I can try and find time to help have a look to try and root cause what's happening.

he32 commented 9 months ago

Just for reference, here's a redacted exabgp.conf that I'm currently trying to make work with exabgp 4.2.21:

neighbor 158.38.x.y {
    peer-as 224;
    local-as 65053;
    router-id 158.38.x.z;
    local-address 158.38.x.z;
    family {
        ipv4 unicast;
    }
}
neighbor 2001:700:0::a {
    peer-as 224;
    local-as 65053;
    router-id 158.38.x.z;
    local-address 2001:700:0::b;
    family {
        ipv6 unicast;
    }
}
process recursor {
        run /usr/pkg/bin/python3.10 -m exabgp healthcheck --config /etc/exabgp/resolver-hc.conf;
        encoder text;
}

and my slightly redacted resolver-hc.conf looks like this:

# debug
name = resolver
interval = 5
fast-interval = 1
ip = 158.38.c.d
ip = 2001:700::a
neighbor = *
no-ip-setup
withdraw-on-down
command = /usr/local/bin/resolver-health-check.pl -c /etc/exabgp/resolver-checks.txt -r 127.0.0.1
pid = /var/run/exabgp/resolver-hc.pid
silent
he32 commented 9 months ago

More...

I've found lib/exabgp/reactor/loop.py which defines peers(self, service=''), and it's

    def peers(self, service=''):
        matching = []
        for peer_name, peer in self._peers.items():
            if service == '':
                matching.append(peer_name)
                continue
            if service.startswith('api-internal-cli'):
                matching.append(peer_name)
                continue
            if service in peer.neighbor.api['processes']:
                matching.append(peer_name)
                continue
        return matching

However, I have not been able to find out how anything ends up in peer.neighbor.api['processes'], much less how that is configured. I did find

        neighbor.api = ParseAPI.flatten(local.pop('api', {}))

but that is way too much magic in one sentence for me to decipher and translate into how that's configured.

I'm reasonably good with find ... grep, but appear to lack the requisite python knowledge. "Help!"

thomas-mangin commented 9 months ago

@he32 Sorry, I deleted the patch I posted here, added two commits on the same evening on the master and 4.2 branch with neighbor * and added some testing. All seemed to be working. I did not test the healthcheck program itself tho - but I modified it to use neighbor *

And for some reason I did not get emails from GH for your posts.

thomas-mangin commented 9 months ago

@longmalx oh, thank you for following issues!

Yes, I performed a test (I can not recall what) which showed that in some scenario something was off with what ended up in (or out of) the RIB - as I was in between things I did not document what caused it and was later not able to remember what it was 🤦

It may be a bug in the RIB code itself which is really ugly, right now I do not know. I am pleased to hear that you have no issue.

I wrote when merging "I have merged the patch but now wonder if there will be an issue with API routes? I need to check." .. it may have been what I checked ...

he32 commented 9 months ago

I did not test the healthcheck program itself tho - but I modified it to use neighbor *

Hmm... In my case the output from the healthcheck program looks reasonably sane:

neighbor * announce route 158.38.0.x/32 next-hop self med 100
neighbor * announce route 2001:700:0::a/128 next-hop self med 101
neighbor * announce route 158.38.0.x/32 next-hop self med 100
neighbor * announce route 2001:700:0::a/128 next-hop self med 101

So the logging I extended is inside exabgp itself (in announce_route() inside lib/exabgp/reactor/api/command/announce.py, and it's apparently a problem with knowing which neighbors to match the "neighbor *" against. The code uses the process name as "service" in reactor.peers(service), and that ends up as an empty list in my case, so "of course" exabgp complains that it can't find any matching neighbors.

And ... this goes back to the configuration which I posted above -- what (if anything) is missing to make the reactor.peers(service) where service == "recursor" return the configured neighbors? I.e. how do you configure so that the process is "related to" the configured neighbors? (If that is what's needed, that is.)

And for some reason I did not get emails from GH for your posts.

That part I cannot explain, sorry.

he32 commented 9 months ago

And ... this goes back to the configuration which I posted above -- what (if anything) is missing to make the reactor.peers(service) where service == "recursor" return the configured neighbors? I.e. how do you configure so that the process is "related to" the configured neighbors? (If that is what's needed, that is.)

Never mind, found what's needed to tie a neighbor to a process:

    api {
        processes [ recursor ];
    }

inside the neighbor specification.

The issue I then get stuck on is that when processing the two lines (which occur about the same time):

neighbor * announce route 158.38.0.x/32 next-hop self med 100
neighbor * announce route 2001:700:0::a/128 next-hop self med 101

The IPv4 route gets announced to the IPv4 neighbor, but the IPv6 route does not get announced to the IPv6 neighbor, because it first tries the IPv4 route to the IPv6 neighbor, and that won't work, and exabgp logs

the route family is not configured on neighbor

So, it stops doing anything on the first occurrance of that error for the neighbor?

Hm, that pushes me in the direction of two processes, one for IPv4, one for IPv6, which doubles the cost of the health-checking.

thomas-mangin commented 9 months ago

Thank you for this investigation, please give me a few days to look into fixing this.

thomas-mangin commented 9 months ago

I looked at the code and this string, while reported as an error, does not stop any processing. The route with the wrong family for the peer will be ignored. So if the neighbours have explicit ipv4 only, and ipv6 only, it will be verbose but should work as expected.

he32 commented 9 months ago

I looked at the code and this string, while reported as an error, does not stop any processing. The route with the wrong family for the peer will be ignored. So if the neighbours have explicit ipv4 only, and ipv6 only, it will be verbose but should work as expected.

That did not match with what I observed at the time on the neighboring BGP speaker (a Juniper router). However, I will re-test to confirm, and look a bit closer at the issue. To be continued.

thomas-mangin commented 8 months ago

@he32 I have changed the behaviour on error returning as it made sense, but the patch above may not resolve your issue. I am still looking into it.