Exa-Networks / exabgp

The BGP swiss army knife of networking
Other
2.1k stars 447 forks source link

Exabgp with down healthcheck is sending an update on every check response #1241

Open TheMysteriousX opened 1 month ago

TheMysteriousX commented 1 month ago

Bug Report

Describe the bug

We have Exabgp on a number of hosts, using the healthcheck.py script for detecting if the service is online or not. A number are intentionally down at any given time as the application runs active/passive.

When the healthcheck starts returning down, we're seeing Exabgp send an update message upstream on every health check execution.

This doesn't occur on active hosts - it sends an update when the state transitions and that's it, which makes me think this is either unintended behaviour, or I screwed up the config.

It's not a massive issue, but it does put avoidable load of the upstream router.

    1 0.000000000  10.0.0.1 → 10.0.0.0  BGP 101 UPDATE Message
    2 0.102721118  10.0.0.0 → 10.0.0.1  TCP 74 179 → 45377 [ACK] Seq=1 Ack=28 Win=16384 Len=0 MD5
    3 5.030728754  10.0.0.1 → 10.0.0.0  BGP 101 UPDATE Message
    4 5.133023966  10.0.0.0 → 10.0.0.1  TCP 74 179 → 45377 [ACK] Seq=1 Ack=55 Win=16384 Len=0 MD5
    5 10.060778144  10.0.0.1 → 10.0.0.0  BGP 101 UPDATE Message
    6 10.163184629  10.0.0.0 → 10.0.0.1  TCP 74 179 → 45377 [ACK] Seq=1 Ack=82 Win=16384 Len=0 MD5
    7 14.451585443  10.0.0.0 → 10.0.0.1  BGP 93 KEEPALIVE Message
    8 14.451638920  10.0.0.1 → 10.0.0.0  TCP 74 45377 → 179 [ACK] Seq=82 Ack=20 Win=251 Len=0 MD5
    9 15.090508718  10.0.0.1 → 10.0.0.0  BGP 101 UPDATE Message
   10 15.193626499  10.0.0.0 → 10.0.0.1  TCP 74 179 → 45377 [ACK] Seq=20 Ack=109 Win=16384 Len=0 MD5

To Reproduce

I've removed one other neighbour config (IPv6, the application handles it differently so there's no healthcheck on that).

Addresses have been changed, but have been changed deterministically.

template {
    neighbor app {
        router-id "10.0.0.1";
        peer-as "4200000007";
        local-as "4200000006";
        md5-password "password";
    }
}

process watch {
    run python -m exabgp healthcheck --cmd "/usr/bin/check-port --port 18622 --transport udp" --ip "10.220.6.0/24" --no-ip-setup --withdraw-on-down --next-hop "10.0.0.1";
    encoder text;
}

neighbor 10.0.0.0 {
    inherit app;

    local-address 10.0.0.1;

    capability {
        graceful-restart;
    }

    family {
        ipv4 unicast;
    }

    api {
        processes [ watch ];
    }
}

Expected behavior

Updates should be sent on state transition, not on every failure.

Environment (please complete the following information):

Additional context

21:33:48 | 119157 | welcome       | Thank you for using ExaBGP
21:33:48 | 119157 | version       | 4.2.21
21:33:48 | 119157 | interpreter   | 3.9.18 (main, Aug 23 2024, 00:00:00)  [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
21:33:48 | 119157 | os            | Linux host 5.14.0-427.37.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Sep 13 12:41:50 EDT 2024 x86_64
21:33:48 | 119157 | installation  | /usr
21:33:48 | 119157 | cli           | could not find the named pipes (exabgp.in and exabgp.out) required for the cli
21:33:48 | 119157 | cli           | we scanned the following folders (the number is your PID):
21:33:48 | 119157 | cli control   |  - /run/exabgp/
21:33:48 | 119157 | cli control   |  - /run/0/
21:33:48 | 119157 | cli control   |  - /run/
21:33:48 | 119157 | cli control   |  - /var/run/exabgp/
21:33:48 | 119157 | cli control   |  - /var/run/0/
21:33:48 | 119157 | cli control   |  - /var/run/
21:33:48 | 119157 | cli control   |  - /usr/run/exabgp/
21:33:48 | 119157 | cli control   |  - /usr/run/0/
21:33:48 | 119157 | cli control   |  - /usr/run/
21:33:48 | 119157 | cli control   |  - /usr/var/run/exabgp/
21:33:48 | 119157 | cli control   |  - /usr/var/run/0/
21:33:48 | 119157 | cli control   |  - /usr/var/run/
21:33:48 | 119157 | cli control   | please make them in one of the folder with the following commands:
21:33:48 | 119157 | cli control   | > mkfifo /var/log/run/exabgp.{in,out}
21:33:48 | 119157 | cli control   | > chmod 600 /var/log/run/exabgp.{in,out}
21:33:48 | 119157 | configuration | performing reload of exabgp 4.2.21
21:33:48 | 119157 | configuration | > template         |
21:33:48 | 119157 | configuration | > neighbor         | 'app'
21:33:48 | 119157 | configuration | . router-id        | '10.0.0.1'
21:33:48 | 119157 | configuration | . peer-as          | '4200000007'
21:33:48 | 119157 | configuration | . local-as         | '4200000006'
21:33:48 | 119157 | configuration | . md5-password     | 'password'
21:33:48 | 119157 | configuration | < neighbor         |
21:33:48 | 119157 | configuration | < template         |
21:33:48 | 119157 | configuration | > process          | 'watch'
21:33:48 | 119157 | configuration | . run              | 'python' '-m' 'exabgp' 'healthcheck' '--cmd' '/usr/bin/check-port --port 18622 --transport udp' '--ip' '10.220.6.0/24' '--no-ip-setup' '--withdraw-on-down' '--next-hop' '10.0.0.1'
21:33:48 | 119157 | configuration | . encoder          | 'text'
21:33:48 | 119157 | configuration | < process          |
21:33:48 | 119157 | configuration | > neighbor         | '10.0.0.0'
21:33:48 | 119157 | configuration | . inherit          | 'app'
21:33:48 | 119157 | configuration | . local-address    | '10.0.0.1'
21:33:48 | 119157 | configuration | > capability       |
21:33:48 | 119157 | configuration | . graceful-restart |
21:33:48 | 119157 | configuration | < capability       |
21:33:48 | 119157 | configuration | > family           |
21:33:48 | 119157 | configuration | . ipv4             | 'unicast'
21:33:48 | 119157 | configuration | < family           |
21:33:48 | 119157 | configuration | > api              |
21:33:48 | 119157 | configuration | . processes        | '[' 'watch' ']'
21:33:48 | 119157 | configuration | < api              |
21:33:48 | 119157 | configuration | < neighbor         |
21:33:48 | 119157 | configuration | > neighbor         | '2002::1'
21:33:48 | 119157 | configuration | . inherit          | 'app'
21:33:48 | 119157 | configuration | . local-address    | '2002::9'
21:33:48 | 119157 | configuration | > capability       |
21:33:48 | 119157 | configuration | . graceful-restart |
21:33:48 | 119157 | configuration | < capability       |
21:33:48 | 119157 | configuration | > family           |
21:33:48 | 119157 | configuration | . ipv6             | 'unicast'
21:33:48 | 119157 | configuration | < family           |
21:33:48 | 119157 | configuration | > announce         |
21:33:48 | 119157 | configuration | > ipv6             |
21:33:48 | 119157 | configuration | . unicast          | '2002:f805::/64' 'next-hop' '2002::9'
21:33:48 | 119157 | configuration | < ipv6             |
21:33:48 | 119157 | configuration | < announce         |
21:33:48 | 119157 | configuration | < neighbor         |
21:33:48 | 119157 | reactor       | new peer: neighbor 10.0.0.0 local-ip 10.0.0.1 local-as 4200000006 peer-as 4200000007 router-id 10.0.0.1 family-allowed in-open
21:33:48 | 119157 | reactor       | new peer: neighbor 2002::1 local-ip 2002::9 local-as 4200000006 peer-as 4200000007 router-id 10.0.0.1 family-allowed in-open
21:33:48 | 119157 | reactor       | loaded new configuration successfully
21:33:48 | 119157 | process       | forked process watch
21:33:48 | 119157 | reactor       | initialising connection to peer-2
21:33:48 | 119157 | reactor       | initialising connection to peer-1
21:33:48 | 119157 | outgoing-1    | attempting connection to 2002::1:179
21:33:48 | 119157 | outgoing-1    | attempting connection to 10.0.0.0:179
21:33:48 | 119157 | outgoing-1    | sending TCP payload (  59) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 003B 0104 5BA0 00B4 C251 2F09 1E02 0601 0400 0200 0102 0641 04FA 56EA 0602 0840 0680 B400 0201 8002 0206 00
21:33:48 | 119157 | outgoing-1    | sending TCP payload (  59) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 003B 0104 5BA0 00B4 C251 2F09 1E02 0601 0400 0100 0102 0641 04FA 56EA 0602 0840 0680 B400 0101 8002 0206 00
21:33:48 | 119157 | outgoing-1    | >> OPEN version=4 asn=23456 hold_time=180 router_id=10.0.0.1 capabilities=[Multiprotocol(ipv6 unicast), Extended Message(65535), Graceful Restart Flags 0x8 Time 180 ipv6/unicast=0x80, ASN4(4200000006)]
21:33:48 | 119157 | ka-outgoing-1 | receive-timer 60 second(s) left
21:33:48 | 119157 | outgoing-1    | >> OPEN version=4 asn=23456 hold_time=180 router_id=10.0.0.1 capabilities=[Multiprotocol(ipv4 unicast), Extended Message(65535), Graceful Restart Flags 0x8 Time 180 ipv4/unicast=0x80, ASN4(4200000006)]
21:33:48 | 119157 | ka-outgoing-1 | receive-timer 60 second(s) left
21:33:48 | 119157 | outgoing-1    | received complete TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 003F 01
21:33:48 | 119157 | outgoing-1    | received complete TCP payload (  44) 045B A000 5A0A D50A 0122 0206 0104 0002 0001 0202 8000 0202 0200 0204 4002 4078 0206 4104 FA56 EA07 0202 4700
21:33:48 | 119157 | outgoing-1    | << message of type OPEN
21:33:48 | 119157 | outgoing-1    | << OPEN version=4 asn=23456 hold_time=90 router_id=10.213.10.1 capabilities=[Multiprotocol(ipv6 unicast), Route Refresh, Graceful Restart Flags 0x4 Time 120 , ASN4(4200000007), Unassigned 71, Route Refresh]
21:33:48 | 119157 | outgoing-1    | sending TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0013 04
21:33:48 | 119157 | outgoing-1    | received complete TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 003F 01
21:33:48 | 119157 | outgoing-1    | received complete TCP payload (  44) 045B A000 5A0A D50A 0122 0206 0104 0001 0001 0202 8000 0202 0200 0204 4002 4078 0206 4104 FA56 EA07 0202 4700
21:33:48 | 119157 | outgoing-1    | << message of type OPEN
21:33:48 | 119157 | outgoing-1    | << OPEN version=4 asn=23456 hold_time=90 router_id=10.213.10.1 capabilities=[Multiprotocol(ipv4 unicast), Route Refresh, Graceful Restart Flags 0x4 Time 120 , ASN4(4200000007), Unassigned 71, Route Refresh]
21:33:48 | 119157 | outgoing-1    | sending TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0013 04
21:33:48 | 119157 | outgoing-1    | >> KEEPALIVE (OPENCONFIRM)
21:33:48 | 119157 | outgoing-1    | >> KEEPALIVE (OPENCONFIRM)
21:33:48 | 119157 | ka-outgoing-1 | receive-timer 90 second(s) left
21:33:48 | 119157 | ka-outgoing-1 | receive-timer 90 second(s) left
WARNING[healthcheck] Check command was unsuccessful: 1
INFO[healthcheck] Output of check command: b'Port 18622 is not being listened on\n'
21:33:48 | 119157 | outgoing-1    | received complete TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0013 04
21:33:48 | 119157 | outgoing-1    | << message of type KEEPALIVE
21:33:48 | 119157 | outgoing-1    | received complete TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0013 04
21:33:48 | 119157 | outgoing-1    | << message of type KEEPALIVE
21:33:48 | 119157 | reactor       | connected to peer-1 with outgoing-1 10.0.0.1-10.0.0.0
21:33:48 | 119157 | outgoing-1    | received complete TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0017 02
21:33:48 | 119157 | outgoing-1    | received complete TCP payload (   4) 0000 0000
21:33:48 | 119157 | outgoing-1    | << message of type UPDATE
21:33:48 | 119157 | parser        | parsing UPDATE (   4) 0000 0000
21:33:48 | 119157 | peer-1        | << UPDATE #1
21:33:48 | 119157 | peer-1        |    UPDATE #1 nlri  (   4) eor 1/1 (ipv4 unicast)
21:33:48 | 119157 | outgoing-1    | sending TCP payload (  23) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0017 0200 0000 00
21:33:48 | 119157 | reactor       | connected to peer-2 with outgoing-1 2002::9-2002::1
21:33:48 | 119157 | outgoing-1    | received complete TCP payload (  19) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 001E 02
21:33:48 | 119157 | outgoing-1    | received complete TCP payload (  11) 0000 0007 900F 0003 0002 01
21:33:48 | 119157 | outgoing-1    | << message of type UPDATE
21:33:48 | 119157 | parser        | parsing UPDATE (  11) 0000 0007 900F 0003 0002 01
21:33:48 | 119157 | peer-2        | << UPDATE #1
21:33:48 | 119157 | peer-2        |    UPDATE #1 nlri  (  11) eor 2/1 (ipv6 unicast)
21:33:48 | 119157 | outgoing-1    | sending TCP payload (  69) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 0045 0200 0000 2E40 0101 0040 0206 0201 FA56 EA06 800E 1E00 0201 1020 0106 3000 3C10 4500 0000 0000 0000 0900 4020 0106 3000 3CF8 05
21:33:48 | 119157 | outgoing-1    | >> 1 UPDATE(s)
21:33:48 | 119157 | outgoing-1    | >> EOR ipv4 unicast
21:33:48 | 119157 | outgoing-1    | sending TCP payload (  30) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 001E 0200 0000 0790 0F00 0300 0201
21:33:48 | 119157 | peer-1        | >> EOR(s)
21:33:48 | 119157 | outgoing-1    | >> EOR ipv6 unicast
21:33:48 | 119157 | peer-2        | >> EOR(s)
21:33:49 | 119157 | ka-outgoing-1 | receive-timer 89 second(s) left
21:33:49 | 119157 | ka-outgoing-1 | send-timer 29 second(s) left
21:33:49 | 119157 | ka-outgoing-1 | receive-timer 89 second(s) left
21:33:49 | 119157 | ka-outgoing-1 | send-timer 29 second(s) left
WARNING[healthcheck] Check command was unsuccessful: 1
INFO[healthcheck] Output of check command: b'Port 18622 is not being listened on\n'
21:33:50 | 119157 | ka-outgoing-1 | receive-timer 88 second(s) left
21:33:50 | 119157 | ka-outgoing-1 | send-timer 28 second(s) left
21:33:50 | 119157 | ka-outgoing-1 | receive-timer 88 second(s) left
21:33:50 | 119157 | ka-outgoing-1 | send-timer 28 second(s) left
WARNING[healthcheck] Check command was unsuccessful: 1
INFO[healthcheck] Output of check command: b'Port 18622 is not being listened on\n'
INFO[healthcheck] send announces for DOWN state to ExaBGP
21:33:50 | 119157 | process       | command from process watch : withdraw route 10.220.6.0/24 next-hop 10.0.0.1
21:33:50 | 119157 | reactor       | async | watch | withdraw route 10.220.6.0/24 next-hop 10.0.0.1
21:33:50 | 119157 | configuration | . route            | '10.220.6.0/24' 'next-hop' '10.0.0.1'
21:33:50 | 119157 | api           | route removed from neighbor 10.0.0.0 local-ip 10.0.0.1 local-as 4200000006 peer-as 4200000007 router-id 10.0.0.1 family-allowed in-open : 10.220.6.0/24 next-hop 10.0.0.1
21:33:50 | 119157 | process       | responding to watch : done
21:33:51 | 119157 | ka-outgoing-1 | receive-timer 87 second(s) left
21:33:51 | 119157 | ka-outgoing-1 | send-timer 27 second(s) left
21:33:51 | 119157 | ka-outgoing-1 | receive-timer 87 second(s) left
21:33:51 | 119157 | ka-outgoing-1 | send-timer 27 second(s) left
21:33:52 | 119157 | ka-outgoing-1 | receive-timer 86 second(s) left
21:33:52 | 119157 | ka-outgoing-1 | send-timer 26 second(s) left
21:33:52 | 119157 | ka-outgoing-1 | receive-timer 86 second(s) left
21:33:52 | 119157 | ka-outgoing-1 | send-timer 26 second(s) left
21:33:53 | 119157 | ka-outgoing-1 | receive-timer 85 second(s) left
21:33:53 | 119157 | ka-outgoing-1 | send-timer 25 second(s) left
21:33:53 | 119157 | ka-outgoing-1 | receive-timer 85 second(s) left
21:33:53 | 119157 | ka-outgoing-1 | send-timer 25 second(s) left
21:33:54 | 119157 | ka-outgoing-1 | receive-timer 84 second(s) left
21:33:54 | 119157 | ka-outgoing-1 | send-timer 24 second(s) left
21:33:54 | 119157 | ka-outgoing-1 | receive-timer 84 second(s) left
21:33:54 | 119157 | ka-outgoing-1 | send-timer 24 second(s) left
21:33:55 | 119157 | ka-outgoing-1 | receive-timer 83 second(s) left
21:33:55 | 119157 | ka-outgoing-1 | send-timer 23 second(s) left
21:33:55 | 119157 | ka-outgoing-1 | receive-timer 83 second(s) left
21:33:55 | 119157 | ka-outgoing-1 | send-timer 23 second(s) left
WARNING[healthcheck] Check command was unsuccessful: 1
INFO[healthcheck] Output of check command: b'Port 18622 is not being listened on\n'
INFO[healthcheck] send announces for DOWN state to ExaBGP
21:33:55 | 119157 | process       | command from process watch : withdraw route 10.220.6.0/24 next-hop 10.0.0.1
21:33:55 | 119157 | reactor       | async | watch | withdraw route 10.220.6.0/24 next-hop 10.0.0.1
21:33:55 | 119157 | configuration | . route            | '10.220.6.0/24' 'next-hop' '10.0.0.1'
21:33:55 | 119157 | api           | route removed from neighbor 10.0.0.0 local-ip 10.0.0.1 local-as 4200000006 peer-as 4200000007 router-id 10.0.0.1 family-allowed in-open : 10.220.6.0/24 next-hop 10.0.0.1
21:33:55 | 119157 | process       | responding to watch : done
21:33:55 | 119157 | outgoing-1    | sending TCP payload (  27) FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF 001B 0200 0418 0ADC 0600 00
21:33:55 | 119157 | outgoing-1    | >> 1 UPDATE(s)
thomas-mangin commented 1 month ago

I believe that if you enable adj-rib out on the peer, exabgp should behave as you wish.

TheMysteriousX commented 1 month ago

If you mean the upstream, it's a Juniper device which I think only operates with (the equivalent of) adj-rib-in/out.

On exabgp it looks like it defaults to on - I set in the config anyway and toggled it between false/true, but didn't see any change in the rate of update messages sent upstream.

# exabgpcli show neighbor configuration 10.0.0.0
neighbor 10.0.0.0 {
  description "";
  router-id 10.0.0.1;
  host-name host;
  domain-name host;
  local-address 10.0.0.1;
  local-as 4200000006;
  peer-as 4200000007;
  hold-time 180;
  rate-limit disable;
  manual-eor false;

  passive false;
  group-updates true;
  auto-flush true;
  adj-rib-in true;
  adj-rib-out true;
  md5-password "password";
  md5-base64 auto;
  md5-ip "10.0.0.1";

  capability {
    asn4 enable;
    route-refresh disable;
    graceful-restart 180;
    nexthop disable;
    add-path disable;
    multi-session disable;
    operational disable;
    aigp disable;
  }
  family {
    ipv4 unicast;
  }
  nexthop {
  }
  add-path {
  }
  api {
    processes [ watch ];
  }
}
thomas-mangin commented 3 days ago

I need to look at the code, but we allow users to remove routes that exabgp did not originate. It may be more of a feature than a bug.