Exa-Networks / exabgp

The BGP swiss army knife of networking

Update messages are missing Next Hop attribute intermittently #1153

Closed gitneep closed 1 year ago

gitneep commented 1 year ago

Hi, can anyone think of a reason why the Next Hop attribute is included in only about half of the BGP UPDATE messages? The affected prefixes are otherwise identical apart from the actual numbers, and they go out in the same group of updates when a BGP session restarts. Both kinds are shown in the log as announce route [prefix] next-hop self, but in a pcap viewed with tcpdump, one has:

        Update Message (2), length: 81
          Origin (1), length: 1, Flags [T]: IGP
          AS Path (2), length: 6, Flags [T]: 64512
          Updated routes:
          [removed]

And the next has:

        Update Message (2), length: 88
          Origin (1), length: 1, Flags [T]: IGP
          AS Path (2), length: 6, Flags [T]: 64512
          Next Hop (3), length: 4, Flags [T]: [removed]
          Updated routes:
          [removed]

This is causing problems on a Cisco IOS XR router:

RP/0/RSP0/CPU0:Mar 28 13:56:22.332 : bgp[1046]: [default-rtr] (ip4u): UPDATE from [removed], prefix [removed] (path ID: none) DENIED due to:
RP/0/RSP0/CPU0:Mar 28 13:56:22.332 : bgp[1046]: [default-rtr] (ip4u):  malformed update 'treat-as-withdraw';

Subsequently the prefixes are dropped and not installed in the routing table. The prefixes in updates that do carry the Next Hop attribute, however, are installed.

This is with python3-exabgp version 4.2.21 on Ubuntu 20.04, and the Cisco is running IOS XR 5.1.3 (I know it's old but this will have to do for now).

Running exabgp -d yields the following for one of the prefixes that suffers from the missing Next Hop attribute as seen by tcpdump:

13:56:17 | 1922630 | process         | command from process service-watchdog : announce route [removed] next-hop self
13:56:17 | 1922630 | reactor         | async | service-watchdog | announce route [removed] next-hop self
13:56:17 | 1922630 | configuration   | . route            | '[removed]' 'next-hop' 'self'
13:56:17 | 1922630 | api             | route added to neighbor [removed] local-ip [removed] local-as 64512 peer-as [removed] router-id [removed] family-allowed in-open, neighbor [removed] local-ip [removed] local-as 64512 peer-as [removed] router-id [removed] family-allowed in-open, router-id [removed] family-allowed in-open : [removed] next-hop self

To reproduce: run an IGP BGP peering session with 1000+ prefixes announced, capture everything on port 179 with tcpdump -v, and look for the UPDATE messages. Some have the Next Hop attribute and some don't.

Expected behavior: the Next Hop attribute is included for every prefix in the same group of update messages, rather than for some and not for others.
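
For anyone who wants to check a capture for the same symptom, here is a rough Python sketch that scans saved tcpdump -v text output and reports UPDATE messages that announce routes without a Next Hop attribute. It assumes the text layout shown in the excerpts above ("Update Message (2), length: N", "Next Hop (3)", "Updated routes:"). The function names are made up for illustration, and it only looks at plain IPv4 unicast announcements, not MP_REACH_NLRI.

```python
import re

# Matches the header line tcpdump -v prints for each BGP UPDATE,
# as seen in the excerpts in this issue.
UPDATE_RE = re.compile(r"^\s*Update Message \(2\), length: (\d+)")


def split_updates(text):
    """Group tcpdump -v output lines by the UPDATE message they follow."""
    updates = []
    current = None
    for line in text.splitlines():
        match = UPDATE_RE.match(line)
        if match:
            current = {"length": int(match.group(1)), "lines": []}
            updates.append(current)
        elif current is not None:
            current["lines"].append(line.strip())
    return updates


def missing_next_hop(text):
    """Return the lengths of UPDATEs that announce routes but have no Next Hop."""
    result = []
    for update in split_updates(text):
        announces = any(l.startswith("Updated routes:") for l in update["lines"])
        has_next_hop = any(l.startswith("Next Hop (3)") for l in update["lines"])
        if announces and not has_next_hop:
            result.append(update["length"])
    return result
```

Usage would be something like missing_next_hop(open("capture.txt").read()), where capture.txt is a hypothetical file holding the tcpdump -v output.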

thomas-mangin commented 1 year ago

It needs some investigation, but with a finger in the air, I would suggest an issue with next-hop self. If so, using an explicit IP should hide the issue until it can be fixed.
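
As a rough illustration of that workaround, an API process could write the announce command with an explicit address instead of self. This is only a sketch: the helper names are hypothetical, the addresses are documentation placeholders, and it assumes the process talks to ExaBGP by printing commands on stdout, as the service-watchdog process in the log above appears to do.

```python
import ipaddress
import sys


def announce_command(prefix, next_hop):
    """Build an 'announce route' command with an explicit next-hop IP.

    Hypothetical helper: rejects 'self' so the workaround is always applied.
    """
    if next_hop == "self":
        raise ValueError("pass an explicit IP address instead of 'self'")
    # Validate both values; ipaddress raises ValueError on bad input.
    network = ipaddress.ip_network(prefix)
    address = ipaddress.ip_address(next_hop)
    return f"announce route {network} next-hop {address}"


def announce(prefix, next_hop, out=sys.stdout):
    """Write the command on stdout, where ExaBGP reads API commands."""
    out.write(announce_command(prefix, next_hop) + "\n")
    out.flush()
```

For example, announce("198.51.100.0/24", "192.0.2.1") would emit announce route 198.51.100.0/24 next-hop 192.0.2.1, with the second argument being the local address of the session.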

thomas-mangin commented 1 year ago

Changed the code to always convert "self" to the "IP" upon parsing the configuration.

gitneep commented 1 year ago

This worked, getting all the prefixes announced with Next Hop now, thanks for the fast response!