Open fedepaol opened 4 months ago
Full running config of 172.18.0.5 (the receiver)
####### Show running config
Building configuration...
Current configuration:
!
frr version 10.0.1_git
frr defaults traditional
hostname cd1e47ac22ac
log file /tmp/frr.log
log timestamp precision 3
!
debug zebra nht
debug bgp neighbor-events
debug bgp nht
debug bgp updates in
debug bgp updates out
debug bfd peer
!
password zebra
!
router bgp 4200000000
bgp router-id 172.18.0.5
no bgp ebgp-requires-policy
no bgp default ipv4-unicast
no bgp network import-check
neighbor 172.18.0.2 remote-as 64512
neighbor 172.18.0.3 remote-as 64512
neighbor 172.18.0.4 remote-as 64512
!
address-family ipv4 unicast
neighbor 172.18.0.2 activate
neighbor 172.18.0.2 next-hop-self
neighbor 172.18.0.3 activate
neighbor 172.18.0.3 next-hop-self
neighbor 172.18.0.4 activate
neighbor 172.18.0.4 next-hop-self
exit-address-family
exit
!
route-map RMAP permit 10
set ipv6 next-hop prefer-global
exit
!
end
Running config for the instance running on the cluster
logs of the instance running on the cluster frrlogs.txt
route-map 172.18.0.5-out permit 1
match ip address prefix-list 172.18.0.5-100-ipv4-localpref-prefixes
on-match next
set local-preference 100
exit
!
route-map 172.18.0.5-out permit 2
match ip address prefix-list 172.18.0.5-100-ipv4-localpref-prefixes
on-match next
set local-preference 100
exit
Two identical match/set with different sequence numbers?
Btw, could you please enable debug route-map detail
? It would at least tell what's going on a bit more.
Also, I'm trying to understand the use case here with on-match next
. Could you drop it and test without on-match next
, because you don't use it at all (as it should be).
route-map 172.18.0.5-out permit 1 match ip address prefix-list 172.18.0.5-100-ipv4-localpref-prefixes on-match next set local-preference 100 exit ! route-map 172.18.0.5-out permit 2 match ip address prefix-list 172.18.0.5-100-ipv4-localpref-prefixes on-match next set local-preference 100 exit
Two identical match/set with different sequence numbers?
Ugh, I guess we can remove them
Btw, could you please enable
debug route-map detail
? It would at least tell what's going on a bit more.
Yup, will do and re-run
Also, I'm trying to understand the use case here with
on-match next
. Could you drop it and test withouton-match next
, because you don't use it at all (as it should be).
The idea (maybe I am mis-using it) is to apply the local pref and then validate the ip later on to the following maps. I need it in case I want also to apply a community to the same prefixes
@ton31337 here we go:
2024/07/12 15:53:58.328 BGP: [T5JFA-13199] subgroup_process_announce_selected: p=192.168.0.0/24, selected=0x7f56a3a81330
2024/07/12 15:53:58.328 BGP: [RYF1Z-ZKDRS] route_match_address_prefix_list: Prefix List 172.30.0.2-100-ipv4-localpref-prefixes specified does not exist defaulting to NO_MATCH
2024/07/12 15:53:58.328 BGP: [RYF1Z-ZKDRS] route_match_address_prefix_list: Prefix List 172.30.0.2-10:100-ipv4-community-prefixes specified does not exist defaulting to NO_MATCH
2024/07/12 15:53:58.328 BGP: [RYF1Z-ZKDRS] route_match_address_prefix_list: Prefix List 172.30.0.2-150-ipv4-localpref-prefixes specified does not exist defaulting to NO_MATCH
2024/07/12 15:53:58.328 BGP: [RYF1Z-ZKDRS] route_match_address_prefix_list: Prefix List 172.30.0.2-10:101-ipv4-community-prefixes specified does not exist defaulting to NO_MATCH
2024/07/12 15:53:58.328 BGP: [RYF1Z-ZKDRS] route_match_address_prefix_list: Prefix List 172.30.0.2-pl-ipv4 specified does not exist defaulting to NO_MATCH
2024/07/12 15:53:58.328 BGP: [KZK58-6T4Y6] No best match sequence for pfx: 192.168.0.0/24 in route-map: 172.30.0.2-out, result: no match
Show running config and full log attached : showrunning.txt.gz
I checked only the first, seem to be there, so I'd say it's not related to the redundant match (although it makes sense to remove it).
Is this happening during frr-reload, on boot, container restart?
Definitely during reload. It's a battery of e2e tests, where on each test we run a different configuration and do some assertions.
Could it be possible to have a full reload log? frr-reload.py --debug --reload /etc/frr/frr.conf. I'd like to understand what actually triggers this flake...
Here's another run:
This is the neighbor complaining about missing routes (172.18.0.5)
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
Network Next Hop Metric LocPrf Weight Path
*>i192.168.2.0/24 172.18.0.2 0 100 0 i
*=i 172.18.0.4 0 100 0 i
Displayed 1 routes and 2 total paths
This is 172.18.0.3:
Instance default:
BGP table version is 2, local router ID is 172.18.0.3, vrf id 0
Default local pref 100, local AS 64512
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
Network Next Hop Metric LocPrf Weight Path
*> 192.168.2.0/24 0.0.0.0 0 32768 i
*> 192.169.2.0/24 0.0.0.0 0 32768 i
Displayed 2 routes and 2 total paths
ip prefix-list 172.18.0.5-pl-ipv4 seq 1 permit 192.168.2.0/24
Still have the filter
2024/07/23 15:29:13.553 BGP: [VERY4-P6JC8] 172.18.0.5(Unknown) [Update:SEND] 192.168.2.0/24 is filtered by route-map '172.18.0.5-out'
Last reload event: lastreload.txt.gz
Sequence of reloads in the last 10 minutes reload.txt.gz
Every reload sequence starts with
Caught SIGHUP and acquired lock! Reloading FRR..
and ends with FRR reloaded successfully! 2 seconds
Every cycle, we do a frr-reload.py --test
first and a frr-reload.py --overwrite
later.
As discussed on slack, I attached the logs of the reloader and the generated configuration when running with 9.1 (the generated config is the same regardless).
First, this is the wrong sequence of operations:
ip prefix-list 172.30.0.2-pl-ipv4 seq 1 permit 192.168.2.0/24
ip prefix-list 172.30.0.2-pl-ipv4 seq 2 permit 192.169.2.0/24
route-map 172.30.0.2-out permit 1
match ip address prefix-list 172.30.0.2-pl-ipv4
route-map 172.30.0.2-out permit 2
match ipv6 address prefix-list 172.30.0.2-pl-ipv4
ipv6 prefix-list 172.30.0.2-pl-ipv4 seq 3 deny any <<<<<<<<<<<<<<<<<
ipv6 prefix-list ...
should be created before route-map
statements to avoid noisy logs like below.
Looking at the logs you provided, it's not clear which AFI triggered this output (I assume IPv6):
2024/07/12 15:53:58.328 BGP: [RYF1Z-ZKDRS] route_match_address_prefix_list: Prefix List 172.30.0.2-10:100-ipv4-community-prefixes specified does not exist defaulting to NO_MATCH
Could you move ipv6 prefix-list ...
creation above route-maps and re-test. I'd like to get rid of any false positives.
Also, could you enable log commands
to see what actual commands were executed and in which order.
Sorry for the extreme delay @ton31337 , here I am with the additional information:
Here we have 172.18.0.5
not getting routes from 172.18.0.2
.
From the logs the usual:
2024/09/12 16:01:41.695 BGP: [WJC21-WPGZF] u109:s109 announcing routes to 172.18.0.5, combined into 1 peers
2024/09/12 16:01:41.695 BGP: [T5JFA-13199] subgroup_process_announce_selected: p=192.168.2.0/24, selected=0x7fb5d00e47d0
2024/09/12 16:01:41.695 BGP: [KZK58-6T4Y6] No best match sequence for pfx: 192.168.2.0/24 in route-map: 172.18.0.5-out, result: no match
2024/09/12 16:01:41.695 BGP: [H5AW4-JFYQC] Route-map: 172.18.0.5-out, prefix: 192.168.2.0/24, result: deny
2024/09/12 16:01:41.695 BGP: [VERY4-P6JC8] 172.18.0.5(Unknown) [Update:SEND] 192.168.2.0/24 is filtered by route-map '172.18.0.5-out'
I applied the requested changes (define the prefix lists before the maps and adding log commands in the configuration):
Attaching a tgz with:
Also, note that in a few reloads I am seeing
BGP is already running; AS is 64512
line 684: Failure to communicate[13] to bgpd, line: router bgp 64513 vrf red
line 685: Warning[4]...: early exit from config file
[2101|bgpd] Configuration file[/etc/frr/frr.conf] processing failure: 13
[2094|mgmtd] done
2024-09-12 15:57:51,557 [91mWARNING[0m: frr-reload.py failed due to
vtysh (exec file) exited with status 13
Not sure it's related or not. I am going to attach also a file with more reload events (reload_full.txt) which contains them too.
Description
This is happening in https://github.com/metallb/frr-k8s/pull/166 , where I am bumping the version used in frr-k8s from 9.1 to 10.0.1. The FRR-K8s CI is quite stable, while after the bump I am seeing a few flakes.
172.18.0.2 tries to announce 192.168.0.0/24 and 172.18.0.5 is not receiving it
I am going to attach the full configuration, but adding here the relevant snippets:
Neighbor route map:
Prefix lists:
Route maps:
Show bgp ipv4 on 172.18.0.2
Show bgp ipv4 on 172.18.0.5:
(as we can see some other instances were able to push the routes.
The logs say the prefix is being filtered by the route-map:
Version
How to reproduce
It's a flake, the same test sometimes passes and some other time it does not. I have the feeling that it is associated to prefixlist with community / local preferences. On the other hand, it happens often enough to be able to provide the right dump commands after a test fails.
Expected behavior
The neighbor should receive the prefixes
Actual behavior
The instance running on the cluster is not advertising the prefixes.
Additional context
No response
Checklist