Closed liuxyon closed 2 years ago
Can you provide at least a configuration or logs? This is a pretty useless bug report.
2022/03/19 02:40:39 STATIC: [MRN6F-AYZC4] Terminating on signal
2022/03/19 02:40:39 ZEBRA: [XVBTQ-5QTVQ] Terminating on signal
2022/03/19 02:40:39 ZEBRA: [GE156-FS0MJ][EC 100663299] stream_read_try: read failed on fd 39: Connection reset by peer
2022/03/19 02:40:39 ZEBRA: [VXKFG-8SJRV][EC 4043309121] Client 'static' encountered an error and is shutting down.
2022/03/19 02:40:39 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/19 02:40:39 ZEBRA: [JPSA8-5KYEA] client 17 disconnected 141713 bgp routes removed from the rib
2022/03/19 02:40:39 ZEBRA: [S929C-NZR3N] client 17 disconnected 0 bgp nhgs removed from the rib
2022/03/19 02:40:39 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/19 02:40:39 ZEBRA: [JPSA8-5KYEA] client 32 disconnected 0 vnc routes removed from the rib
2022/03/19 02:40:39 ZEBRA: [S929C-NZR3N] client 32 disconnected 0 vnc nhgs removed from the rib
2022/03/19 02:40:39 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/19 02:40:39 ZEBRA: [JPSA8-5KYEA] client 39 disconnected 0 static routes removed from the rib
2022/03/19 02:40:39 ZEBRA: [S929C-NZR3N] client 39 disconnected 0 static nhgs removed from the rib
2022/03/19 02:40:41 ZEBRA: [QS0NJ-H5QKJ] Zebra final shutdown
2022/03/19 02:44:40 ZEBRA: [V98V0-MTWPF] client 17 says hello and bids fair to announce only bgp routes vrf=0
2022/03/19 02:44:40 ZEBRA: [V98V0-MTWPF] client 32 says hello and bids fair to announce only vnc routes vrf=0
2022/03/19 02:44:40 ZEBRA: [V98V0-MTWPF] client 39 says hello and bids fair to announce only static routes vrf=0
2022/03/19 02:44:40 BGP: [GNAYN-F5F1G] Computing addpath IDs for addpath type All
2022/03/19 02:44:40 BGP: [MNE5N-K0G4Z] Resetting peer 2602:fed1:ca1:b::11 due to change in addpath config
2022/03/19 02:44:43 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2a0f:85c1:22:a:1:: in vrf default
2022/03/19 02:46:40 BGP: [MNE5N-K0G4Z] Resetting peer (null) due to change in addpath config
2022/03/19 02:46:42 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2602:fed1:ca1:b::11 in vrf default
2022/03/19 05:08:03 BGP: [MNE5N-K0G4Z] Resetting peer (null) due to change in addpath config
2022/03/19 05:08:05 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2602:fed1:ca1:b::11 in vrf default
Some IP addresses are modified or hidden
!
! Zebra configuration saved from vty
!   2022/03/12 19:53:42
!
frr version 8.1
frr defaults traditional
!
hostname sir
log file /etc/frr/frr.log
!
!
!
router bgp 29753
 bgp router-id 134.196.121.55
 no bgp ebgp-requires-policy
 no bgp default ipv4-unicast
 no bgp network import-check
 neighbor 2602:fed2:ca1:b::11 remote-as 65105
 neighbor 2602:fed2:ca1:b::11 description "my local "
 neighbor 2602:fed2:ca1:b::11 disable-connected-check
 neighbor 2602:fed2:ca1:b::11 update-source wg1
 neighbor 2602:fed2:ca1:b::11 advertisement-interval 0
 neighbor 2602:fed2:ca1:b::11 disable-connected-check
 neighbor 2a09:5c0:fe0:8c::1 remote-as 68057
 neighbor 2a09:5c0:fe0:8c::1 description tunnelbroke
 neighbor 2a09:5c0:fe0:8c::1 update-source AS68057
 neighbor 2a09:5c0:fe0:8c::1 advertisement-interval 0
 neighbor 2a09:5c0:fe0:8c::1 capability dynamic
 neighbor 2a09:5c0:fe0:8c::1 sender-as-path-loop-detection
 neighbor 2a0f:85c1:22:a:1:: remote-as 306628
 neighbor 2a0f:85c1:22:a:1:: description "AS306628 "
 neighbor 2a0f:85c1:22:a:1:: disable-connected-check
 neighbor 2a0f:85c1:22:a:1:: update-source ens19
 neighbor 2a0f:85c1:22:a:1:: capability dynamic
 !
 address-family ipv4 unicast
 exit-address-family
 !
 address-family ipv6 unicast
  network 2602:fed1:ca1::/48
  neighbor 2602:fed1:ca1:b::11 activate
  neighbor 2602:fed1:ca1:b::11 addpath-tx-all-paths
  neighbor 2602:fed1:ca1:b::11 next-hop-self
  neighbor 2602:fed1:ca1:b::11 remove-private-AS all
  neighbor 2602:fed1:ca1:b::11 soft-reconfiguration inbound
  neighbor 2602:fed1:ca1:b::11 prefix-list mycn6out out
  neighbor 2a0f:85c3:22:a:1:: activate
  neighbor 2a0f:85c3:22:a:1:: remove-private-AS all
  neighbor 2a0f:85c3:22:a:1:: soft-reconfiguration inbound
  neighbor 2a0f:85c3:22:a:1:: prefix-list ipv6in in
  neighbor 2a0f:85c3:22:a:1:: prefix-list myv6out out
  neighbor 2a0f:85c3:22:a:1:: route-map A01 in
  neighbor 2a0f:85c3:22:a:1:: route-map 80 out
 exit-address-family
 !
exit
!
ipv6 prefix-list ipv6in seq 105 deny ::1/128
ipv6 prefix-list ipv6in seq 110 deny ::/128
ipv6 prefix-list ipv6in seq 120 deny 3ffe::/16 le 128
ipv6 prefix-list ipv6in seq 130 deny 2001:db8::/32 le 128
ipv6 prefix-list ipv6in seq 140 deny 2001::/32
ipv6 prefix-list ipv6in seq 150 deny 2001::/32 le 128
ipv6 prefix-list ipv6in seq 160 permit 2002::/16
ipv6 prefix-list ipv6in seq 170 deny 2002::/16 le 128
ipv6 prefix-list ipv6in seq 180 deny ::/8 le 128
ipv6 prefix-list ipv6in seq 190 deny fe00::/9 le 128
ipv6 prefix-list ipv6in seq 200 deny ff00::/8 le 128
ipv6 prefix-list ipv6in seq 205 permit 2000::/3 le 48
ipv6 prefix-list ipv6in seq 900 deny ::/0 le 128
ipv6 prefix-list ipv6in seq 999 deny any
ipv6 prefix-list mycn6out seq 5 deny ::1/128
ipv6 prefix-list mycn6out seq 10 deny ::/128
ipv6 prefix-list mycn6out seq 15 deny 3ffe::/16 le 128
ipv6 prefix-list mycn6out seq 20 deny 2001:db8::/32 le 128
ipv6 prefix-list mycn6out seq 25 deny 2001:10::/28 le 128
ipv6 prefix-list mycn6out seq 30 deny 2001:2::/48 le 128
ipv6 prefix-list mycn6out seq 35 deny 100::/64 le 128
ipv6 prefix-list mycn6out seq 40 deny ::/8 le 128
ipv6 prefix-list mycn6out seq 45 deny fc00::/7 le 128
ipv6 prefix-list mycn6out seq 50 deny ff00::/8 le 128
ipv6 prefix-list mycn6out seq 55 deny 2002::/16 le 128
ipv6 prefix-list mycn6out seq 60 deny ::/0 ge 49 le 128
ipv6 prefix-list mycn6out seq 110 permit 2000::/3 le 48
ipv6 prefix-list mycn6out seq 999 deny any
ipv6 prefix-list myv6out seq 50 permit 2602:fed3:7021::/48
ipv6 prefix-list myv6out seq 100 permit 2602:fed1:ca1::/48
ipv6 prefix-list myv6out seq 999 deny any
!
bgp as-path access-list 2 seq 5 deny ^([0-9]+)(\1)+$
bgp as-path access-list 2 seq 10 permit .*
bgp as-path access-list 99 seq 5 permit (4294967[0-1][0-9][0-9])|(42949672[0-8][0-9])|(429496729[0-4])
bgp as-path access-list 99 seq 10 permit (42949[0-5][0-9][0-9][0-9][0-9])|(429496[0-6][0-9][0-9][0-9])
bgp as-path access-list 99 seq 15 permit (429[0-3][0-9][0-9][0-9][0-9][0-9][0-9])|(4294[0-8][0-9][0-9][0-9][0-9][0-9])
bgp as-path access-list 99 seq 20 permit (6449[6-9])|(6450[0-9])|(6451[0-1])|(6553[6-9])|(6554[0-9])|(6555[0-1])_
bgp as-path access-list 99 seq 25 permit 0
bgp as-path access-list 99 seq 30 permit 1310[0-6][0-9]|13107[0-1]
bgp as-path access-list 99 seq 35 permit 23456
bgp as-path access-list 99 seq 40 permit 42[0-8][0-9][0-9][0-9][0-9][0-9][0-9][0-9]
bgp as-path access-list 99 seq 45 permit 6(4(5(1[2-9]|[2-9][0-9])|[6-9][0-9][0-9])|5([0-4][0-9][0-9]|5([0-2][0-9]|3[0-5])))
bgp as-path access-list 99 seq 50 permit 6555[2-9]|655[6-9][0-9]|65[6-9][0-9][0-9]|6[6-9][0-9][0-9][0-9]
bgp as-path access-list 99 seq 55 permit [7-9][0-9][0-9][0-9][0-9]|1[0-2][0-9][0-9][0-9][0-9]|130[0-9][0-9][0-9]
!
!
route-map 80 permit 50
 set local-preference 100
 set metric 0
exit
!
route-map A01 deny 11
 match as-path 99
exit
!
route-map A01 deny 20
 match rpki invalid
exit
!
route-map A01 permit 25
 match as-path 2
exit
!
route-map A01 permit 30
 match rpki notfound
 set local-preference 100
 set metric 0
 set as-path prepend last-as 1
exit
!
route-map A01 permit 50
 match rpki valid
 set local-preference 200
 set metric 0
exit
!
route-map 05 deny 20
 match rpki invalid
exit
!
route-map 05 permit 30
 match rpki notfound
 set metric 0
exit
!
route-map 05 permit 50
 match rpki valid
 set metric 0
exit
!
route-map A02 deny 11
 match as-path 99
exit
!
route-map A02 deny 20
 match rpki invalid
exit
!
route-map A02 permit 25
 match as-path 2
exit
!
route-map A02 permit 30
 match rpki notfound
 set local-preference 100
 set metric 0
 set as-path prepend last-as 5
exit
!
route-map A02 permit 50
 match rpki valid
 set local-preference 100
 set metric 0
 set as-path prepend last-as 3
exit
!
route-map 11 permit 30
 set as-path prepend 29753
exit
!
route-map 13 permit 30
 set as-path prepend 29753 29753 29753
exit
!
!
!
!
rpki
 rpki polling_period 900
 rpki cache 134.196.1.55 3323 preference 1
 rpki cache 2602:fed1:ca1::face 3323 preference 2
exit
!
Is this the whole log? "Terminating on signal" suggests the daemons were killed with SIGTERM or SIGINT.
Yes, that is the whole log. And when I enter a vtysh command, FRR gives no output at all.
I have tested on 4 servers and this happens on all of them. All 4 servers run Ubuntu 21.10.
2022/03/22 20:50:01 ZEBRA: [SWQK6-6JY63][EC 4043309074] 0:254:2602:fed3:7021::/48: Failed to enqueue dataplane install
2022/03/22 20:50:01 ZEBRA: [SWQK6-6JY63][EC 4043309074] 0:254:2a06:e882:119::/48: Failed to enqueue dataplane install
2022/03/22 20:50:01 ZEBRA: [SWQK6-6JY63][EC 4043309074] 0:254:2a0d:2405:511::/48: Failed to enqueue dataplane install
2022/03/22 20:50:01 ZEBRA: [SWQK6-6JY63][EC 4043309074] 0:254:2a10:2f02:100::/48: Failed to enqueue dataplane install
2022/03/22 20:50:03 STATIC: [MRN6F-AYZC4] Terminating on signal
2022/03/22 20:50:04 ZEBRA: [VXKFG-8SJRV][EC 4043309121] Client 'static' encountered an error and is shutting down.
2022/03/22 20:50:05 ZEBRA: [XVBTQ-5QTVQ] Terminating on signal
2022/03/22 20:50:06 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/22 20:50:06 ZEBRA: [JPSA8-5KYEA] client 16 disconnected 141674 bgp routes removed from the rib
2022/03/22 20:50:06 ZEBRA: [S929C-NZR3N] client 16 disconnected 0 bgp nhgs removed from the rib
2022/03/22 20:50:06 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/22 20:50:06 ZEBRA: [JPSA8-5KYEA] client 31 disconnected 0 vnc routes removed from the rib
2022/03/22 20:50:06 ZEBRA: [S929C-NZR3N] client 31 disconnected 0 vnc nhgs removed from the rib
2022/03/22 20:50:06 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/22 20:50:06 ZEBRA: [JPSA8-5KYEA] client 38 disconnected 0 static routes removed from the rib
2022/03/22 20:50:06 ZEBRA: [S929C-NZR3N] client 38 disconnected 0 static nhgs removed from the rib
2022/03/22 20:50:09 ZEBRA: [YAF85-253AP][EC 100663299] buffer_flush_available: write error on fd 43: Broken pipe
2022/03/22 20:50:09 ZEBRA: [THHDB-YPEY6][EC 100663299] vtysh_flush: write error to fd 43, closing
2022/03/22 20:50:09 ZEBRA: [QS0NJ-H5QKJ] Zebra final shutdown
2022/03/22 20:54:22 ZEBRA: [V98V0-MTWPF] client 17 says hello and bids fair to announce only bgp routes vrf=0
2022/03/22 20:54:22 ZEBRA: [V98V0-MTWPF] client 32 says hello and bids fair to announce only vnc routes vrf=0
2022/03/22 20:54:22 ZEBRA: [V98V0-MTWPF] client 39 says hello and bids fair to announce only static routes vrf=0
2022/03/22 20:54:22 BGP: [GNAYN-F5F1G] Computing addpath IDs for addpath type All
2022/03/22 20:54:22 BGP: [MNE5N-K0G4Z] Resetting peer 2602:fede:ca1:b::11 due to change in addpath config
2022/03/22 20:54:25 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2a0f:85c2:22:a:1:: in vrf default
2022/03/22 20:54:54 BGP: [MNE5N-K0G4Z] Resetting peer (null) due to change in addpath config
2022/03/22 20:54:56 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2602:fede:ca1:b::11 in vrf default
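For anyone trying to read dumps like the ones in this thread, here is a minimal Python sketch (not part of FRR) that splits a flattened log back into entries and reports how long zebra stayed down between "Zebra final shutdown" and the first "says hello". The function names are made up for illustration, and the timestamp/daemon layout is assumed from the lines pasted here, not from any FRR specification.

```python
import re
from datetime import datetime

# Start of a file-log entry as seen in this thread: "YYYY/MM/DD HH:MM:SS DAEMON: "
ENTRY_START = re.compile(r"(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ([A-Z0-9]+): ")


def split_entries(text):
    """Split a (possibly flattened) log dump into (timestamp, daemon, message) tuples."""
    text = " ".join(text.split())  # rejoin entries that were wrapped across lines
    matches = list(ENTRY_START.finditer(text))
    entries = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        stamp = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S")
        entries.append((stamp, m.group(2), text[m.end():end].strip()))
    return entries


def restart_gaps(entries):
    """Return (shutdown_time, first_hello_time) pairs, one per zebra restart."""
    gaps, shutdown = [], None
    for stamp, daemon, message in entries:
        if daemon != "ZEBRA":
            continue
        if "Zebra final shutdown" in message:
            shutdown = stamp
        elif "says hello" in message and shutdown is not None:
            gaps.append((shutdown, stamp))
            shutdown = None  # only the first hello after a shutdown counts
    return gaps


# Three entries copied from the second log dump in this thread.
sample = (
    "2022/03/22 20:50:03 STATIC: [MRN6F-AYZC4] Terminating on signal "
    "2022/03/22 20:50:09 ZEBRA: [QS0NJ-H5QKJ] Zebra final shutdown "
    "2022/03/22 20:54:22 ZEBRA: [V98V0-MTWPF] client 17 says hello and bids fair "
    "to announce only bgp routes vrf=0"
)
entries = split_entries(sample)
for down, up in restart_gaps(entries):
    seconds = (up - down).total_seconds()
    print(f"zebra down at {down}, back at {up} ({seconds:.0f}s later)")
```

In both dumps above, the gap between the final shutdown and the first hello is roughly four minutes.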
It's too hard to figure out what you are trying to show when you dump information this way. I've asked you to use the template repeatedly and you never do it. You need to use the template in order for others to make sense of the issues you're reporting.
The same problem was reported on the mailing list:
Hi everyone,
Just following up on my previous note about BGPD hanging in FRR 8.2.2. I now have more info to share.
As background, I've got around 60 BGP feeds total in 30 different "views", to form a route collector for analysis work I'm doing of the global R&E routing table.
This hang seems to have a period of 5-7 days. Using FRR 8.2.2 on Ubuntu 20.04. Not had any issue with FRR 8.1.0; this only started with FRR 8.2.2.
The latest hang earlier today allowed a colleague to grab debug info which I hope will help.
/var/log/frr/frr.log shows entries like this:
Apr 2 11:46:42 frr watchfrr[52904]: [T58XM-TP956][EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
Apr 2 11:46:42 frr watchfrr[52904]: [YFT0P-5Q5YX] Forked background command [pid 1674696]: /usr/lib/frr/watchfrr.sh restart bgpd
Apr 2 11:47:02 frr watchfrr[52904]: [ZE9RA-19PS5] restart bgpd child process 1674696 still running after 20 seconds, sending signal 15
Apr 2 11:47:02 frr watchfrr[52904]: [SK7QP-A2GT9] restart bgpd process 1674696 terminated due to signal 15
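To see how often watchfrr declares a daemon unresponsive, the "state ->" lines can be counted. This is a rough Python sketch; the regular expression is built only from the single "bgpd state -> unresponsive" line quoted above, so matching other daemons or states is an assumption, and `count_transitions` is a hypothetical helper name.

```python
import re
from collections import Counter

# Built from the watchfrr line quoted in this thread:
#   "watchfrr[PID]: [ID][EC n] <daemon> state -> <state> : ..."
STATE_CHANGE = re.compile(r"watchfrr\[\d+\]: (?:\[[^\]]*\])+ (\w+) state -> (\w+)")


def count_transitions(log_text):
    """Count (daemon, new_state) pairs reported by watchfrr in the given log text."""
    return Counter(STATE_CHANGE.findall(log_text))


sample = (
    "Apr 2 11:46:42 frr watchfrr[52904]: [T58XM-TP956][EC 268435457] bgpd state -> "
    "unresponsive : no response yet to ping sent 90 seconds ago\n"
    "Apr 2 11:46:42 frr watchfrr[52904]: [YFT0P-5Q5YX] Forked background command "
    "[pid 1674696]: /usr/lib/frr/watchfrr.sh restart bgpd\n"
)
counts = count_transitions(sample)
for (daemon, state), n in counts.items():
    print(f"{daemon} -> {state}: {n}")
```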
I'll try to replicate and work on this.
@liuxyon could you enable "debug rpki"? Also, it would be useful to have the output of "show memory | include RPKI" (taken as late as possible before it stops responding), plus "ps aufx | grep bgpd" and "free -m".
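Since the hang only shows up every few days, one way to have that output "as late as possible" is to record it periodically. Below is a minimal Python sketch of such a collector. The three commands are the ones requested above; running the show command through `vtysh -c` and the helper names are assumptions for illustration, not something the maintainers prescribed.

```python
import subprocess
from datetime import datetime

# Commands requested in this thread; wrapping the first one in `vtysh -c` is an assumption.
COMMANDS = [
    'vtysh -c "show memory | include RPKI"',
    "ps aufx | grep bgpd",
    "free -m",
]


def run_shell(command):
    """Run one shell command and return stdout followed by stderr as text."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=60
    )
    return result.stdout + result.stderr


def take_snapshot(run=run_shell, now=None):
    """Build one timestamped text snapshot containing the output of every command."""
    now = now or datetime.now()
    parts = [f"==== snapshot {now:%Y-%m-%d %H:%M:%S} ===="]
    for command in COMMANDS:
        parts.append(f"******** ({command}) ********")
        parts.append(run(command).rstrip())
    return "\n".join(parts) + "\n"
```

Calling `take_snapshot()` from an hourly cron job and appending the result to a file would leave the last snapshot before a hang on disk. The `run` parameter exists only so the function can be exercised without FRR installed.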
Since version 8.2.2 is unusable for us, we have all gone back to version 8.1.
I can't replicate this with 100k routes and two full RPKI validators (cache servers), but just found a memory leak (which might be a possible reason, don't know, that's why I asked for more details).
@ton31337 I'm still staying with 8.2.2 and happy to help troubleshoot this. Will get you "debug rpki" output etc. when it next happens. For me, this happens every 5 days (sorry, we'll have to wait). Got 60 peers, probably 5 of them giving me full tables in v4 and v6, the rest just global R&E routes (which is about 20k IPv4 and 6k IPv6). Let me know if anything else is needed.
@pfsinoz cool, let me know when you have more details (as I described in a previous comment).
@ton31337 is the stack trace I have from the last hang of any use at all?
At least it's quite clear that it's RPKI-related...
BTW, just for the record, this is what things look like with "situation normal":
******** (sh memory | include RPKI) *******
BGP RPKI Cache server : 1222452 variable 49351728 2444538 109382928
BGP RPKI Cache server group : 0 120 0 1 120
******** (free -m) ******
              total        used        free      shared  buff/cache   available
Mem:           9961        7030         308           1        2621        2613
Swap:             0           0           0
******** (ps aufx | grep bgpd) ******
root 1707406 0.0 0.0 8328 3036 ? S<s Apr02 0:36 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
frr 1707428 6.2 61.5 6587548 6278184 ? S<sl Apr02 259:33 /usr/lib/frr/bgpd -d -F traditional -Z -M rpki
Now we just have to wait for the next hang - probably 3-4 days time.
@pfsinoz maybe you have more details about this?
@ton31337 Frustratingly it has not hung since! I'm still waiting, still gathering the data every hour. I've had to restart the system once for another reason, but still no hang since. This is the latest snapshot, from about 40 minutes ago:
******** (sh memory | include RPKI) *******
BGP RPKI Cache server : 1231005 variable 49697848 2476764 116684816
BGP RPKI Cache server group : 0 120 0 1 120
******** (free -m) ******
              total        used        free      shared  buff/cache   available
Mem:           9960        7102         322           1        2535        2541
Swap:             0           0           0
******** (ps aufx | grep bgpd) ******
root 771 0.0 0.0 8332 3060 ? S<s Apr13 1:14 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
frr 812 6.3 60.3 6511740 6154024 ? S<sl Apr13 579:04 /usr/lib/frr/bgpd -d -F traditional -Z -M rpki
I've had a couple of instances where "sh ip bgp" on a full feed has caused the clogin session driven by my scripts to time out, but it is not repeatable.
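The two "BGP RPKI Cache server" lines posted in this thread can be compared directly to see whether the allocation keeps growing. A small Python sketch follows. The column meaning (current count, size, total bytes, max count, max bytes) is my reading of the "show memory" layout and should be treated as an assumption, and `parse_memory_line` is a hypothetical helper.

```python
def parse_memory_line(line):
    """Parse one 'show memory' row; column meanings are assumed, see the note above."""
    name, _, rest = line.partition(":")
    count, size, total, max_count, max_bytes = rest.split()
    return {
        "type": name.strip(),
        "count": int(count),          # allocations currently live (assumed)
        "size": size,                 # per-object size, or "variable"
        "total_bytes": int(total),    # bytes currently allocated (assumed)
        "max_count": int(max_count),  # peak allocation count (assumed)
        "max_bytes": int(max_bytes),  # peak bytes allocated (assumed)
    }


# The "situation normal" snapshot and the later snapshot, both copied from this thread.
before = parse_memory_line(
    "BGP RPKI Cache server : 1222452 variable 49351728 2444538 109382928"
)
after = parse_memory_line(
    "BGP RPKI Cache server : 1231005 variable 49697848 2476764 116684816"
)

count_growth = after["count"] - before["count"]
byte_growth = after["total_bytes"] - before["total_bytes"]
print(f"{before['type']}: +{count_growth} allocations, +{byte_growth} bytes")
```

Two data points taken across a restart do not establish a leak on their own; the same comparison over a series of hourly snapshots would be more telling.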
@ton31337 just a quick update... FRR has been up and running for last 12 days now and not exhibited the hang issue. The full BGP feeds do pause for about 20-30 seconds when I do a "sh ip bgp" on them, but I can replicate that on other FRR versions too. I'm left wondering if there were any validator issues that perhaps led to "funny" VRPs being sent to FRR, but I can't even think what those might be. Just weird that the issue has seemingly gone away all by itself. I'm happy to test new/updated code if need be.
@pfsinoz thank you for the update. We are going to revert the latest changes related to connection handling (workarounds) that are fixed in librtr itself (0.8.0). https://github.com/FRRouting/frr/pull/11138
You just have to make sure you have librtr version 0.8.0.
I'm currently facing this exact issue, FRR continues to crash and not recover.
I am running FRR v8.2.2, using the Ubuntu 20.04 and Debian 11 packages on an Ubuntu 21.10 system. The routing system gets stuck for no apparent reason, causing FRR to crash. I haven't found the cause yet; is there any way to find out why?
I would also like to request FRR packages for the latest Ubuntu releases, such as Ubuntu 21.10 and 21.04.