FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/

8.2.2 stuck with high number of peers/routes with RPKI #10826

Closed liuxyon closed 2 years ago

liuxyon commented 2 years ago

I am running FRR v8.2.2 on Ubuntu 21.10, using the packages built for Ubuntu 20.04 and Debian 11. The routing system gets stuck for no apparent reason, causing FRR to crash. I haven't found the cause yet; is there any way to find out why?

I would also like to request FRR packages for the latest Ubuntu releases, such as 21.04 and 21.10.

ton31337 commented 2 years ago

Can you provide at least a configuration?

donaldsharp commented 2 years ago

Or logs? This is a pretty useless bug report.

liuxyon commented 2 years ago

2022/03/19 02:40:39 STATIC: [MRN6F-AYZC4] Terminating on signal
2022/03/19 02:40:39 ZEBRA: [XVBTQ-5QTVQ] Terminating on signal
2022/03/19 02:40:39 ZEBRA: [GE156-FS0MJ][EC 100663299] stream_read_try: read failed on fd 39: Connection reset by peer
2022/03/19 02:40:39 ZEBRA: [VXKFG-8SJRV][EC 4043309121] Client 'static' encountered an error and is shutting down.
2022/03/19 02:40:39 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/19 02:40:39 ZEBRA: [JPSA8-5KYEA] client 17 disconnected 141713 bgp routes removed from the rib
2022/03/19 02:40:39 ZEBRA: [S929C-NZR3N] client 17 disconnected 0 bgp nhgs removed from the rib
2022/03/19 02:40:39 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/19 02:40:39 ZEBRA: [JPSA8-5KYEA] client 32 disconnected 0 vnc routes removed from the rib
2022/03/19 02:40:39 ZEBRA: [S929C-NZR3N] client 32 disconnected 0 vnc nhgs removed from the rib
2022/03/19 02:40:39 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/19 02:40:39 ZEBRA: [JPSA8-5KYEA] client 39 disconnected 0 static routes removed from the rib
2022/03/19 02:40:39 ZEBRA: [S929C-NZR3N] client 39 disconnected 0 static nhgs removed from the rib
2022/03/19 02:40:41 ZEBRA: [QS0NJ-H5QKJ] Zebra final shutdown
2022/03/19 02:44:40 ZEBRA: [V98V0-MTWPF] client 17 says hello and bids fair to announce only bgp routes vrf=0
2022/03/19 02:44:40 ZEBRA: [V98V0-MTWPF] client 32 says hello and bids fair to announce only vnc routes vrf=0
2022/03/19 02:44:40 ZEBRA: [V98V0-MTWPF] client 39 says hello and bids fair to announce only static routes vrf=0
2022/03/19 02:44:40 BGP: [GNAYN-F5F1G] Computing addpath IDs for addpath type All
2022/03/19 02:44:40 BGP: [MNE5N-K0G4Z] Resetting peer 2602:fed1:ca1:b::11 due to change in addpath config
2022/03/19 02:44:43 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2a0f:85c1:22:a:1:: in vrf default
2022/03/19 02:46:40 BGP: [MNE5N-K0G4Z] Resetting peer (null) due to change in addpath config
2022/03/19 02:46:42 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2602:fed1:ca1:b::11 in vrf default
2022/03/19 05:08:03 BGP: [MNE5N-K0G4Z] Resetting peer (null) due to change in addpath config
2022/03/19 05:08:05 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2602:fed1:ca1:b::11 in vrf default

liuxyon commented 2 years ago

Some IP addresses are modified or hidden

!
! Zebra configuration saved from vty
! 2022/03/12 19:53:42
!
frr version 8.1
frr defaults traditional
!
hostname sir
log file /etc/frr/frr.log
!
!
!
router bgp 29753
 bgp router-id 134.196.121.55
 no bgp ebgp-requires-policy
 no bgp default ipv4-unicast
 no bgp network import-check
 neighbor 2602:fed2:ca1:b::11 remote-as 65105
 neighbor 2602:fed2:ca1:b::11 description "my local "
 neighbor 2602:fed2:ca1:b::11 disable-connected-check
 neighbor 2602:fed2:ca1:b::11 update-source wg1
 neighbor 2602:fed2:ca1:b::11 advertisement-interval 0
 neighbor 2602:fed2:ca1:b::11 disable-connected-check
 neighbor 2a09:5c0:fe0:8c::1 remote-as 68057
 neighbor 2a09:5c0:fe0:8c::1 description tunnelbroke
 neighbor 2a09:5c0:fe0:8c::1 update-source AS68057
 neighbor 2a09:5c0:fe0:8c::1 advertisement-interval 0
 neighbor 2a09:5c0:fe0:8c::1 capability dynamic
 neighbor 2a09:5c0:fe0:8c::1 sender-as-path-loop-detection
 neighbor 2a0f:85c1:22:a:1:: remote-as 306628
 neighbor 2a0f:85c1:22:a:1:: description "AS306628 "
 neighbor 2a0f:85c1:22:a:1:: disable-connected-check
 neighbor 2a0f:85c1:22:a:1:: update-source ens19
 neighbor 2a0f:85c1:22:a:1:: capability dynamic
 !
 address-family ipv4 unicast
 exit-address-family
 !
 address-family ipv6 unicast
  network 2602:fed1:ca1::/48
  neighbor 2602:fed1:ca1:b::11 activate
  neighbor 2602:fed1:ca1:b::11 addpath-tx-all-paths
  neighbor 2602:fed1:ca1:b::11 next-hop-self
  neighbor 2602:fed1:ca1:b::11 remove-private-AS all
  neighbor 2602:fed1:ca1:b::11 soft-reconfiguration inbound
  neighbor 2602:fed1:ca1:b::11 prefix-list mycn6out out
  neighbor 2a0f:85c3:22:a:1:: activate
  neighbor 2a0f:85c3:22:a:1:: remove-private-AS all
  neighbor 2a0f:85c3:22:a:1:: soft-reconfiguration inbound
  neighbor 2a0f:85c3:22:a:1:: prefix-list ipv6in in
  neighbor 2a0f:85c3:22:a:1:: prefix-list myv6out out
  neighbor 2a0f:85c3:22:a:1:: route-map A01 in
  neighbor 2a0f:85c3:22:a:1:: route-map 80 out
 exit-address-family
 !
exit
!
ipv6 prefix-list ipv6in seq 105 deny ::1/128
ipv6 prefix-list ipv6in seq 110 deny ::/128
ipv6 prefix-list ipv6in seq 120 deny 3ffe::/16 le 128
ipv6 prefix-list ipv6in seq 130 deny 2001:db8::/32 le 128
ipv6 prefix-list ipv6in seq 140 deny 2001::/32
ipv6 prefix-list ipv6in seq 150 deny 2001::/32 le 128
ipv6 prefix-list ipv6in seq 160 permit 2002::/16
ipv6 prefix-list ipv6in seq 170 deny 2002::/16 le 128
ipv6 prefix-list ipv6in seq 180 deny ::/8 le 128
ipv6 prefix-list ipv6in seq 190 deny fe00::/9 le 128
ipv6 prefix-list ipv6in seq 200 deny ff00::/8 le 128
ipv6 prefix-list ipv6in seq 205 permit 2000::/3 le 48
ipv6 prefix-list ipv6in seq 900 deny ::/0 le 128
ipv6 prefix-list ipv6in seq 999 deny any
ipv6 prefix-list mycn6out seq 5 deny ::1/128
ipv6 prefix-list mycn6out seq 10 deny ::/128
ipv6 prefix-list mycn6out seq 15 deny 3ffe::/16 le 128
ipv6 prefix-list mycn6out seq 20 deny 2001:db8::/32 le 128
ipv6 prefix-list mycn6out seq 25 deny 2001:10::/28 le 128
ipv6 prefix-list mycn6out seq 30 deny 2001:2::/48 le 128
ipv6 prefix-list mycn6out seq 35 deny 100::/64 le 128
ipv6 prefix-list mycn6out seq 40 deny ::/8 le 128
ipv6 prefix-list mycn6out seq 45 deny fc00::/7 le 128
ipv6 prefix-list mycn6out seq 50 deny ff00::/8 le 128
ipv6 prefix-list mycn6out seq 55 deny 2002::/16 le 128
ipv6 prefix-list mycn6out seq 60 deny ::/0 ge 49 le 128
ipv6 prefix-list mycn6out seq 110 permit 2000::/3 le 48
ipv6 prefix-list mycn6out seq 999 deny any
ipv6 prefix-list myv6out seq 50 permit 2602:fed3:7021::/48
ipv6 prefix-list myv6out seq 100 permit 2602:fed1:ca1::/48
ipv6 prefix-list myv6out seq 999 deny any
!
bgp as-path access-list 2 seq 5 deny ^([0-9]+)(\1)+$
bgp as-path access-list 2 seq 10 permit .*
bgp as-path access-list 99 seq 5 permit (4294967[0-1][0-9][0-9])|(42949672[0-8][0-9])|(429496729[0-4])
bgp as-path access-list 99 seq 10 permit (42949[0-5][0-9][0-9][0-9][0-9])|(429496[0-6][0-9][0-9][0-9])
bgp as-path access-list 99 seq 15 permit (429[0-3][0-9][0-9][0-9][0-9][0-9][0-9])|(4294[0-8][0-9][0-9][0-9][0-9][0-9])
bgp as-path access-list 99 seq 20 permit (6449[6-9])|(6450[0-9])|(6451[0-1])|(6553[6-9])|(6554[0-9])|(6555[0-1])_
bgp as-path access-list 99 seq 25 permit 0
bgp as-path access-list 99 seq 30 permit 1310[0-6][0-9]|13107[0-1]
bgp as-path access-list 99 seq 35 permit 23456
bgp as-path access-list 99 seq 40 permit 42[0-8][0-9][0-9][0-9][0-9][0-9][0-9][0-9]
bgp as-path access-list 99 seq 45 permit 6(4(5(1[2-9]|[2-9][0-9])|[6-9][0-9][0-9])|5([0-4][0-9][0-9]|5([0-2][0-9]|3[0-5])))
bgp as-path access-list 99 seq 50 permit 6555[2-9]|655[6-9][0-9]|65[6-9][0-9][0-9]|6[6-9][0-9][0-9][0-9]
bgp as-path access-list 99 seq 55 permit [7-9][0-9][0-9][0-9][0-9]|1[0-2][0-9][0-9][0-9][0-9]|130[0-9][0-9][0-9]
!
!
route-map 80 permit 50
 set local-preference 100
 set metric 0
exit
!
route-map A01 deny 11
 match as-path 99
exit
!
route-map A01 deny 20
 match rpki invalid
exit
!
route-map A01 permit 25
 match as-path 2
exit
!
route-map A01 permit 30
 match rpki notfound
 set local-preference 100
 set metric 0
 set as-path prepend last-as 1
exit
!
route-map A01 permit 50
 match rpki valid
 set local-preference 200
 set metric 0
exit
!
route-map 05 deny 20
 match rpki invalid
exit
!
route-map 05 permit 30
 match rpki notfound
 set metric 0
exit
!
route-map 05 permit 50
 match rpki valid
 set metric 0
exit
!
route-map A02 deny 11
 match as-path 99
exit
!
route-map A02 deny 20
 match rpki invalid
exit
!
route-map A02 permit 25
 match as-path 2
exit
!
route-map A02 permit 30
 match rpki notfound
 set local-preference 100
 set metric 0
 set as-path prepend last-as 5
exit
!
route-map A02 permit 50
 match rpki valid
 set local-preference 100
 set metric 0
 set as-path prepend last-as 3
exit
!
route-map 11 permit 30
 set as-path prepend 29753
exit
!
route-map 13 permit 30
 set as-path prepend 29753 29753 29753
exit
!
!
!
!
rpki
 rpki polling_period 900
 rpki cache 134.196.1.55 3323 preference 1
 rpki cache 2602:fed1:ca1::face 3323 preference 2
exit
!

ton31337 commented 2 years ago

Is this the whole log? "Terminating on signal" suggests the daemons were killed with SIGTERM or SIGINT.
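
To see how long the daemons stayed down after each "Terminating on signal" entry, a log like the one above can be scanned for shutdown/restart pairs. This is only a minimal sketch: the function name `restart_gaps` and the regular expression are illustrative assumptions based on the line format quoted in this thread, and the log does not say which signal was delivered, so the sketch cannot tell that either.

```python
import re
from datetime import datetime

# Timestamp and daemon prefix as they appear in the log lines quoted in this
# thread (assumed format: "YYYY/MM/DD HH:MM:SS DAEMON: message").
ENTRY = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+): (.*)$")

def restart_gaps(lines):
    """Return (stopped_at, back_at, seconds) for each shutdown/restart pair."""
    gaps = []
    stopped_at = None
    for line in lines:
        m = ENTRY.match(line.strip())
        if not m:
            continue
        when = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S")
        message = m.group(3)
        if "Terminating on signal" in message and stopped_at is None:
            # First shutdown message of this outage.
            stopped_at = when
        elif "says hello" in message and stopped_at is not None:
            # First zebra client reconnecting after the outage.
            gaps.append((stopped_at, when, int((when - stopped_at).total_seconds())))
            stopped_at = None
    return gaps

sample = [
    "2022/03/19 02:40:39 STATIC: [MRN6F-AYZC4] Terminating on signal",
    "2022/03/19 02:40:39 ZEBRA: [XVBTQ-5QTVQ] Terminating on signal",
    "2022/03/19 02:40:41 ZEBRA: [QS0NJ-H5QKJ] Zebra final shutdown",
    "2022/03/19 02:44:40 ZEBRA: [V98V0-MTWPF] client 17 says hello and bids fair to announce only bgp routes vrf=0",
]
for stopped, back, seconds in restart_gaps(sample):
    print(f"stopped {stopped}, clients back {back}, gap {seconds}s")
```

For the sample lines, taken from the log above, the gap between shutdown and the first client reconnecting is 241 seconds.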

liuxyon commented 2 years ago

Yes, that is the whole log. And when I enter a vtysh command, FRR gives no output at all. (screenshot: IMG_20220322_184530)

I have tested on 4 servers and it happens on all of them. All 4 servers run Ubuntu 21.10.

liuxyon commented 2 years ago

2022/03/22 20:50:01 ZEBRA: [SWQK6-6JY63][EC 4043309074] 0:254:2602:fed3:7021::/48: Failed to enqueue dataplane install
2022/03/22 20:50:01 ZEBRA: [SWQK6-6JY63][EC 4043309074] 0:254:2a06:e882:119::/48: Failed to enqueue dataplane install
2022/03/22 20:50:01 ZEBRA: [SWQK6-6JY63][EC 4043309074] 0:254:2a0d:2405:511::/48: Failed to enqueue dataplane install
2022/03/22 20:50:01 ZEBRA: [SWQK6-6JY63][EC 4043309074] 0:254:2a10:2f02:100::/48: Failed to enqueue dataplane install
2022/03/22 20:50:03 STATIC: [MRN6F-AYZC4] Terminating on signal
2022/03/22 20:50:04 ZEBRA: [VXKFG-8SJRV][EC 4043309121] Client 'static' encountered an error and is shutting down.
2022/03/22 20:50:05 ZEBRA: [XVBTQ-5QTVQ] Terminating on signal
2022/03/22 20:50:06 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/22 20:50:06 ZEBRA: [JPSA8-5KYEA] client 16 disconnected 141674 bgp routes removed from the rib
2022/03/22 20:50:06 ZEBRA: [S929C-NZR3N] client 16 disconnected 0 bgp nhgs removed from the rib
2022/03/22 20:50:06 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/22 20:50:06 ZEBRA: [JPSA8-5KYEA] client 31 disconnected 0 vnc routes removed from the rib
2022/03/22 20:50:06 ZEBRA: [S929C-NZR3N] client 31 disconnected 0 vnc nhgs removed from the rib
2022/03/22 20:50:06 ZEBRA: [YDZ55-W3VM6] release_daemon_table_chunks: Released 0 table chunks
2022/03/22 20:50:06 ZEBRA: [JPSA8-5KYEA] client 38 disconnected 0 static routes removed from the rib
2022/03/22 20:50:06 ZEBRA: [S929C-NZR3N] client 38 disconnected 0 static nhgs removed from the rib
2022/03/22 20:50:09 ZEBRA: [YAF85-253AP][EC 100663299] buffer_flush_available: write error on fd 43: Broken pipe
2022/03/22 20:50:09 ZEBRA: [THHDB-YPEY6][EC 100663299] vtysh_flush: write error to fd 43, closing
2022/03/22 20:50:09 ZEBRA: [QS0NJ-H5QKJ] Zebra final shutdown
2022/03/22 20:54:22 ZEBRA: [V98V0-MTWPF] client 17 says hello and bids fair to announce only bgp routes vrf=0
2022/03/22 20:54:22 ZEBRA: [V98V0-MTWPF] client 32 says hello and bids fair to announce only vnc routes vrf=0
2022/03/22 20:54:22 ZEBRA: [V98V0-MTWPF] client 39 says hello and bids fair to announce only static routes vrf=0
2022/03/22 20:54:22 BGP: [GNAYN-F5F1G] Computing addpath IDs for addpath type All
2022/03/22 20:54:22 BGP: [MNE5N-K0G4Z] Resetting peer 2602:fede:ca1:b::11 due to change in addpath config
2022/03/22 20:54:25 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2a0f:85c2:22:a:1:: in vrf default
2022/03/22 20:54:54 BGP: [MNE5N-K0G4Z] Resetting peer (null) due to change in addpath config
2022/03/22 20:54:56 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from 2602:fede:ca1:b::11 in vrf default

qlyoung commented 2 years ago

It's too hard to figure out what you are trying to show when you dump information this way. I've asked you to use the template repeatedly and you never do it. You need to use the template in order for others to make sense of the issues you're reporting.

liuxyon commented 2 years ago

The same problem has been reported on the mailing list:

Today's Topics:

  1. BGPD hanging in FRR 8.2.2 (Philip Smith)

Message: 1
Date: Sat, 2 Apr 2022 20:47:42 +0100
From: Philip Smith philip@nsrc.org
To: frog@lists.frrouting.org
Subject: [FROG] BGPD hanging in FRR 8.2.2
Message-ID: 54869a9a-07db-2033-cc16-c0b8a6612060@nsrc.org
Content-Type: text/plain; charset=UTF-8; format=flowed

Hi everyone,

Just following up on my previous note about BGPD hanging in FRR 8.2.2. I now have more info to share.

As background, I've got around 60 BGP feeds total in 30 different "views", to form a route collector for analysis work I'm doing of the global R&E routing table.

This hang seems to have a period of 5-7 days. Using FRR 8.2.2 on Ubuntu 20.04. Not had any issue with FRR 8.1.0; this only started with FRR 8.2.2.

The latest hang earlier today allowed a colleague to grab debug info which I hope will help.

/var/log/frr/frr.log shows entries like this:

Apr 2 11:46:42 frr watchfrr[52904]: [T58XM-TP956][EC 268435457] bgpd state -> unresponsive : no response yet to ping sent 90 seconds ago
Apr 2 11:46:42 frr watchfrr[52904]: [YFT0P-5Q5YX] Forked background command [pid 1674696]: /usr/lib/frr/watchfrr.sh restart bgpd
Apr 2 11:47:02 frr watchfrr[52904]: [ZE9RA-19PS5] restart bgpd child process 1674696 still running after 20 seconds, sending signal 15
Apr 2 11:47:02 frr watchfrr[52904]: [SK7QP-A2GT9] restart bgpd process 1674696 terminated due to signal 15

Apr 2 14:18:03 frr watchfrr[52904]: [YFT0P-5Q5YX] Forked background command [pid 1697956]: /usr/lib/frr/watchfrr.sh restart bgpd
Apr 2 14:18:23 frr watchfrr[52904]: [ZE9RA-19PS5] restart bgpd child process 1697956 still running after 20 seconds, sending signal 15
Apr 2 14:18:23 frr watchfrr[52904]: [SK7QP-A2GT9] restart bgpd process 1697956 terminated due to signal 15

which just repeat every 10 minutes or so. A few hours earlier I was getting:

Apr 1 22:53:19 frr bgpd[52925]: [YZRX4-ZXG0C][EC 100663315] Thread Starvation: {(thread *)0x5566a35c01a0 arg=0x556682b31da0 timer r=-5.940 bgp_announce_route_timer_expired() &paf->t_announce_route from bgpd/bgp_route.c:4763} was scheduled to pop greater than 4s ago
Apr 1 23:24:34 frr bgpd[52925]: [YZRX4-ZXG0C][EC 100663315] Thread Starvation: {(thread *)0x5567954b16c0 arg=0x556682f14870 timer r=-5.224 bgp_announce_route_timer_expired() &paf->t_announce_route from bgpd/bgp_route.c:4763} was scheduled to pop greater than 4s ago

Trying to connect by vtysh prints message of day, but never a command prompt. Same if trying to connect via telnet. The only way out is a kill -9 of the BGPD process, followed by a "systemctl restart frr".

The process stack for bgpd shows:

root@frr:~# cat /proc/52925/stack
[<0>] futex_wait_queue_me+0xbb/0x120
[<0>] futex_wait+0x105/0x290
[<0>] do_futex+0x157/0x4d0
[<0>] __x64_sys_futex+0x13f/0x170
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Thread debugging shows:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__pthread_clockjoin_ex (threadid=139670697043712, thread_return=0x0, clockid=, abstime=, block=) at pthread_join_common.c:145
145 pthread_join_common.c: No such file or directory.
(gdb) bt
#0 __pthread_clockjoin_ex (threadid=139670697043712, thread_return=0x0, clockid=, abstime=, block=) at pthread_join_common.c:145
#1 0x00007f07b1f3d985 in ?? () from /lib/x86_64-linux-gnu/librtr.so.0
#2 0x00007f07b1f38dc1 in rtr_mgr_stop () from /lib/x86_64-linux-gnu/librtr.so.0
#3 0x00007f07b1f53ef0 in ?? () from /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#4 0x00007f07b1f53f7d in ?? () from /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#5 0x00007f07b1f543ca in ?? () from /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so
#6 0x00007f07b2586621 in thread_call () from /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0
#7 0x00007f07b2540198 in frr_run () from /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0
#8 0x00005566800b6678 in main ()

I've got about 2.5Mbytes of strace which I'll happily unicast to whoever would like to have a look at it. It looks very repetitive/boring to my non-developer eye, like something's got stuck waiting for something else.

BTW, this is what's running (after I killed and restarted), including command line options:

1707406 ? S

ton31337 commented 2 years ago

I'll try to replicate and work on this.

ton31337 commented 2 years ago

@liuxyon could you enable debug rpki? Also, it would be useful to have the output of show memory | include RPKI (captured as late as possible before bgpd stops responding), plus ps aufx | grep bgpd and free -m.
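
Since the hang cannot be predicted, one way to get that output "as late as possible" is to collect it periodically. The following is a minimal sketch, assuming `vtysh`, `free` and `ps` are on the PATH of the user running it; the helper names (`run`, `keep`, `snapshot`) are hypothetical, and it does not enable `debug rpki`, which still has to be done in vtysh. The filtering is done in Python rather than with a shell pipe, and every command has a timeout so that a hung bgpd does not hang the collector as well.

```python
import subprocess
from datetime import datetime

def run(cmd):
    """Run a command and return its stdout as text; empty string if it fails or times out."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=30).stdout
    except (OSError, subprocess.TimeoutExpired):
        return ""

def keep(text, needle):
    """Mimic '| include' / '| grep': keep only the lines containing needle."""
    return "\n".join(line for line in text.splitlines() if needle in line)

def snapshot():
    """Collect the three requested outputs into one timestamped text block."""
    parts = [
        ("sh memory | include RPKI", keep(run(["vtysh", "-c", "show memory"]), "RPKI")),
        ("free -m", run(["free", "-m"]).rstrip()),
        ("ps aufx | grep bgpd", keep(run(["ps", "aufx"]), "bgpd")),
    ]
    stamp = datetime.now().isoformat(timespec="seconds")
    blocks = [f"******** ({title}) *******\n{body}" for title, body in parts]
    return "\n".join([f"==== {stamp} ===="] + blocks)

if __name__ == "__main__":
    # Print to stdout; redirect to a file from cron (for example hourly) so the
    # last snapshot before a hang is as recent as possible.
    print(snapshot())
```

Running it from cron with output appended to a file keeps a history, so the last entry before bgpd stops answering is the one to attach here.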

liuxyon commented 2 years ago

and work on this

Since version 8.2.2 is unusable for us, we have rolled all servers back to version 8.1.

ton31337 commented 2 years ago

I can't replicate this with 100k routes and two full RPKI validators (cache servers), but just found a memory leak (which might be a possible reason, don't know, that's why I asked for more details).

pfsinoz commented 2 years ago

@ton31337 I'm still staying with 8.2.2 and happy to help troubleshoot this. Will get you debug rpki etc when it next happens. For me, it's every 5 days this happens (sorry, we'll have to wait). Got 60 peers, probably 5 of them giving me full tables in v4 and v6, the rest just global R&E routes (which is about 20k IPv4 and 6k IPv6). Let me know if anything else needed.

ton31337 commented 2 years ago

@pfsinoz cool, let me know when you have more details (as I described in a previous comment).

pfsinoz commented 2 years ago

@ton31337 is the stack trace I have from the last hang of any use at all?

ton31337 commented 2 years ago

@ton31337 is the stack trace I have from the last hang of any use at all?

At least it's quite clear that it's RPKI-related...

pfsinoz commented 2 years ago

BTW, just for the record, this is what things look like with "situation normal":

******** (sh memory | include RPKI) *******
BGP RPKI Cache server         :  1222452 variable  49351728  2444538 109382928
BGP RPKI Cache server group   :        0    120           0        1       120
******** (free -m) ******
              total        used        free      shared  buff/cache   available
Mem:           9961        7030         308           1        2621        2613
Swap:             0           0           0
******** (ps aufx | grep bgpd) ******
root     1707406  0.0  0.0   8328  3036 ?        S<s  Apr02   0:36 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
frr      1707428  6.2 61.5 6587548 6278184 ?     S<sl Apr02 259:33 /usr/lib/frr/bgpd -d -F traditional -Z -M rpki

Now we just have to wait for the next hang - probably 3-4 days time.

ton31337 commented 2 years ago

@pfsinoz maybe you have more details about this?

pfsinoz commented 2 years ago

@ton31337 Frustratingly it has not hung since! I'm still waiting, still gathering the data every hour. I've had to restart the system once for another reason, but still no hang since. This is the latest snapshot, from about 40 minutes ago:

******** (sh memory | include RPKI) *******
BGP RPKI Cache server         :  1231005 variable  49697848  2476764 116684816
BGP RPKI Cache server group   :        0    120           0        1       120
******** (free -m) ******
              total        used        free      shared  buff/cache   available
Mem:           9960        7102         322           1        2535        2541
Swap:             0           0           0
******** (ps aufx | grep bgpd) ******
root         771  0.0  0.0   8332  3060 ?        S<s  Apr13   1:14 /usr/lib/frr/watchfrr -d -F traditional zebra bgpd staticd
frr          812  6.3 60.3 6511740 6154024 ?     S<sl Apr13 579:04 /usr/lib/frr/bgpd -d -F traditional -Z -M rpki

I've had a couple of instances where the sh ip bgp on a full feed has caused the clogin driven by my scripts to time out, but it's not repeatable.
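
For anyone comparing snapshots like the two posted in this thread, the RPKI rows can be parsed and subtracted mechanically. This is a minimal sketch; `parse_memory_line` is a hypothetical helper, and the column meanings (current count, size, total bytes, max count, max bytes) are an assumption about the `show memory` layout rather than something stated in this thread. Note also that, going by the start dates in the ps output (Apr02 vs Apr13), bgpd was restarted between the two snapshots, so the difference is not growth within a single process lifetime.

```python
def parse_memory_line(line):
    """Split one 'show memory' row into its name and numeric columns.

    Assumed column order: current count, size, total bytes, max count, max bytes.
    """
    name, _, rest = line.partition(":")
    count, size, total, max_count, max_bytes = rest.split()
    return {
        "name": name.strip(),
        "count": int(count),
        "size": size,  # 'variable' or a fixed size
        "total": int(total),
        "max_count": int(max_count),
        "max_bytes": int(max_bytes),
    }

# The "BGP RPKI Cache server" rows from the two snapshots in this thread.
before = parse_memory_line("BGP RPKI Cache server         :  1222452 variable  49351728  2444538 109382928")
after = parse_memory_line("BGP RPKI Cache server         :  1231005 variable  49697848  2476764 116684816")

delta_count = after["count"] - before["count"]
delta_bytes = after["total"] - before["total"]
print(f"{before['name']}: {delta_count:+d} allocations, {delta_bytes:+d} bytes")
```

With these two rows the difference is 8553 allocations and 346120 bytes.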

pfsinoz commented 2 years ago

@ton31337 just a quick update... FRR has been up and running for last 12 days now and not exhibited the hang issue. The full BGP feeds do pause for about 20-30 seconds when I do a "sh ip bgp" on them, but I can replicate that on other FRR versions too. I'm left wondering if there were any validator issues that perhaps led to "funny" VRPs being sent to FRR, but I can't even think what those might be. Just weird that the issue has seemingly gone away all by itself. I'm happy to test new/updated code if need be.

ton31337 commented 2 years ago

@pfsinoz thank you for the update. We are going to revert the latest connection-handling changes (workarounds), since the underlying problems are fixed in librtr itself (0.8.0). https://github.com/FRRouting/frr/pull/11138

You just have to make sure you have librtr version 0.8.0.

dylanjamesdev commented 2 months ago

I'm currently facing this exact issue, FRR continues to crash and not recover.