FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/

Incorrect BGP packet may cause Zebra to hang #12725

Closed mwinter-osr closed 1 year ago

mwinter-osr commented 1 year ago

Found on Ubuntu with FRR master at commit aa16204dfbff (Jan 31). The issue does NOT exist in 8.4.

During testing, when an invalid BGP OPEN message is sent with the first octet of the marker field overwritten with 0, Zebra hangs: it no longer responds to any vtysh command and stops writing logs. Nothing is logged about the error itself.

TCP Payload for the BGP Open Message:

```
TCP:  00 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF   ................
TCP:  00 25 01 04 01 F5 00 5A C0 A8 01 01 08 02 06 01   .%.....Z........
TCP:  04 00 01 00 01                                    .....
```
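For readers who want to see what the payload above contains, here is a minimal Python sketch that decodes the fixed BGP header and the fixed part of the OPEN message, then checks the marker against the all-ones value that RFC 4271 requires. The function name `decode_open` and the dictionary keys are illustrative choices, not part of FRR or of the test tool.

```python
import ipaddress
import struct

# Bytes copied from the hex dump in the report (37 octets in total).
PAYLOAD = bytes.fromhex(
    "00" + "FF" * 15
    + "00 25 01 04 01 F5 00 5A C0 A8 01 01 08 02 06 01"
    + "04 00 01 00 01"
)

# RFC 4271 requires the 16-octet marker to be all ones.
EXPECTED_MARKER = b"\xff" * 16


def decode_open(data: bytes) -> dict:
    """Decode the BGP header and the fixed part of an OPEN message."""
    marker = data[:16]
    # Header: 2-octet length and 1-octet type follow the marker.
    length, msg_type = struct.unpack("!HB", data[16:19])
    # OPEN body: version, My AS, hold time, BGP identifier, opt param length.
    version, my_as, hold_time, bgp_id, opt_len = struct.unpack(
        "!BHH4sB", data[19:29]
    )
    return {
        "marker_ok": marker == EXPECTED_MARKER,
        "length": length,
        "type": msg_type,  # 1 means OPEN
        "version": version,
        "my_as": my_as,
        "hold_time": hold_time,
        "bgp_id": str(ipaddress.IPv4Address(bgp_id)),
        "opt_param_len": opt_len,
    }


if __name__ == "__main__":
    for key, value in decode_open(PAYLOAD).items():
        print(f"{key}: {value}")
```

Apart from the first marker octet, the message is a well-formed OPEN: the length field matches the 37 octets on the wire, and the 8 octets of optional parameters carry a single capability. Under RFC 4271, a receiver that sees an unexpected marker is expected to answer with a NOTIFICATION (Message Header Error, Connection Not Synchronized).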

The issue is reproducible with IxANVL test BGP4-13.1.
The bad commit (found by bisecting) is:

```
commit a0b937de428e14e869b8541f0b7810113d619c2e
Author: Stephen Worley <sworley@nvidia.com>
Date:   Fri Oct 21 12:45:50 2022 -0400

    bgpd,doc: limit InQ buf to allow for back pressure

    Add a default limit to the InQ for messages off the bgp peer
    socket. Make the limit configurable via cli.

    Adding in this limit causes the messages to be retained in the tcp
    socket and allow for tcp back pressure and congestion control to kick
    in.

    Before this change, we allow the InQ to grow indefinitely just taking
    messages off the socket and adding them to the fifo queue, never letting
    the kernel know we need to slow down. We were seeing under high loads of
    messages and large perf-heavy routemaps (regex matching) this queue
    would cause a memory spike and BGP would get OOM killed. Modifying this
    leaves the messages in the socket and distributes that load where it
    should be in the socket buffers on both send/recv while we handle the
    messages.

    Also, changes were made to allow the ringbuffer to hold messages and
    continue to be filled by the IO pthread while we wait for the Main
    pthread to handle the work on the InQ.
    [...]
```
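The back-pressure idea described in the commit message can be illustrated with a small Python sketch. This is not FRR's implementation (bgpd is written in C and uses a ring buffer shared between an IO pthread and the main pthread); the names `INQ_LIMIT` and `read_into_inq` and the fixed 8-byte message size are assumptions made only for this illustration.

```python
import socket
from collections import deque

INQ_LIMIT = 4  # hypothetical queue limit, chosen only for this demo


def read_into_inq(sock, inq, limit=INQ_LIMIT, msg_size=8):
    """Move fixed-size messages from the socket into the queue.

    Reading stops as soon as the queue holds `limit` entries, so any
    remaining data stays in the kernel socket buffer. Once that buffer
    fills up, TCP flow control slows the sender down.
    """
    moved = 0
    while len(inq) < limit:
        try:
            chunk = sock.recv(msg_size)
        except BlockingIOError:
            break  # nothing left to read right now
        if not chunk:
            break  # peer closed the connection
        inq.append(chunk)
        moved += 1
    return moved


if __name__ == "__main__":
    sender, receiver = socket.socketpair()
    receiver.setblocking(False)
    sender.sendall(b"x" * 80)  # ten 8-byte "messages"

    inq = deque()
    print("first pass moved:", read_into_inq(receiver, inq))
    inq.popleft()  # the consumer handles two messages
    inq.popleft()
    print("second pass moved:", read_into_inq(receiver, inq))
    sender.close()
    receiver.close()
```

The point of the design is that the reader never takes more off the socket than the consumer can hold. A side effect worth keeping in mind when reading this report is that a reader which stops draining the socket also depends on something re-arming it later.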

Thread info when attaching with GDB:

(gdb) thread apply all bt full

Thread 7 (Thread 0x7f11cf669700 (LWP 30024)):
#0  0x00007f11d6b33cb6 in __GI_ppoll (fds=0x625000043900, nfds=3, timeout=<optimized out>, sigmask=0x7f11cf668960) at ../sysdeps/unix/sysv/linux/ppoll.c:39
        resultvar = 18446744073709551102
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
        tval = {tv_sec = 108095737182464, tv_nsec = 3}
#1  0x00007f11d79ee3fc in ppoll () from /usr/lib/x86_64-linux-gnu/libasan.so.4
No symbol table info available.
#2  0x00007f11d74f232e in fd_poll (m=0x613000041340, timer_wait=0x0, eintr_p=0x7f11cf668b00) at lib/thread.c:946
        origsigs = {__val = {18446744067266838271, 0 <repeats 15 times>}}
        trash = "\263\212\265A\000\000\000\000\000\244^\327\021\177\000\000\"nN\327\021\177\000\000\200\023\004\000\060a\000\000A\000\000\000\000\000\000\000\000\214f\317\021\177\000\000\240\212f\317\021\177\000\000\\\321\354\071\342\017\000"
        count = 2
        timeout = -1
        num = 24800
        __func__ = <optimized out>
        ts = {tv_sec = 1168, tv_nsec = 717546}
        tsp = 0x0
#3  0x00007f11d74f7b0c in thread_fetch (m=0x613000041340, fetch=0x7f11cf668cf0) at lib/thread.c:1846
        thread = 0x0
        now = {tv_sec = 1168, tv_usec = 717544}
        zerotime = {tv_sec = 0, tv_usec = 0}
        tv = {tv_sec = 0, tv_usec = 17464308846986}
        tw = 0x0
        eintr_p = false
        num = 0
        __func__ = <optimized out>
#4  0x00007f11d73d8c5c in fpt_run (arg=0x60d00002f4b0) at lib/frr_pthread.c:308
        fpt = 0x60d00002f4b0
        sleeper = {41, 42}
        __func__ = <optimized out>
        task = {type = 4 '\004', add_type = 1 '\001', threaditem = {si = {next = 0x0}}, timeritem = {hi = {index = 0}}, ref = 0x620000005118, master = 0x613000041340, func = 0x5575b03b5ce7 <zserv_write>, 
          arg = 0x620000005080, u = {val = 38, fd = 38, sands = {tv_sec = 38, tv_usec = 0}}, real = {tv_sec = 1168, tv_usec = 717546}, hist = 0x60800002f7a0, yield = 10000, xref = 0x5575b06ea400 <_xref.23635>, 
          mtx = pthread_mutex_t = {Type = Normal, Status = Not acquired, Robust = No, Shared = No, Protocol = None}, ignore_timer_late = false}
#5  0x00007f11d73d7b23 in frr_pthread_inner (arg=0x60d00002f4b0) at lib/frr_pthread.c:158
        fpt = 0x60d00002f4b0
#6  0x00007f11d6e176db in start_thread (arg=0x7f11cf669700) at pthread_create.c:463
        pd = 0x7f11cf669700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139714470778624, -7682549099398598435, 139714470776576, 0, 140724148465408, 140724148465232, 7728604126545277149, 7728660258836279517}, mask_was_saved = 0}}, 
          priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#7  0x00007f11d6b4061f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.

Thread 6 (Thread 0x7f11cfe83700 (LWP 30023)):
#0  0x00007f11d6b33cb6 in __GI_ppoll (fds=0x625000039900, nfds=3, timeout=<optimized out>, sigmask=0x7f11cfe82960) at ../sysdeps/unix/sysv/linux/ppoll.c:39
        resultvar = 18446744073709551102
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
        tval = {tv_sec = 108095737141504, tv_nsec = 3}
#1  0x00007f11d79ee3fc in ppoll () from /usr/lib/x86_64-linux-gnu/libasan.so.4
No symbol table info available.
#2  0x00007f11d74f232e in fd_poll (m=0x613000040fc0, timer_wait=0x0, eintr_p=0x7f11cfe82b00) at lib/thread.c:946
        origsigs = {__val = {18446744067266838271, 0 <repeats 15 times>}}
        trash = "\263\212\265A\000\000\000\000\000\244^\327\021\177\000\000\"nN\327\021\177\000\000\000\020\004\000\060a\000\000\254\000\000\000\000\000\000\000\000,\350\317\021\177\000\000\240*\350\317\021\177\000\000\\\005\375\071\342\017\000"
        count = 2
        timeout = -1
        num = 24800
        __func__ = <optimized out>
        ts = {tv_sec = 5336, tv_nsec = 263358}
        tsp = 0x0
#3  0x00007f11d74f7b0c in thread_fetch (m=0x613000040fc0, fetch=0x7f11cfe82cf0) at lib/thread.c:1846
        thread = 0x0
        now = {tv_sec = 5336, tv_usec = 263356}
        zerotime = {tv_sec = 0, tv_usec = 0}
        tv = {tv_sec = 0, tv_usec = 17464309908874}
        tw = 0x0
        eintr_p = false
        num = 0
        __func__ = <optimized out>
#4  0x00007f11d73d8c5c in fpt_run (arg=0x60d00002f3e0) at lib/frr_pthread.c:308
        fpt = 0x60d00002f3e0
        sleeper = {36, 37}
        __func__ = <optimized out>
        task = {type = 4 '\004', add_type = 1 '\001', threaditem = {si = {next = 0x0}}, timeritem = {hi = {index = 0}}, ref = 0x620000004118, master = 0x613000040fc0, func = 0x5575b03b5ce7 <zserv_write>, 
          arg = 0x620000004080, u = {val = 33, fd = 33, sands = {tv_sec = 33, tv_usec = 0}}, real = {tv_sec = 5336, tv_usec = 263358}, hist = 0x60800002f6a0, yield = 10000, xref = 0x5575b06ea400 <_xref.23635>, 
          mtx = pthread_mutex_t = {Type = Normal, Status = Not acquired, Robust = No, Shared = No, Protocol = None}, ignore_timer_late = false}
#5  0x00007f11d73d7b23 in frr_pthread_inner (arg=0x60d00002f3e0) at lib/frr_pthread.c:158
        fpt = 0x60d00002f3e0
#6  0x00007f11d6e176db in start_thread (arg=0x7f11cfe83700) at pthread_create.c:463
        pd = 0x7f11cfe83700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139714479273728, -7682549099398598435, 139714479271680, 0, 140724148465408, 140724148465232, 7728605340947280093, 7728660258836279517}, mask_was_saved = 0}}, 
          priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#7  0x00007f11d6b4061f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.

Thread 5 (Thread 0x7f11d069c700 (LWP 29999)):
#0  0x00007f11d6b33cb6 in __GI_ppoll (fds=0x62500002f900, nfds=3, timeout=<optimized out>, sigmask=0x7f11d069b960) at ../sysdeps/unix/sysv/linux/ppoll.c:39
        resultvar = 18446744073709551102
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
        tval = {tv_sec = 108095737100544, tv_nsec = 3}
#1  0x00007f11d79ee3fc in ppoll () from /usr/lib/x86_64-linux-gnu/libasan.so.4
No symbol table info available.
#2  0x00007f11d74f232e in fd_poll (m=0x613000040c40, timer_wait=0x0, eintr_p=0x7f11d069bb00) at lib/thread.c:946
        origsigs = {__val = {18446744067266838271, 0 <repeats 15 times>}}
        trash = "\263\212\265A\000\000\000\000\000\244^\327\021\177\000\000\"nN\327\021\177\000\000\200\f\004\000\060a\000\000h\000\000\000\000\000\000\000\000\274i\320\021\177\000\000\240\272i\320\021\177\000\000\\7\r:\342\017\000"
        count = 2
        timeout = -1
        num = 24800
        __func__ = <optimized out>
        ts = {tv_sec = 4822, tv_nsec = 631881}
        tsp = 0x0
#3  0x00007f11d74f7b0c in thread_fetch (m=0x613000040c40, fetch=0x7f11d069bcf0) at lib/thread.c:1846
        thread = 0x0
        now = {tv_sec = 4822, tv_usec = 631879}
        zerotime = {tv_sec = 0, tv_usec = 0}
        tv = {tv_sec = 0, tv_usec = 17464310970250}
        tw = 0x0
        eintr_p = false
        num = 0
        __func__ = <optimized out>
#4  0x00007f11d73d8c5c in fpt_run (arg=0x60d00002c660) at lib/frr_pthread.c:308
        fpt = 0x60d00002c660
        sleeper = {31, 32}
        __func__ = <optimized out>
        task = {type = 4 '\004', add_type = 1 '\001', threaditem = {si = {next = 0x0}}, timeritem = {hi = {index = 0}}, ref = 0x620000003118, master = 0x613000040c40, func = 0x5575b03b5ce7 <zserv_write>, 
          arg = 0x620000003080, u = {val = 15, fd = 15, sands = {tv_sec = 15, tv_usec = 0}}, real = {tv_sec = 4822, tv_usec = 631881}, hist = 0x60800002f520, yield = 10000, xref = 0x5575b06ea400 <_xref.23635>, 
          mtx = pthread_mutex_t = {Type = Normal, Status = Not acquired, Robust = No, Shared = No, Protocol = None}, ignore_timer_late = false}
#5  0x00007f11d73d7b23 in frr_pthread_inner (arg=0x60d00002c660) at lib/frr_pthread.c:158
        fpt = 0x60d00002c660
#6  0x00007f11d6e176db in start_thread (arg=0x7f11d069c700) at pthread_create.c:463
        pd = 0x7f11d069c700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139714487764736, -7682549099398598435, 139714487762688, 0, 140724148465408, 140724148465232, 7728663628485325021, 7728660258836279517}, mask_was_saved = 0}}, 
          priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#7  0x00007f11d6b4061f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.

Thread 4 (Thread 0x7f11d0ec5700 (LWP 29993)):
#0  0x00007f11d6b33cb6 in __GI_ppoll (fds=0x625000023100, nfds=2, timeout=<optimized out>, sigmask=0x7f11d0ec4960) at ../sysdeps/unix/sysv/linux/ppoll.c:39
        resultvar = 18446744073709551102
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
        tval = {tv_sec = 108095737049344, tv_nsec = 2}
#1  0x00007f11d79ee3fc in ppoll () from /usr/lib/x86_64-linux-gnu/libasan.so.4
No symbol table info available.
#2  0x00007f11d74f232e in fd_poll (m=0x6130000408c0, timer_wait=0x0, eintr_p=0x7f11d0ec4b00) at lib/thread.c:946
        origsigs = {__val = {18446744067266838271, 0 <repeats 15 times>}}
        trash = "\263\212\265A\000\000\000\000\000\244^\327\021\177\000\000\"nN\327\021\177", '\000' <repeats 19 times>, "L\354\320\021\177\000\000\240J\354\320\021\177\000\000\\\211\035:\342\017\000"
        count = 1
        timeout = -1
        num = 0
        __func__ = <optimized out>
        ts = {tv_sec = 1142, tv_nsec = 678575}
        tsp = 0x0
#3  0x00007f11d74f7b0c in thread_fetch (m=0x6130000408c0, fetch=0x7f11d0ec4cf0) at lib/thread.c:1846
        thread = 0x0
        now = {tv_sec = 1142, tv_usec = 678573}
        zerotime = {tv_sec = 0, tv_usec = 0}
        tv = {tv_sec = 0, tv_usec = 17464312039818}
        tw = 0x0
        eintr_p = false
        num = 0
        __func__ = <optimized out>
#4  0x00007f11d73d8c5c in fpt_run (arg=0x60d0000299b0) at lib/frr_pthread.c:308
        fpt = 0x60d0000299b0
        sleeper = {22, 23}
        __func__ = <optimized out>
        task = {type = 4 '\004', add_type = 3 '\003', threaditem = {si = {next = 0x0}}, timeritem = {hi = {index = 0}}, ref = 0x5575b07f29a8 <zo_info+40>, master = 0x6130000408c0, 
          func = 0x5575b02fbc07 <process_messages>, arg = 0x0, u = {val = 0, fd = 0, sands = {tv_sec = 0, tv_usec = 0}}, real = {tv_sec = 1142, tv_usec = 678575}, hist = 0x60800002ef20, yield = 10000, 
          xref = 0x5575b06d3be0 <_xref.20217>, mtx = pthread_mutex_t = {Type = Normal, Status = Not acquired, Robust = No, Shared = No, Protocol = None}, ignore_timer_late = false}
#5  0x00007f11d73d7b23 in frr_pthread_inner (arg=0x60d0000299b0) at lib/frr_pthread.c:158
        fpt = 0x60d0000299b0
#6  0x00007f11d6e176db in start_thread (arg=0x7f11d0ec5700) at pthread_create.c:463
        pd = 0x7f11d0ec5700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139714496321280, -7682549099398598435, 139714496319232, 0, 140724148466800, 140724148466624, 7728664683436667101, 7728660258836279517}, mask_was_saved = 0}}, 
          priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#7  0x00007f11d6b4061f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.

Thread 3 (Thread 0x7f11d16fe700 (LWP 29992)):
#0  0x00007f11d6b33cb6 in __GI_ppoll (fds=0x625000019100, nfds=3, timeout=<optimized out>, sigmask=0x7f11d16fd960) at ../sysdeps/unix/sysv/linux/ppoll.c:39
        resultvar = 18446744073709551102
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
        tval = {tv_sec = 108095737008384, tv_nsec = 3}
#1  0x00007f11d79ee3fc in ppoll () from /usr/lib/x86_64-linux-gnu/libasan.so.4
No symbol table info available.
#2  0x00007f11d74f232e in fd_poll (m=0x613000040380, timer_wait=0x0, eintr_p=0x7f11d16fdb00) at lib/thread.c:946
        origsigs = {__val = {18446744067266838271, 0 <repeats 15 times>}}
        trash = "\263\212\265A\000\000\000\000\000\244^\327\021\177\000\000\"nN\327\021\177\000\000\000\000\000\000\000\000\000\000\331\006\000\000\000\000\000\000\000\334o\321\021\177\000\000\240\332o\321\021\177\000\000\\\373-:\342\017\000"
        count = 2
        timeout = -1
        num = 24800
        __func__ = <optimized out>
        ts = {tv_sec = 5441, tv_nsec = 404021}
        tsp = 0x0
#3  0x00007f11d74f7b0c in thread_fetch (m=0x613000040380, fetch=0x7f11d16fdcf0) at lib/thread.c:1846
        thread = 0x0
        now = {tv_sec = 5441, tv_usec = 404018}
        zerotime = {tv_sec = 0, tv_usec = 0}
        tv = {tv_sec = 0, tv_usec = 17464313117578}
        tw = 0x0
        eintr_p = false
        num = 0
        __func__ = <optimized out>
#4  0x00007f11d73d8c5c in fpt_run (arg=0x60d0000298e0) at lib/frr_pthread.c:308
        fpt = 0x60d0000298e0
        sleeper = {18, 19}
        __func__ = <optimized out>
        task = {type = 4 '\004', add_type = 3 '\003', threaditem = {si = {next = 0x0}}, timeritem = {hi = {index = 0}}, ref = 0x5575b07f2538 <zdplane_info+312>, master = 0x613000040380, 
          func = 0x5575b0286195 <dplane_thread_loop>, arg = 0x0, u = {val = 0, fd = 0, sands = {tv_sec = 0, tv_usec = 0}}, real = {tv_sec = 5441, tv_usec = 404021}, hist = 0x60800002eda0, yield = 10000, 
          xref = 0x5575b06c5c20 <_xref.27137>, mtx = pthread_mutex_t = {Type = Normal, Status = Not acquired, Robust = No, Shared = No, Protocol = None}, ignore_timer_late = false}
#5  0x00007f11d73d7b23 in frr_pthread_inner (arg=0x60d0000298e0) at lib/frr_pthread.c:158
        fpt = 0x60d0000298e0
#6  0x00007f11d6e176db in start_thread (arg=0x7f11d16fe700) at pthread_create.c:463
        pd = 0x7f11d16fe700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139714504943360, -7682549099398598435, 139714504941312, 0, 140724148466864, 140724148466688, 7728661381143687389, 7728660258836279517}, mask_was_saved = 0}}, 
          priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#7  0x00007f11d6b4061f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.

Thread 2 (Thread 0x7f11d1eff700 (LWP 29991)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
No locals.
#1  0x00007f11d74b082f in sys_futex (addr1=0x7f11d79433e0 <rcu_seq>, op=0, val1=3, timeout=0x0, addr2=0x0, val3=0) at lib/seqlock.c:53
No locals.
#2  0x00007f11d74b0c43 in seqlock_wait (sqlo=0x7f11d79433e0 <rcu_seq>, val=1) at lib/seqlock.c:153
        cur = 1
        cal = 4294967295
        __func__ = <optimized out>
#3  0x00007f11d73d6334 in rcu_main (arg=0x0) at lib/frrcu.c:429
        rt = 0x7f11d2f00000
        rh = 0x0
        end = false
        maxwait = {tv_sec = 0, tv_nsec = 2}
        rcuval = 1
        __func__ = <optimized out>
#4  0x00007f11d6e176db in start_thread (arg=0x7f11d1eff700) at pthread_create.c:463
        pd = 0x7f11d1eff700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139714513336064, -7682549099398598435, 139714513334016, 0, 140724148466400, 140724148466224, 7728662479044702429, 7728660258836279517}, mask_was_saved = 0}}, 
          priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#5  0x00007f11d6b4061f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.

Thread 1 (Thread 0x7f11d8b20800 (LWP 29990)):
#0  0x00007f11d6b33cb6 in __GI_ppoll (fds=0x625000007900, nfds=7, timeout=<optimized out>, sigmask=0x7ffce4e17b10) at ../sysdeps/unix/sysv/linux/ppoll.c:39
        resultvar = 18446744073709551102
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
        tval = {tv_sec = 108095736936704, tv_nsec = 7}
#1  0x00007f11d79ee3fc in ppoll () from /usr/lib/x86_64-linux-gnu/libasan.so.4
No symbol table info available.
#2  0x00007f11d74f232e in fd_poll (m=0x613000000200, timer_wait=0x0, eintr_p=0x7ffce4e17cb0) at lib/thread.c:946
        origsigs = {__val = {0 <repeats 16 times>}}
        trash = "\360|\341\344\374\177\000\000\000\002\000\000\060a\000\000\"nN", '\000' <repeats 13 times>, "\263\212\265A\000\000\000\000\260}\341\344\374\177\000\000P|\341\344\374\177\000\000\222/\234\234\377\017\000"
        count = 6
        timeout = -1
        num = 0
        __func__ = <optimized out>
        ts = {tv_sec = 53985, tv_nsec = 248084}
        tsp = 0x0
#3  0x00007f11d74f7b0c in thread_fetch (m=0x613000000200, fetch=0x7ffce4e17e20) at lib/thread.c:1846
        thread = 0x0
        now = {tv_sec = 53985, tv_usec = 266486}
        zerotime = {tv_sec = 0, tv_usec = 0}
        tv = {tv_sec = 17, tv_usec = 290946}
        tw = 0x0
        eintr_p = false
        num = 0
        __func__ = <optimized out>
#4  0x00007f11d74072be in frr_run (master=0x613000000200) at lib/libfrr.c:1197
        instanceinfo = '\000' <repeats 63 times>
        __func__ = "frr_run"
        thread = {type = 4 '\004', add_type = 0 '\000', threaditem = {si = {next = 0x0}}, timeritem = {hi = {index = 0}}, ref = 0x61500003a448, master = 0x613000000200, func = 0x5575b01f3b34 <kernel_read>, 
          arg = 0x61500003a280, u = {val = 10, fd = 10, sands = {tv_sec = 10, tv_usec = 0}}, real = {tv_sec = 53985, tv_usec = 248084}, hist = 0x60800002c7a0, yield = 10000, xref = 0x5575b06b7c20 <_xref.22963>, 
          mtx = pthread_mutex_t = {Type = Normal, Status = Not acquired, Robust = No, Shared = No, Protocol = None}, ignore_timer_late = false}
#5  0x00005575b01ffe41 in main (argc=2, argv=0x7ffce4e181e8) at zebra/main.c:476
        zserv_path = 0x0
        dummy = {ss_family = 65535, 
          __ss_padding = '\377' <repeats 14 times>, "\000\000\000\000\000\000\000\000\000\000\377\377\377\377\377\377\000\000\000\000\000\000\000\000\000\070TD\224\267\254\366\000\000\000\000\000\000\000\000\225\000\000\000\000\000\000\000x\323j\260uU\000\000\002\000\000\000\000\000\000\000\350\201\341\344\374\177\000\000\000\202\341\344\374\177\000\000\260\200\341\344\374\177\000\000ͷW\327\021\177\000\000x\323j\260uU\000", __ss_align = 93963959492384}
        dummylen = 3771907175
        asic_offload = false
        notify_on_ack = true
        __func__ = <optimized out>
(gdb)

mwinter-osr commented 1 year ago

Here is a way to reproduce the issue:

Set up a simple network with two boxes: one running FRR (DUT) and one running plain Linux for the test tool (TESTER):

+-----------+                    +------------+
|           |   192.168.1.0/24   |            |
|  TESTER   +--------------------+   DUT      |
|           | .1            .101 |            |
+-----------+                    +------------+

Configure the interface on the TESTER side as 192.168.1.1/24.

Start zebra, staticd and bgpd on the DUT and apply the following config:

Current configuration:
!
frr version 8.5-dev-20230131211350-git.aa16204
frr defaults traditional
hostname bgp-marker-dut
log file /tmp/frr.log
no ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
debug zebra events
debug bgp keepalives
debug bgp neighbor-events
debug bgp zebra
!
interface ens1
 ip address 192.168.1.101/24
exit
!
router bgp 501
 neighbor 192.168.1.1 remote-as 500
exit
!
end

Now build BGPTOOL (https://git-us.netdef.org/scm/netdef/bgptool.git) and run the executable test_bgp_bad-open-message_marker.

Approximately 10 to 15 seconds later, zebra hangs (as seen when running vtysh commands).
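If BGPTOOL is not at hand, a rough stand-in for the test can be sketched in Python. This is an assumption-laden sketch, not the BGPTOOL test itself: it only opens a TCP session to the DUT on port 179 and sends an OPEN whose first marker octet is 0, with default field values copied from the payload captured earlier in this report. The function names are hypothetical, and whether this alone triggers the hang has not been verified.

```python
import socket
import struct


def build_bad_open(my_as=501, hold_time=90, bgp_id="192.168.1.1"):
    """Build a BGP OPEN whose first marker octet is 0 instead of 0xFF."""
    marker = b"\x00" + b"\xff" * 15
    # One optional parameter: capability (type 2) carrying the
    # multiprotocol capability (code 1) for AFI 1 / SAFI 1.
    caps = bytes([2, 6, 1, 4, 0, 1, 0, 1])
    body = struct.pack(
        "!BHH4sB", 4, my_as, hold_time, socket.inet_aton(bgp_id), len(caps)
    ) + caps
    length = 16 + 2 + 1 + len(body)  # marker + length + type + body
    return marker + struct.pack("!HB", length, 1) + body  # type 1 = OPEN


def send_bad_open(dut="192.168.1.101", port=179, timeout=5.0):
    """Send the malformed OPEN to the DUT and return whatever comes back."""
    with socket.create_connection((dut, port), timeout=timeout) as sock:
        sock.sendall(build_bad_open())
        try:
            return sock.recv(4096)
        except (socket.timeout, ConnectionError):
            return b""


if __name__ == "__main__":
    reply = send_bad_open()
    print("reply from DUT:", reply.hex())
```

Run it from the TESTER box with the addressing shown in the diagram. After it returns, check on the DUT whether vtysh commands still get an answer.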

donaldsharp commented 1 year ago

In my testing, it does not look like zebra becomes unresponsive; bgpd does.

2023-02-02 13:27:24.016 [INFO] watchfrr: [YFT0P-5Q5YX] Forked background command [pid 2488116]: /usr/lib/frr/watchfrr.sh restart bgpd

The show thread cpu command is stalling on bgpd.