flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

high broker load from steady stream of disconnect messages aimed at down level client #6362

Open garlick opened 1 month ago

garlick commented 1 month ago

Problem: @grondo investigated high load on a rank 0 broker on a test cluster. A `flux overlay trace` revealed a steady stream of disconnect messages:

[  +8.042307]  tx * c disconnect 0 [0]
[  +8.042326]  tx * c disconnect 0 [0]
[  +8.042524]  tx * c disconnect 0 [0]
[  +8.042551]  tx * c disconnect 0 [0]
[  +8.042578]  tx * c disconnect 0 [0]
[  +8.042761]  tx * c disconnect 0 [0]
[  +8.042788]  tx * c disconnect 0 [0]
[  +8.043487]  tx * c disconnect 0 [0]
[  +8.043515]  tx * c disconnect 0 [0]
[  +8.043542]  tx * c disconnect 0 [0]

A stack trace of the spinning broker showed it sending `CONTROL_DISCONNECT` messages to a child from `child_cb()`:

(gdb) where
#0  __GI___libc_write (nbytes=8, buf=0xffffeda873e0, fd=<optimized out>)
    at ../sysdeps/unix/sysv/linux/write.c:26
#1  __GI___libc_write (fd=<optimized out>, buf=0xffffeda873e0, nbytes=8)
    at ../sysdeps/unix/sysv/linux/write.c:24
#2  0x0000ffff895ae8dc in ?? () from /lib/aarch64-linux-gnu/libzmq.so.5
#3  0x0000ffff895a56ec in ?? () from /lib/aarch64-linux-gnu/libzmq.so.5
#4  0x0000ffff895aba8c in ?? () from /lib/aarch64-linux-gnu/libzmq.so.5
#5  0x0000ffff895b5730 in ?? () from /lib/aarch64-linux-gnu/libzmq.so.5
#6  0x0000ffff895cb3f4 in zmq_send () from /lib/aarch64-linux-gnu/libzmq.so.5
#7  0x0000aaaac5ac7008 in zmqutil_msg_send_ex (sock=0xaaaad2d42a80, 
    msg=0xaaaad2dfbe60, nonblock=<optimized out>)
    at ../common/libzmqutil/msg_zsock.c:52
#8  0x0000aaaac5ab5fe8 in overlay_sendmsg_child (ov=0xaaaad2d39180, 
    msg=0xaaaad2dfbe60) at ./src/broker/overlay.c:805
#9  0x0000aaaac5ae6ad8 in overlay_control_child.constprop.0 (
    ov=0xaaaad2d39180, 
    uuid=0xaaaad2e1dfc0 "f982f794-27d4-464b-88f0-f41976ffdf24", status=0, 
    type=CONTROL_DISCONNECT) at ./src/broker/overlay.c:568
#10 0x0000aaaac5ab77e8 in child_cb (r=<optimized out>, w=<optimized out>, 
    revents=<optimized out>, arg=0xaaaad2d39180) at ./src/broker/overlay.c:1041
#11 0x0000aaaac5ac54a8 in check_cb (loop=0xffff896b24d8 <default_loop_struct>, 
    w=0xaaaad2d43f08, revents=<optimized out>)
    at ../common/libzmqutil/ev_zmq.c:79
#12 0x0000ffff89676504 in ev_invoke_pending (
    loop=0xffff896b24d8 <default_loop_struct>) at libev/ev.c:3770
#13 0x0000ffff8964f044 in ev_run (flags=0, loop=<optimized out>)
    at libev/ev.c:4190
#14 ev_run (flags=0, loop=<optimized out>) at libev/ev.c:4021
#15 flux_reactor_run (r=0xaaaad2d30f10, flags=flags@entry=0)
    at libflux/reactor.c:124
#16 0x0000aaaac5aadb08 in main (argc=<optimized out>, argv=<optimized out>)
    at ./src/broker/broker.c:529

There was a downrev (older version) broker in the system:

grondo@pi0:~$ flux version
commands:           0.58.0-92-g8d24e946f
libflux-core:       0.58.0-92-g8d24e946f
libflux-security:   0.10.0
build-options:      +systemd+hwloc==2.4.0+zmq==4.3.4

That broker's logs were filled with:

 DROP upstream control topic - : message received before hello handshake

Stopping the downrev broker ended the high load.

Restarting the broker did not bring the high load back.