I'm not sure the core dump would have been helpful in this case, but I did note that it was truncated at the 2GB coredump.conf(5) limit (according to GDB, the actual core size would have been 11GB).
[root@elcap1:garlick]# coredumpctl info
PID: 3572964 (flux-broker-0)
UID: 765 (flux)
GID: 753 (flux)
Signal: 6 (ABRT)
Timestamp: Mon 2024-03-04 17:55:31 PST (17h ago)
Command Line: broker --config-path=/etc/flux/system/conf.d -Scron.directory=/etc/flux/system/cron.d -Srundir=/run/flux -Sstatedir=/var/lib>
Executable: /usr/libexec/flux/cmd/flux-broker
Control Group: /system.slice/flux.service
Unit:
Slice: system.slice
Boot ID: e04e401b8fa8485393a40c78f1c01306
Machine ID: 9378f9a987e84e6c9194c1738fd5efea
Hostname: elcap1
Storage: /var/lib/systemd/coredump/core.flux-broker-0.765.e04e401b8fa8485393a40c78f1c01306.3572964.1709603731000000.lz4 (truncated)
Message: Process 3572964 (flux-broker-0) of user 765 dumped core.
Stack trace of thread 3573653:
#0 0x00007ffff61e9acf n/a (n/a)
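In case it's useful, here is a rough sketch of pulling the stored (albeit truncated) core into gdb; the PID is the one from the coredumpctl output above, and coredumpctl handles the lz4 decompression itself:
[root@elcap1:garlick]# coredumpctl dump 3572964 --output=core.3572964
[root@elcap1:garlick]# gdb /usr/libexec/flux/cmd/flux-broker core.3572964
(gdb) thread apply all bt
or, equivalently, coredumpctl gdb 3572964 to drop straight into gdb.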
We've had several crashes lately on that system.
[root@elcap1:systemd]# flux startlog
2023-08-08 10:40 - 2023-08-08 10:40 (0.86s)
2023-08-08 10:49 - 2023-08-31 05:21 (23d)
2023-08-31 07:38 - 2023-09-25 08:46 (25d)
2023-09-25 08:46 - 2023-12-10 17:08 (76d)
2023-12-10 17:34 - 2023-12-15 08:41 (4.6d)
2023-12-15 08:42 - crashed
2024-01-12 14:45 - 2024-01-23 11:29 (11d)
2024-01-23 12:13 - crashed
2024-02-12 11:02 - crashed
2024-02-13 12:24 - crashed
2024-02-15 10:57 - crashed
2024-02-16 21:49 - crashed
2024-02-23 16:07 - 2024-02-29 11:29 (5.8d)
2024-02-29 11:51 - crashed
2024-02-29 12:26 - crashed
2024-03-01 17:27 - crashed
2024-03-01 17:29 - crashed
2024-03-01 18:19 - crashed
2024-03-02 14:13 - crashed
2024-03-02 17:24 - crashed
2024-03-04 17:58 - running (18h)
The earliest log in the journal is Mar 03 20:00:55 (due to #5752?), so it's a bit hard to tell whether the same issue has been happening (or whether there are other clues).
Another failure, no core captured:
$ flux startlog
[snip]
2024-03-05 15:06 - 2024-03-05 16:22 (1.3h)
2024-03-05 16:23 - crashed
2024-03-05 20:07 - running (9.7h)
# journalctl -u flux
[snip]
Mar 05 20:07:03 elcap1 flux[3768674]: malloc_consolidate(): unaligned fastbin chunk detected
Mar 05 20:07:04 elcap1 systemd[1]: flux.service: Main process exited, code=killed, status=6/ABRT
Mar 05 20:07:04 elcap1 systemd[1]: flux.service: Failed with result 'signal'.
Another crash, this one immediately preceded by a bunch of imp kill errors:
Mar 07 14:44:24 elcap1 flux[4035212]: job-exec.info[0]: elcap212 (rank 72): imp kill: flux-imp: Fatal: kill: failed to initialize pid info: No such file or directory
Mar 07 14:44:24 elcap1 flux[4035212]: malloc_consolidate(): unaligned fastbin chunk detected
Mar 07 14:44:31 elcap1 systemd[1]: flux.service: Main process exited, code=killed, status=6/ABRT
No core this time - my changes to the core dump limits had been reverted, so it was truncated again.
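For reference, the truncation comes from systemd-coredump's 2G defaults. A sketch of the settings that would need bumping in /etc/systemd/coredump.conf to capture an ~11GB broker core (the exact values here are a guess, and I believe systemd-coredump re-reads this file on each crash, so no service restart should be needed):
[Coredump]
ProcessSizeMax=16G
ExternalSizeMax=16G
Note that MaxUse= may also cap how much total core storage is kept around.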
I was able to reproduce memory corruption (though not on the rank 0 broker) by running a 512-broker instance across 64 nodes with a flat tbon topology (-Stbon.topo=kary:512), e.g.
$ flux run -o exit-timeout=none -o pty.interactive -N64 -n 512 src/cmd/flux start -o,-Stbon.topo=kary:800 flux submit --cc=1-8000 --watch --progress --jps -n4 hostname
In my testing this hits random brokers (though not rank 0). I got two crashes from the above workload before it finished:
flux-job: task(s) exited with exit code 139
malloc_consolidate(): invalid chunk size
free(): corrupted unsorted chunks
The stack traces, though these were likely victims and not the cause:
(gdb) where
#0 0x0000155553dedacf in raise () from /lib64/libc.so.6
#1 0x0000155553dc0ea5 in abort () from /lib64/libc.so.6
#2 0x0000155553e2ecc7 in __libc_message () from /lib64/libc.so.6
#3 0x0000155553e35fcc in malloc_printerr () from /lib64/libc.so.6
#4 0x0000155553e38100 in _int_free () from /lib64/libc.so.6
#5 0x000000000042389e in flux_msg_destroy (msg=0x547710) at message.c:118
#6 flux_msg_destroy (msg=msg@entry=0x547710) at message.c:111
#7 0x00001555554e7d6a in handle_cb (r=<optimized out>, hw=<optimized out>,
revents=<optimized out>, arg=<optimized out>) at msg_handler.c:540
#8 0x0000155555511333 in ev_invoke_pending (loop=0x155530002ca0) at ev.c:3770
#9 0x0000155555514c05 in ev_run (flags=0, loop=0x155530002ca0) at ev.c:4190
#10 ev_run (loop=0x155530002ca0, flags=0) at ev.c:4021
#11 0x00001555554e713b in flux_reactor_run (r=0x155530002c80,
flags=flags@entry=0) at reactor.c:124
#12 0x000015555544ab73 in mod_main (h=0x155530000be0, argc=<optimized out>,
argv=<optimized out>) at kvs.c:3049
#13 0x00000000004109f6 in module_thread (arg=0x4e8220) at module.c:208
#14 0x000015555510d1ca in start_thread () from /lib64/libpthread.so.0
#15 0x0000155553dd8e73 in clone () from /lib64/libc.so.6
(gdb) where
#0 fzhashx_next (self=0x4dc3f0) at zhashx.c:613
#1 0x00000000004489bd in fzhashx_first (self=<optimized out>) at zhashx.c:584
#2 0x000000000044485e in subhash_topic_match (sh=0x4dc3c0,
topic=0x55ce00 "kvs.namespace-primary-setroot") at subhash.c:103
#3 0x0000000000411811 in module_event_cast (p=0x4dc0f0,
msg=msg@entry=0x539c40) at module.c:635
#4 0x0000000000411db8 in modhash_event_mcast (mh=0x48e910,
msg=msg@entry=0x539c40) at modhash.c:191
#5 0x000000000040e660 in handle_event (ctx=ctx@entry=0x7fffffffa140,
msg=0x539c40) at broker.c:1816
#6 0x000000000040fe1b in overlay_recv_cb (msg=0x7fffffff9d00,
where=OVERLAY_UPSTREAM, arg=0x7fffffffa140) at broker.c:1747
#7 0x0000000000413aed in parent_cb (r=<optimized out>, w=<optimized out>,
revents=<optimized out>, arg=0x4ace00) at overlay.c:1061
#8 0x0000000000420f17 in check_cb (loop=0x15555554a180 <default_loop_struct>,
w=0x4b56e8, revents=<optimized out>) at ev_zmq.c:79
#9 0x0000155555511333 in ev_invoke_pending (
loop=0x15555554a180 <default_loop_struct>) at ev.c:3770
#10 0x0000155555514c05 in ev_run (flags=0,
loop=0x15555554a180 <default_loop_struct>) at ev.c:4190
#11 ev_run (loop=0x15555554a180 <default_loop_struct>, flags=0) at ev.c:4021
#12 0x00001555554e713b in flux_reactor_run (r=0x492970, flags=flags@entry=0)
at reactor.c:124
#13 0x000000000040dd83 in main (argc=<optimized out>, argv=<optimized out>)
at broker.c:532
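Since those traces are likely just victims, one thing that might catch the corrupting write closer to its source is glibc's malloc debugging. A sketch, assuming the environment propagates from flux run to the brokers it launches (it will slow things down, and on newer glibc MALLOC_CHECK_ additionally requires the malloc-debug preload):
$ MALLOC_CHECK_=3 MALLOC_PERTURB_=165 flux run -o exit-timeout=none -o pty.interactive -N64 -n 512 src/cmd/flux start -o,-Stbon.topo=kary:800 flux submit --cc=1-8000 --watch --progress --jps -n4 hostname
MALLOC_CHECK_=3 makes glibc print a diagnostic and abort at the first heap inconsistency it notices, and MALLOC_PERTURB_ poisons freed memory so use-after-free tends to surface sooner.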
This might be a false alarm. I noticed that all the memory corruption crashes were on the same node:
$ ls *.core
corona265-flux-broker-425-219062.core corona265-flux-broker-430-245463.core
corona265-flux-broker-426-245426.core corona265-flux-broker-431-225521.core
corona265-flux-broker-428-232894.core corona265-flux-broker-431-245473.core
When I excluded that node from my tests, I was able to run several times and up to 756 brokers without error. :confused:
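A quick (hypothetical) way to tally cores per node from those filenames, just to confirm the pattern:
$ ls *.core | cut -d- -f1 | sort | uniq -c
which here shows all six from corona265.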
This may have been fixed by #5803, but we've not been able to get a non-truncated core file on that system yet, so we can't be sure.
Shall we close this issue and open a new one if we start seeing crashes on el cap with the newest flux-core?
Sounds good to me (I'll go ahead and close).
Yesterday afternoon between 17:55 and 17:56, Brian Behlendorf submitted a bunch of jobs on elcap and they all failed with
flux-run: ERROR: No service matching job-ingest.submit is registered
From journalctl -u flux, it looks like the flux broker crashed and restarted around that time. Is there anything else I should look into?
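One thing that might help narrow it down is pulling the flux unit's journal around that window and checking whether the broker logged another malloc error right before the restart, and whether a core was stored, e.g.:
# journalctl -u flux --since yesterday
# coredumpctl list | grep flux-broker
If it aborted in malloc again, it's probably the same corruption discussed above.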