flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

el cap rank 0 broker crashes with malloc(): unaligned fastbin chunk detected #5771

Closed by jameshcorbett 8 months ago

jameshcorbett commented 9 months ago

Yesterday afternoon, between 17:55 and 17:56, Brian Behlendorf submitted a bunch of jobs on elcap, and they all failed with flux-run: ERROR: No service matching job-ingest.submit is registered.

It looks like the flux broker crashed and restarted around that time? From journalctl -u flux:

Mar 04 17:55:31 elcap1 flux[3572964]: malloc(): unaligned fastbin chunk detected
Mar 04 17:55:36 elcap1 systemd[1]: flux.service: Main process exited, code=killed, status=6/ABRT
Mar 04 17:55:36 elcap1 systemd[1]: flux.service: Failed with result 'signal'.
Mar 04 17:55:41 elcap1 systemd[1]: flux.service: Service RestartSec=5s expired, scheduling restart.
Mar 04 17:55:41 elcap1 systemd[1]: flux.service: Scheduled restart job, restart counter is at 7.
Mar 04 17:55:41 elcap1 systemd[1]: Stopped Flux message broker.
Mar 04 17:55:41 elcap1 systemd[1]: Starting Flux message broker...
Mar 04 17:55:42 elcap1 flux[2353244]: broker.info[0]: start: none->join 0.842524ms
Mar 04 17:55:42 elcap1 flux[2353244]: broker.info[0]: parent-none: join->init 0.087773ms
Mar 04 17:55:42 elcap1 systemd[1]: Started Flux message broker.
Mar 04 17:58:48 elcap1 flux[2353244]: kvs.info[0]: restored KVS from checkpoint on 2024-03-05T01:54:29Z
Mar 04 17:58:51 elcap1 flux[2353244]: kvs.info[0]: tagpool expanded from 1024 to 2048 entries
Mar 04 17:58:51 elcap1 flux[2353244]: kvs.info[0]: tagpool expanded from 2048 to 4096 entries
Mar 04 17:58:51 elcap1 flux[2353244]: kvs.info[0]: tagpool expanded from 4096 to 8192 entries
Mar 04 17:58:51 elcap1 flux[2353244]: content.info[0]: tagpool expanded from 1024 to 2048 entries
Mar 04 17:58:51 elcap1 flux[2353244]: content.info[0]: tagpool expanded from 2048 to 4096 entries
Mar 04 17:58:51 elcap1 flux[2353244]: kvs.info[0]: tagpool expanded from 8192 to 16384 entries
Mar 04 17:58:51 elcap1 flux[2353244]: content.info[0]: tagpool expanded from 4096 to 8192 entries
Mar 04 17:58:51 elcap1 flux[2353244]: content.info[0]: tagpool expanded from 8192 to 16384 entries
Mar 04 17:58:52 elcap1 flux[2353244]: cron.info[0]: synchronizing cron tasks to event heartbeat.pulse
Mar 04 17:59:03 elcap1 flux[2353244]: job-manager.info[0]: restart: 2212 jobs
Mar 04 17:59:03 elcap1 flux[2353244]: job-manager.info[0]: restart: 1 running jobs

Is there anything else I should look into?

garlick commented 9 months ago

I'm not sure the core dump would have been helpful in this case, but I did note that it was truncated at 2GB by the coredump.conf(5) limit (according to GDB, the actual core size would have been 11GB).

[root@elcap1:garlick]# coredumpctl info
           PID: 3572964 (flux-broker-0)
           UID: 765 (flux)
           GID: 753 (flux)
        Signal: 6 (ABRT)
     Timestamp: Mon 2024-03-04 17:55:31 PST (17h ago)
  Command Line: broker --config-path=/etc/flux/system/conf.d -Scron.directory=/etc/flux/system/cron.d -Srundir=/run/flux -Sstatedir=/var/lib>
    Executable: /usr/libexec/flux/cmd/flux-broker
 Control Group: /system.slice/flux.service
          Unit: 
         Slice: system.slice
       Boot ID: e04e401b8fa8485393a40c78f1c01306
    Machine ID: 9378f9a987e84e6c9194c1738fd5efea
      Hostname: elcap1
       Storage: /var/lib/systemd/coredump/core.flux-broker-0.765.e04e401b8fa8485393a40c78f1c01306.3572964.1709603731000000.lz4 (truncated)
       Message: Process 3572964 (flux-broker-0) of user 765 dumped core.

                Stack trace of thread 3573653:
                #0  0x00007ffff61e9acf n/a (n/a)
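
For reference, a minimal sketch of raising the systemd-coredump size limits so future dumps are not truncated (assuming systemd-coredump is the collector here, which the Storage: path above suggests; the drop-in file name and the 16G values are illustrative, not the settings actually used on elcap):

# /etc/systemd/coredump.conf.d/flux.conf  (hypothetical drop-in)
[Coredump]
ProcessSizeMax=16G
ExternalSizeMax=16G

systemd-coredump reads coredump.conf(5) at collection time, so no daemon restart should be needed. The existing truncated core can still be loaded with coredumpctl gdb 3572964, though the missing pages limit what GDB can recover.
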
garlick commented 9 months ago

We've had several crashes lately on that system.

[root@elcap1:systemd]# flux startlog
2023-08-08 10:40 - 2023-08-08 10:40 (0.86s)
2023-08-08 10:49 - 2023-08-31 05:21 (23d)
2023-08-31 07:38 - 2023-09-25 08:46 (25d)
2023-09-25 08:46 - 2023-12-10 17:08 (76d)
2023-12-10 17:34 - 2023-12-15 08:41 (4.6d)
2023-12-15 08:42 - crashed
2024-01-12 14:45 - 2024-01-23 11:29 (11d)
2024-01-23 12:13 - crashed
2024-02-12 11:02 - crashed
2024-02-13 12:24 - crashed
2024-02-15 10:57 - crashed
2024-02-16 21:49 - crashed
2024-02-23 16:07 - 2024-02-29 11:29 (5.8d)
2024-02-29 11:51 - crashed
2024-02-29 12:26 - crashed
2024-03-01 17:27 - crashed
2024-03-01 17:29 - crashed
2024-03-01 18:19 - crashed
2024-03-02 14:13 - crashed
2024-03-02 17:24 - crashed
2024-03-04 17:58 - running (18h)
garlick commented 9 months ago

The earliest log in the journal is Mar 03 20:00:55 (due to #5752?), so it's a bit hard to tell whether the same issue has been happening all along (or whether there are other clues).

garlick commented 9 months ago

Another failure, no core captured:

$ flux startlog
[snip]
2024-03-05 15:06 - 2024-03-05 16:22 (1.3h)
2024-03-05 16:23 - crashed
2024-03-05 20:07 - running (9.7h)
# journalctl -u flux
[snip]
Mar 05 20:07:03 elcap1 flux[3768674]: malloc_consolidate(): unaligned fastbin chunk detected
Mar 05 20:07:04 elcap1 systemd[1]: flux.service: Main process exited, code=killed, status=6/ABRT
Mar 05 20:07:04 elcap1 systemd[1]: flux.service: Failed with result 'signal'.
garlick commented 8 months ago

Another crash, this one immediately preceded by a bunch of imp kill errors.

Mar 07 14:44:24 elcap1 flux[4035212]: job-exec.info[0]: elcap212 (rank 72): imp kill: flux-imp: Fatal: kill: failed to initialize pid info: No such file or directory
Mar 07 14:44:24 elcap1 flux[4035212]: malloc_consolidate(): unaligned fastbin chunk detected
Mar 07 14:44:31 elcap1 systemd[1]: flux.service: Main process exited, code=killed, status=6/ABRT

No core - my changes to the core dump limits were reverted, so the dump was truncated again.

grondo commented 8 months ago

I was able to reproduce memory corruption (though not on the rank 0 broker) by running a 512 broker instance across 64 nodes with a flat tbon topology (-Stbon.topo=kary:512), e.g.

$  flux run -o exit-timeout=none -o pty.interactive -N64 -n 512 src/cmd/flux start -o,-Stbon.topo=kary:800 flux submit --cc=1-8000 --watch --progress --jps -n4 hostname

This reproduces memory corruption on random brokers (though not on rank 0 in my testing). I got two crashes from the above workload before it finished:

flux-job: task(s) exited with exit code 139
malloc_consolidate(): invalid chunk size
free(): corrupted unsorted chunks

The stack traces, though these were likely victims and not the cause:

(gdb) where
#0  0x0000155553dedacf in raise () from /lib64/libc.so.6
#1  0x0000155553dc0ea5 in abort () from /lib64/libc.so.6
#2  0x0000155553e2ecc7 in __libc_message () from /lib64/libc.so.6
#3  0x0000155553e35fcc in malloc_printerr () from /lib64/libc.so.6
#4  0x0000155553e38100 in _int_free () from /lib64/libc.so.6
#5  0x000000000042389e in flux_msg_destroy (msg=0x547710) at message.c:118
#6  flux_msg_destroy (msg=msg@entry=0x547710) at message.c:111
#7  0x00001555554e7d6a in handle_cb (r=<optimized out>, hw=<optimized out>, 
    revents=<optimized out>, arg=<optimized out>) at msg_handler.c:540
#8  0x0000155555511333 in ev_invoke_pending (loop=0x155530002ca0) at ev.c:3770
#9  0x0000155555514c05 in ev_run (flags=0, loop=0x155530002ca0) at ev.c:4190
#10 ev_run (loop=0x155530002ca0, flags=0) at ev.c:4021
#11 0x00001555554e713b in flux_reactor_run (r=0x155530002c80, 
    flags=flags@entry=0) at reactor.c:124
#12 0x000015555544ab73 in mod_main (h=0x155530000be0, argc=<optimized out>, 
    argv=<optimized out>) at kvs.c:3049
#13 0x00000000004109f6 in module_thread (arg=0x4e8220) at module.c:208
#14 0x000015555510d1ca in start_thread () from /lib64/libpthread.so.0
#15 0x0000155553dd8e73 in clone () from /lib64/libc.so.6
(gdb) where
#0  fzhashx_next (self=0x4dc3f0) at zhashx.c:613
#1  0x00000000004489bd in fzhashx_first (self=<optimized out>) at zhashx.c:584
#2  0x000000000044485e in subhash_topic_match (sh=0x4dc3c0, 
    topic=0x55ce00 "kvs.namespace-primary-setroot") at subhash.c:103
#3  0x0000000000411811 in module_event_cast (p=0x4dc0f0, 
    msg=msg@entry=0x539c40) at module.c:635
#4  0x0000000000411db8 in modhash_event_mcast (mh=0x48e910, 
    msg=msg@entry=0x539c40) at modhash.c:191
#5  0x000000000040e660 in handle_event (ctx=ctx@entry=0x7fffffffa140, 
    msg=0x539c40) at broker.c:1816
#6  0x000000000040fe1b in overlay_recv_cb (msg=0x7fffffff9d00, 
    where=OVERLAY_UPSTREAM, arg=0x7fffffffa140) at broker.c:1747
#7  0x0000000000413aed in parent_cb (r=<optimized out>, w=<optimized out>, 
    revents=<optimized out>, arg=0x4ace00) at overlay.c:1061
#8  0x0000000000420f17 in check_cb (loop=0x15555554a180 <default_loop_struct>, 
    w=0x4b56e8, revents=<optimized out>) at ev_zmq.c:79
#9  0x0000155555511333 in ev_invoke_pending (
    loop=0x15555554a180 <default_loop_struct>) at ev.c:3770
#10 0x0000155555514c05 in ev_run (flags=0, 
    loop=0x15555554a180 <default_loop_struct>) at ev.c:4190
#11 ev_run (loop=0x15555554a180 <default_loop_struct>, flags=0) at ev.c:4021
#12 0x00001555554e713b in flux_reactor_run (r=0x492970, flags=flags@entry=0)
    at reactor.c:124
#13 0x000000000040dd83 in main (argc=<optimized out>, argv=<optimized out>)
    at broker.c:532
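
Since these frames look like victims rather than the source of the corruption, one follow-up (a sketch only, not something run here) would be to repeat the reproducer with extra heap checking so the broker aborts closer to the bad write, either by rebuilding flux-core with AddressSanitizer or by enabling glibc's malloc debugging for the test brokers:

# Hypothetical: rebuild with AddressSanitizer to trap the first invalid heap write
./configure CFLAGS="-g -O1 -fsanitize=address -fno-omit-frame-pointer" LDFLAGS="-fsanitize=address"
make -j 16

# Or, without rebuilding: glibc heap consistency checks (abort on the first
# detected inconsistency) plus pattern-filling of freed memory
MALLOC_CHECK_=3 MALLOC_PERTURB_=165 src/cmd/flux start -o,-Stbon.topo=kary:512 ...

Valgrind's memcheck would also pinpoint the offending write, though it is probably too slow for a 512-broker run.
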
grondo commented 8 months ago

This might be a false alarm. I noticed that all the memory corruption crashes were on the same node:

$ ls *.core
corona265-flux-broker-425-219062.core  corona265-flux-broker-430-245463.core
corona265-flux-broker-426-245426.core  corona265-flux-broker-431-225521.core
corona265-flux-broker-428-232894.core  corona265-flux-broker-431-245473.core

When I excluded that node from my tests, I was able to run several times and up to 756 brokers without error. :confused:

grondo commented 8 months ago

This may have been fixed by #5803, but we've not been able to get a non-truncated core file on that system yet, so we can't be sure.

Shall we close this issue and open a new one if we start seeing crashes on el cap with the newest flux-core?

garlick commented 8 months ago

Sounds good to me (I'll go ahead and close).