flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Flux crashes in Jansson's unpack on UQP ensemble #2500

Closed dongahn closed 4 years ago

dongahn commented 4 years ago

@jameshcorbett integrated Flux into UQP and ran it with a large LLNL application. An ensemble ran for a while (~10 min) and then failed because a Flux broker took a SIGSEGV and crashed. The backtrace is:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/collab/usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017/libexec/flu'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002aaaaaf4d8b7 in unpack () from /lib64/libjansson.so.4
Missing separate debuginfos, use: debuginfo-install boost-filesystem-1.53.0-27.el7.x86_64 boost-graph-1.53.0-27.el7.x86_64 boost-regex-1.53.0-27.el7.x86_64 boost-system-1.53.0-27.el7.x86_64 czmq-3.0.2-3.el7.x86_64 hwloc-libs-1.11.8-4.el7.x86_64 jansson-2.10-1.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.1chaos.ch6.x86_64 libcom_err-1.42.9-16.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libicu-50.2-3.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libsodium-1.0.18-1.el7.x86_64 libstdc++-4.8.5-39.el7.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 libuuid-2.23.2-61.el7.x86_64 lz4-1.7.5-3.el7.x86_64 munge-libs-0.5.13-1.ch6.x86_64 numactl-libs-2.0.12-3.el7.x86_64 openpgm-5.2.122-2.el7.x86_64 pcre-8.32-17.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 yaml-cpp-0.5.1-2.el7.x86_64 zeromq-4.1.5-3.ch6.x86_64
(gdb) bt
#0  0x00002aaaaaf4d8b7 in unpack () from /lib64/libjansson.so.4
#1  0x00002aaaaaf4e743 in json_vunpack_ex () from /lib64/libjansson.so.4
#2  0x00002aaaaaf4e93b in json_unpack () from /lib64/libjansson.so.4
#3  0x00002aaac0935a0c in treeobj_peek (obj=<optimized out>, typep=0x2aaac0d73708, datap=0x0) at treeobj.c:57
#4  0x00002aaac0935d65 in treeobj_get_type (obj=<optimized out>) at treeobj.c:134
#5  0x00002aaac0935e99 in treeobj_is_dirref (obj=<optimized out>) at treeobj.c:165
#6  0x00002aaac092fa79 in lookup (lh=0x2aab00593dd0) at lookup.c:1057
#7  0x00002aaac092ba5c in lookup_common (h=h@entry=0x2aaac80011c0, mh=0x2aaac800db20, msg=msg@entry=0x2aaacb751bc0,
    arg=0x2aaac800b8b0, replay_cb=replay_cb@entry=0x2aaac092cb40 <lookup_plus_request_cb>, stall=<optimized out>)
    at kvs.c:1388
#8  0x00002aaac092cb6a in lookup_plus_request_cb (h=0x2aaac80011c0, mh=<optimized out>, msg=0x2aaacb751bc0,
    arg=<optimized out>) at kvs.c:1503
#9  0x00002aaac092ed1b in wait_runone (w=0x2aaac84a65d0) at waitqueue.c:173
#10 wait_runqueue (q=<optimized out>) at waitqueue.c:201
#11 0x00002aaac092e11c in cache_entry_set_raw (entry=entry@entry=0x2aaac8673f60, data=0x2aaacb50c458, len=286)
    at cache.c:179
#12 0x00002aaac092c4b1 in content_load_completion (f=0x2aaac86744a0, arg=0x2aaac800b8b0) at kvs.c:531
#13 0x00002aaaaad179d5 in ev_invoke_pending (loop=0x2aaac800a4e0) at ev.c:3322
#14 0x00002aaaaad1ab47 in ev_run (loop=0x2aaac800a4e0, flags=0) at ev.c:3726
#15 0x00002aaaaaceb593 in flux_reactor_run (r=0x2aaac8001270, flags=flags@entry=0) at reactor.c:126
#16 0x00002aaac092db01 in mod_main (h=0x2aaac80011c0, argc=<optimized out>, argv=<optimized out>) at kvs.c:2984
#17 0x000000000040abf6 in module_thread (arg=0x672360) at module.c:162
#18 0x00002aaaab678ea5 in start_thread (arg=0x2aaac0d74700) at pthread_create.c:307
#19 0x00002aaaac2ec8cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) quit

The core file size is 435MB.

dongahn commented 4 years ago

@jameshcorbett: yes that should be very useful for us.

chu11 commented 4 years ago

Even if you have the debug rpms, your mileage may vary because zlist is built with optimization, and in many cases the variables have been optimized out and are not visible to the debugger.

This corrupted json_t is most likely an object inside another json object. So without walking the list, I can't determine whether the outer json_t is corrupted, or the data structure holding the outer json_t is corrupted.

I haven't been able to find an obvious use-after-free in any part of this area related to any of these possibilities.
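
For reference, here is a minimal, self-contained sketch (plain jansson, not flux-core code; the object layout is invented) of the failure mode in question: a borrowed json_t whose owner has already been decref'd, then handed to json_unpack(), which is where the backtrace above dies.

#include <jansson.h>
#include <stdio.h>

int main (void)
{
    /* dir owns the inner object; "entry" stands in for one directory entry */
    json_t *dir = json_pack ("{s:{s:s}}", "entry", "type", "dirref");
    json_t *entry = json_object_get (dir, "entry");  /* borrowed reference */

    json_decref (dir);   /* drops dir to refcount 0, freeing entry too */

    const char *type;
    /* use-after-free: entry now points at freed memory, so this
     * json_unpack() is the same invalid-read / segfault pattern as
     * the unpack() frame in the backtrace */
    if (json_unpack (entry, "{s:s}", "type", &type) == 0)
        printf ("type=%s\n", type);
    return 0;
}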

jameshcorbett commented 4 years ago

@dongahn I just gave you the new core and log files.

dongahn commented 4 years ago

@jameshcorbett: you should also have issue2500.log from the Flux instance. Could you send that to me as well? Hopefully the corefile came from the outer flux instance.

jameshcorbett commented 4 years ago

You should have it already.

jameshcorbett commented 4 years ago

Sorry, I only remembered it after I wrote my comment.

dongahn commented 4 years ago

@chu11: I gave you the log file.

quartz1922{dahn}22: give achu issue2500.log
Giving issue2500.log.
Gave 1 file(s).

dongahn commented 4 years ago

@jameshcorbett: I also noticed that sched-simple was used instead of flux-sched (our fully featured scheduler). Once this debugging dust settles, I will reconfigure the installation so that you can use flux-sched instead.

2019-11-01T23:27:42.862312Z job-manager.debug[0]: submit_cb: added 3 jobs
2019-11-01T23:27:42.864734Z sched-simple.debug[0]: req: 264140488704: spec={1,16,1}
2019-11-01T23:27:42.864929Z sched-simple.debug[0]: alloc: 264140488704: rank0/core[0-15]
2019-11-01T23:27:42.874414Z job-manager.debug[0]: submit_cb: added 2 jobs
2019-11-01T23:27:42.881216Z sched-simple.debug[0]: req: 264224374784: spec={1,16,1}
2019-11-01T23:27:42.881394Z sched-simple.debug[0]: alloc: 264224374784: rank1/core[0-15]
2019-11-01T23:27:42.901526Z sched-simple.debug[0]: req: 264308260864: spec={1,16,1}
2019-11-01T23:27:42.901733Z sched-simple.debug[0]: alloc: 264308260864: rank2/core[0-15]
2019-11-01T23:27:42.907812Z kvs.debug[0]: aggregated 2 transactions (2 ops)
2019-11-01T23:27:42.913110Z sched-simple.debug[0]: req: 264375369728: spec={1,16,1}
2019-11-01T23:27:42.913327Z sched-simple.debug[0]: alloc: 264375369728: rank3/core[0-15]
2019-11-01T23:27:42.923549Z sched-simple.debug[0]: req: 264476033024: spec={1,16,1}
2019-11-01T23:27:42.923743Z sched-simple.debug[0]: alloc: 264476033024: rank0/core[16-31]

dongahn commented 4 years ago

/usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017 has the flux-sched.

It is just that /usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg didn't have it.

dongahn commented 4 years ago

@grondo, @garlick: I tried the in-tree flux and it fails for me. I talked to @jameshcorbett, and the same command works in an old session but not in a new session (a fresh ssh into rztopaz).

quartz764{dahn}34: /usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/src/cmd/flux start -s 1
flux-start: warning: setting --bootstrap=selfpmi due to --size option
lt-flux-broker: zsecurity_comms_init: /usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/etc/flux: mode should be 0700
lt-flux-broker: overlay_bind failed: Operation not permitted
lt-flux-broker: bootstrap failed
flux-start: 0 (pid 67954) exited with rc=1

The permissions are:

rztopaz578{dahn}40: ll -d /usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/etc/flux
drwxrwsr-x 4 dahn tools 4096 Oct 31 11:26 /usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/etc/flux/

Do you know what might be causing this?

dongahn commented 4 years ago

Sorry. I edited the path to the in-tree flux cmd above.

chu11 commented 4 years ago

Nothing KVS-wise in the log looked too interesting to me, but I noticed this:

2019-11-01T23:39:56.116152Z kvs.info[0]: tagpool-normal expanded from 1024 to 2048 entries
2019-11-01T23:39:56.129649Z kvs.info[0]: tagpool-normal expanded from 2048 to 4096 entries
2019-11-01T23:39:56.320573Z broker.info[0]: tagpool-normal expanded from 1024 to 2048 entries
2019-11-01T23:39:56.154146Z kvs.info[0]: tagpool-normal expanded from 4096 to 8192 entries
2019-11-01T23:39:56.582265Z broker.info[0]: tagpool-normal expanded from 2048 to 4096 entries
2019-11-01T23:39:57.296304Z broker.debug[0]: content purge: 205 entries
2019-11-01T23:39:57.304147Z broker.info[0]: tagpool-normal expanded from 4096 to 8192 entries
2019-11-01T23:39:56.203997Z kvs.info[0]: tagpool-normal expanded from 8192 to 16384 entries
2019-11-01T23:39:56.255888Z kvs.debug[0]: aggregated 162 transactions (162 ops)
2019-11-01T23:39:58.227132Z kvs.debug[0]: aggregated 204 transactions (204 ops)
2019-11-01T23:39:58.905785Z broker.debug[0]: content purge: 228 entries
2019-11-01T23:40:00.405663Z kvs.info[0]: tagpool-normal expanded from 16384 to 32768 entries
2019-11-01T23:40:01.163255Z broker.debug[0]: content purge: 692 entries
2019-11-01T23:40:00.566211Z kvs.debug[0]: aggregated 207 transactions (207 ops)
2019-11-01T23:40:03.756036Z broker.debug[0]: content purge: 149 entries
2019-11-01T23:40:05.172796Z broker.info[0]: tagpool-normal expanded from 8192 to 16384 entries
2019-11-01T23:40:05.575580Z broker.debug[0]: content purge: 1262 entries
2019-11-01T23:40:03.705158Z kvs.info[0]: tagpool-normal expanded from 32768 to 65536 entries
2019-11-01T23:40:08.664350Z broker.debug[0]: content purge: 280 entries
2019-11-01T23:40:03.842327Z kvs.debug[0]: aggregated 212 transactions (212 ops)

Perhaps it's ok, but it seems like a huge exponential growth in tags over about a 0.6 second span.

chu11 commented 4 years ago

@dongahn So I am sort of at the limit of what I can debug with gdb and the current core. Basically, there is a data structure inside the zlist which holds a pointer to a json KVS directory object. Inside this json object are json objects for the entries within the directory. It is one of these directory entries that is the corrupted json object in the trace above. The json KVS directory object is owned by yet another data structure in the KVS (the cache object). It's hard to get a clue about what's going on without seeing which "parent object" might be corrupted.
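
Roughly, the ownership chain looks like this (the struct and field names below are invented for illustration and are not the actual flux-core declarations):

#include <jansson.h>

/* Sketch of the chain described above -- invented names, not flux-core code. */
struct cache_entry_sketch {
    json_t *dir;            /* owning reference to the KVS directory object */
};

struct lookup_sketch {
    const json_t *wdirent;  /* borrowed pointer INTO cache_entry_sketch.dir,
                             * i.e. one of the directory's entry objects */
};

/* If the cache entry is destroyed (json_decref on dir) while a stalled
 * lookup still holds wdirent, a later treeobj_get_type()-style
 * json_unpack() on wdirent walks freed memory. */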

Two ideas:

Of the two above, I would probably prefer the latter, because I'm not 100% sure how I could get to some of the pointers even if I had access to the zlist (the sha hash is against KVS data; I'm not sure what the data is because the json object is corrupted, so I can't get the cache entry for that KVS data).

dongahn commented 4 years ago

@chu11: Let's discuss this a bit offline. We need to involve @jameshcorbett as well, since until we find a simpler reproducer we will need his support. Maybe sometime tomorrow?

jameshcorbett commented 4 years ago

I'm happy to continue running problems--it doesn't take that much of my time. I can also capture my environment from that old rztopaz session if that would be helpful, @dongahn.

dongahn commented 4 years ago

@chu11: Oh btw, do we want to run this with valgrind to see if there is anything obvious? I will see if I can run a flux instance under memcheck_all. We will probably need to use a suppression file if Flux already has one.

SteVwonder commented 4 years ago

We will probably need to use a suppression file if Flux already has one.

Yep. We have one as part of our CI tests: https://github.com/flux-framework/flux-core/blob/008ad823bc21683b32918e6bccbc5c5a5ecdc853/t/valgrind/valgrind.supp

dongahn commented 4 years ago

Nice!

chu11 commented 4 years ago

Oh btw, do we want to run this with valgrind to see if there is anything obvious?

I suppose that couldn't hurt to try.

dongahn commented 4 years ago

@jameshcorbett:

I tried valgrind memcheck_all on my flux instance on rzgenie in the following way and it worked.

$ srun -N 2 -n 2 --pty /usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg/bin/flux start --wrap=/usr/tce/bin/memcheck_all

rzgenie22{dahn}21: flux jobspec srun -N 2 -n 2 ./virtual_ring_mpi | flux job submit -
510430019584
rzgenie22{dahn}22: flux jobspec srun -N 2 -n 2 ./virtual_ring_mpi | flux job submit -
619163156480
rzgenie22{dahn}23: flux mini run -N 2 -n 2 ./virtual_ring_mpi
memcheckview flux-broker.1640811_3.*.mc

Unfortunately memcheck_all does not take additional valgrind options like --suppressions, so I was unable to suppress all the false positives. But it should still be worth trying it this way.

Notice that I didn't use --mpi=none in the srun options because the memcheck_all tool requires the MPI_RANK environment variables.

Could you run your crashing program with memcheck_all to see if any apparent memory corruption can be detected?

Note that we should expect this to slow down execution by a factor of at least 10x, so please adjust your walltime limit.

Thanks.

jameshcorbett commented 4 years ago

@dongahn it's running now, and I gave it a fifteen-hour time limit.

dongahn commented 4 years ago

Thanks @jameshcorbett.

I suspect valgrind spotted the root cause. This appears to me to be a case of read-after-free. There are many "Invalid read" reports of varying sizes.

Invalid read of size 4                                             x1 (in /usr/lib64/libjansson.so.4.10.0)
[MPI Rank 0 only]
  at 0x50C63AA: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C7742: json_vunpack_ex (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C793A: json_unpack (in /usr/lib64/libjansson.so.4.10.0)
  by 0xA91B682: treeobj_peek (treeobj.c:57)
  by 0xA91B992: treeobj_get_type (treeobj.c:134)
  by 0xA91BAE7: treeobj_is_dirref (treeobj.c:165)
  by 0xA914487: lookup (lookup.c:1057)
  by 0xA90D389: lookup_common (kvs.c:1388)
  by 0xA90D785: lookup_plus_request_cb (kvs.c:1503)
  by 0xA9124FA: wait_runone (waitqueue.c:173)
  by 0xA9125AA: wait_runqueue (waitqueue.c:201)
  by 0xA91160F: cache_entry_set_raw (cache.c:179)
  by 0xA90B482: content_load_completion (kvs.c:531)
  by 0x4E5F9A2: check_cb (future.c:796)
  by 0x4E53FBE: check_cb (reactor.c:853)
  by 0x4E90D92: ev_invoke_pending (ev.c:3322)
  by 0x4E91C84: ev_run (ev.c:3726)
  by 0x4E52716: flux_reactor_run (reactor.c:126)
  by 0xA911169: mod_main (kvs.c:2984)
  by 0x40BF75: module_thread (module.c:162)
Address 0xea90d60 is 0 bytes inside a block of size 80 free'd
  at 0x4C2AF7D: free (vg_replace_malloc.c:530)
  by 0x50C38B3: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C3958: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C8990: json_delete (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C38B3: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C3958: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C8990: json_delete (in /usr/lib64/libjansson.so.4.10.0)
  by 0xA9111ED: json_decref (jansson.h:112)
  by 0xA91185F: cache_entry_destroy (cache.c:253)
  by 0xA911EBF: cache_entry_destroy_wrapper (cache.c:433)
  by 0x52F3F11: (in /usr/lib64/libczmq.so.3.0.0)
  by 0xA911C1C: cache_expire_entries (cache.c:359)
  by 0xA90CF6C: heartbeat_cb (kvs.c:1273)
  by 0x4E54F57: call_handler (msg_handler.c:236)
  by 0x4E550E4: dispatch_message (msg_handler.c:282)
  by 0x4E5535F: handle_cb (msg_handler.c:374)
  by 0x4E52C45: handle_cb (reactor.c:299)
  by 0x4E5D1CD: check_cb (ev_flux.c:58)
  by 0x4E90D92: ev_invoke_pending (ev.c:3322)
  by 0x4E91C84: ev_run (ev.c:3726)
Block was alloc'd at
  at 0x4C29E83: malloc (vg_replace_malloc.c:299)
  by 0x50C819A: json_object (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C50BC: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C5169: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C5169: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C54F5: (in /usr/lib64/libjansson.so.4.10.0)
  by 0x50C57AD: json_loadb (in /usr/lib64/libjansson.so.4.10.0)
  by 0xA91C8D1: treeobj_decodeb (treeobj.c:590)
  by 0xA9117FE: cache_entry_get_treeobj (cache.c:242)
  by 0xA90BF28: setroot_event_send (kvs.c:837)
  by 0xA90C7CC: kvstxn_apply (kvs.c:1071)
  by 0xA9124C1: wait_runone (waitqueue.c:171)
  by 0xA9125AA: wait_runqueue (waitqueue.c:201)
  by 0xA9113A4: cache_entry_set_dirty (cache.c:101)
  by 0xA90B98F: content_store_completion (kvs.c:686)
  by 0x4E5F9A2: check_cb (future.c:796)
  by 0x4E53FBE: check_cb (reactor.c:853)
  by 0x4E90D92: ev_invoke_pending (ev.c:3322)
  by 0x4E91C84: ev_run (ev.c:3726)
  by 0x4E52716: flux_reactor_run (reactor.c:126)

While various accesses at different points of execution are flagged as "invalid read", many of them seem to point back to the same premature "free".

Maybe the conditional statement that leads to zlistx_delete has a logic bug?

https://github.com/flux-framework/flux-core/blob/master/src/modules/kvs/cache.c#L359
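
For concreteness, the kind of guard being asked about might look something like this; the struct and fields are invented, and whether the real condition at cache.c:359 is missing such a check is exactly what needs verifying:

#include <jansson.h>
#include <stdbool.h>

/* Schematic only -- invented fields, not the real cache.c entry. */
struct kvs_cache_entry_sketch {
    json_t *o;             /* owning reference to the cached treeobj */
    int lastuse_epoch;     /* last heartbeat epoch this entry was touched */
    bool dirty;            /* not yet written back to the content store */
    int waiters;           /* stalled lookups still parked on this entry */
};

/* An expiry test that can never free an entry someone still borrows. */
static bool cache_entry_expirable (const struct kvs_cache_entry_sketch *e,
                                   int current_epoch, int max_age)
{
    return (current_epoch - e->lastuse_epoch > max_age)
           && !e->dirty
           && e->waiters == 0;
}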

dongahn commented 4 years ago

It seems that memcheck_all also found various leaks. These are probably the ones that our valgrind unit test couldn't see. I gave these files to @chu11.

rztopaz578{dahn}36: give -l
 achu has been given:
     8 MB Nov 05 18:49 flux-broker.2963686_0.0.mc
     1 KB Nov 05 18:49 flux-broker.2963686_0.1.mc
     1 KB Nov 05 18:49 flux-broker.2963686_0.2.mc
     1 KB Nov 05 18:49 flux-broker.2963686_0.3.mc
   345 MB Nov 01 10:23 rztopaz290-flux-broker-0-22454.core
   363 MB Nov 01 10:23 rztopaz484-flux-broker-0-48886.core
   208 MB Nov 05 18:49 vgcore.69924
   7 file(s)

grondo commented 4 years ago

Nice find, @dongahn!

dongahn commented 4 years ago

Thanks @grondo. It was actually James who did the heavy lifting, and @chu11 pointed me to the --wrap option. Thanks James and Al.

Hopefully this can lead to a quick fix...(don't know how deep the bug is though).

chu11 commented 4 years ago

@dongahn great find, that definitely gets me looking in a specific area.

Yeah, it appears that a cache entry is getting expired before it is populated, which should be impossible. Will have to look into this.

chu11 commented 4 years ago

Ok, I think I see a path where a lookup assumes that a cache entry has not been expired, but it could have been. Couldn't run the reproducer tonight; will try tomorrow.

diff --git a/src/modules/kvs/lookup.c b/src/modules/kvs/lookup.c
index 4119e96..ed82787 100644
--- a/src/modules/kvs/lookup.c
+++ b/src/modules/kvs/lookup.c
@@ -85,7 +85,7 @@ struct lookup {

     /* API internal */
     zlist_t *levels;
-    const json_t *wdirent;       /* result after walk() */
+    json_t *wdirent;       /* result after walk() */
     enum {
         LOOKUP_STATE_INIT,
         LOOKUP_STATE_CHECK_NAMESPACE,
@@ -487,7 +487,7 @@ static lookup_process_t walk (lookup_t *lh)

         zlist_remove (wl->pathcomps, pathcomp);
     }
-    lh->wdirent = wl->dirent;
+    lh->wdirent = json_incref ((json_t *)wl->dirent);

 done:
     return LOOKUP_PROCESS_FINISHED;
@@ -590,6 +590,7 @@ void lookup_destroy (lookup_t *lh)
         json_decref (lh->val);
         free (lh->missing_namespace);
         zlist_destroy (&lh->levels);
+        json_decref (lh->wdirent);
         free (lh);
     }
 }

I'm obviously breaking const-ness here; I'll have to figure out something better than this later.
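
One option, sketched with invented names rather than as a real patch, is to keep the member const and confine the cast to the two places where the refcount is adjusted:

#include <jansson.h>

/* Sketch only -- invented names, not flux-core code: keep a const view
 * of the object but own one counted reference to it, casting away const
 * only where the refcount is adjusted. */
struct holder {
    const json_t *obj;
};

static void holder_set (struct holder *h, const json_t *o)
{
    json_incref ((json_t *)o);      /* take our own reference */
    h->obj = o;
}

static void holder_clear (struct holder *h)
{
    json_decref ((json_t *)h->obj); /* release it; cast confined to here */
    h->obj = NULL;
}

int main (void)
{
    json_t *dir = json_pack ("{s:s}", "type", "dirref");
    struct holder h = { NULL };

    holder_set (&h, dir);
    json_decref (dir);              /* original owner lets go... */
    holder_clear (&h);              /* ...but h still held a valid reference */
    return 0;
}

Whether a confined cast is any nicer than dropping const from the member is debatable; just noting the option.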

chu11 commented 4 years ago

Tested with the above patch and got a different crash than before. I think it is the same logical problem, just in a different place. I just have to think about how to deal with the problem more generically.

#0  0x00002aaaaaf5f8b7 in unpack () from /lib64/libjansson.so.4
#1  0x00002aaaaaf60743 in json_vunpack_ex () from /lib64/libjansson.so.4
#2  0x00002aaaaaf6093b in json_unpack () from /lib64/libjansson.so.4
#3  0x00002aaac094178e in treeobj_peek (obj=0x2aaaf8914d00, typep=0x2aaac0d894f8, datap=0x0) at treeobj.c:57
#4  0x00002aaac0941a9e in treeobj_get_type (obj=0x2aaaf8914d00) at treeobj.c:134
#5  0x00002aaac0941bf3 in treeobj_is_dirref (obj=0x2aaaf8914d00) at treeobj.c:165
#6  0x00002aaac0939040 in walk (lh=0x2aaaf31ad9c0) at lookup.c:362
#7  0x00002aaac093a4c6 in lookup (lh=0x2aaaf31ad9c0) at lookup.c:1028
#8  0x00002aaac093347a in lookup_common (h=0x2aaac8001160, mh=0x2aaac800b9e0, msg=0x2aaaf9e95470, arg=0x2aaac800b8a0, 
    replay_cb=0x2aaac093382c <lookup_plus_request_cb>, stall=0x2aaac0d89793) at kvs.c:1388
#9  0x00002aaac0933876 in lookup_plus_request_cb (h=0x2aaac8001160, mh=0x2aaac800b9e0, msg=0x2aaaf9e95470, arg=0x2aaac800b8a0) at kvs.c:1503
#10 0x00002aaac09385eb in wait_runone (w=0x2aaaf0880180) at waitqueue.c:173
#11 0x00002aaac093869b in wait_runqueue (q=0x2aaaf6c01e40) at waitqueue.c:201
#12 0x00002aaac0937700 in cache_entry_set_raw (entry=0x2aaaf9f08d80, data=0x2aaaf955b708, len=365) at cache.c:179
#13 0x00002aaac0931573 in content_load_completion (f=0x2aaaf7ce7eb0, arg=0x2aaac800b8a0) at kvs.c:531
#14 0x00002aaaaacf88cf in check_cb (r=0x2aaac8009a50, w=0x2aaaf9e954a0, revents=0, arg=0x2aaaf7ce7eb0) at future.c:796
#15 0x00002aaaaacec3d1 in check_cb (loop=0x2aaac800a4f0, cw=0x2aaaf9e954c8, revents=32768) at reactor.c:853
#16 0x00002aaaaad29d59 in ev_invoke_pending (loop=0x2aaac800a4f0) at ev.c:3322
#17 0x00002aaaaad2ac4b in ev_run (loop=0x2aaac800a4f0, flags=0) at ev.c:3726
#18 0x00002aaaaaceab29 in flux_reactor_run (r=0x2aaac8009a50, flags=0) at reactor.c:126
#19 0x00002aaac093725a in mod_main (h=0x2aaac8001160, argc=0, argv=0x2aaac800b3c0) at kvs.c:2984
#20 0x000000000040bfe1 in module_thread (arg=0x679ca0) at module.c:162
#21 0x00002aaaab68aea5 in start_thread (arg=0x2aaac0d8a700) at pthread_create.c:307
#22 0x00002aaaac2fe8cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

dongahn commented 4 years ago

Closing this. If a problem pops out again, we can reopen.