Closed dongahn closed 4 years ago
@jameshcorbett: yes that should be very useful for us.
Even if you have the debug rpms, your mileage may vary: zlist is built with optimization, and in many cases the variables get optimized out, which makes debugging harder.
This corrupted json_t is most likely an object nested inside another json object. So without walking the list, I can't determine whether the outer json_t is corrupted, or whether the data structure holding the outer json_t is corrupted.
I haven't been able to find an obvious use-after-free in any part of this area related to any of these possibilities.
@dongahn I just gave you the new core and log files.
@jameshcorbett: you should also have issue2500.log from the Flux instance. Could you send that to me as well? Hopefully the corefile came from the outer flux instance.
You should have it already.
Sorry, I only remembered it after I wrote my comment.
@chu11: I gave you the log file.
quartz1922{dahn}22: give achu issue2500.log
Giving issue2500.log.
Gave 1 file(s).
@jameshcorbett: I also noticed that sched-simple was used instead of flux-sched (our fully featured scheduler). Once this debugging dust settles, I will reconfigure the installation so that you can use flux-sched instead.
2019-11-01T23:27:42.862312Z job-manager.debug[0]: submit_cb: added 3 jobs
2019-11-01T23:27:42.864734Z sched-simple.debug[0]: req: 264140488704: spec={1,16,1}
2019-11-01T23:27:42.864929Z sched-simple.debug[0]: alloc: 264140488704: rank0/core[0-15]
2019-11-01T23:27:42.874414Z job-manager.debug[0]: submit_cb: added 2 jobs
2019-11-01T23:27:42.881216Z sched-simple.debug[0]: req: 264224374784: spec={1,16,1}
2019-11-01T23:27:42.881394Z sched-simple.debug[0]: alloc: 264224374784: rank1/core[0-15]
2019-11-01T23:27:42.901526Z sched-simple.debug[0]: req: 264308260864: spec={1,16,1}
2019-11-01T23:27:42.901733Z sched-simple.debug[0]: alloc: 264308260864: rank2/core[0-15]
2019-11-01T23:27:42.907812Z kvs.debug[0]: aggregated 2 transactions (2 ops)
2019-11-01T23:27:42.913110Z sched-simple.debug[0]: req: 264375369728: spec={1,16,1}
2019-11-01T23:27:42.913327Z sched-simple.debug[0]: alloc: 264375369728: rank3/core[0-15]
2019-11-01T23:27:42.923549Z sched-simple.debug[0]: req: 264476033024: spec={1,16,1}
2019-11-01T23:27:42.923743Z sched-simple.debug[0]: alloc: 264476033024: rank0/core[16-31]
/usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017 has flux-sched. It is just that /usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg didn't have it.
@grondo, @garlick: I tried the in-tree flux and it fails for me. I talked to @jameshcorbett, and the same command works in an old session but not in a new session (after a fresh ssh into rztopaz).
quartz764{dahn}34: /usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/src/cmd/flux start -s 1
flux-start: warning: setting --bootstrap=selfpmi due to --size option
lt-flux-broker: zsecurity_comms_init: /usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/etc/flux: mode should be 0700
lt-flux-broker: overlay_bind failed: Operation not permitted
lt-flux-broker: bootstrap failed
flux-start: 0 (pid 67954) exited with rc=1
The permissions are:
rztopaz578{dahn}40: ll -d /usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/etc/flux
drwxrwsr-x 4 dahn tools 4096 Oct 31 11:26 /usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/etc/flux/
Do you know what might be causing this?
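For what it's worth, the mode check itself is easy to reproduce in isolation. A minimal sketch with plain coreutils (not a claim about the right fix for a shared install): the broker message above wants 0700, while the ll output shows a group-writable directory.

```shell
# Recreate the mismatch with a throwaway directory (no flux involved).
d=$(mktemp -d)
chmod 775 "$d"       # group/other access, similar to the ll output above
stat -c '%a' "$d"    # prints 775 -> the kind of mode the broker rejects
chmod 700 "$d"
stat -c '%a' "$d"    # prints 700 -> what zsecurity_comms_init expects
rmdir "$d"
```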
Sorry. I edited the path to the in-tree flux cmd above.
Nothing in the log looked too interesting to me KVS-wise, but I noticed this:
2019-11-01T23:39:56.116152Z kvs.info[0]: tagpool-normal expanded from 1024 to 2048 entries
2019-11-01T23:39:56.129649Z kvs.info[0]: tagpool-normal expanded from 2048 to 4096 entries
2019-11-01T23:39:56.320573Z broker.info[0]: tagpool-normal expanded from 1024 to 2048 entries
2019-11-01T23:39:56.154146Z kvs.info[0]: tagpool-normal expanded from 4096 to 8192 entries
2019-11-01T23:39:56.582265Z broker.info[0]: tagpool-normal expanded from 2048 to 4096 entries
2019-11-01T23:39:57.296304Z broker.debug[0]: content purge: 205 entries
2019-11-01T23:39:57.304147Z broker.info[0]: tagpool-normal expanded from 4096 to 8192 entries
2019-11-01T23:39:56.203997Z kvs.info[0]: tagpool-normal expanded from 8192 to 16384 entries
2019-11-01T23:39:56.255888Z kvs.debug[0]: aggregated 162 transactions (162 ops)
2019-11-01T23:39:58.227132Z kvs.debug[0]: aggregated 204 transactions (204 ops)
2019-11-01T23:39:58.905785Z broker.debug[0]: content purge: 228 entries
2019-11-01T23:40:00.405663Z kvs.info[0]: tagpool-normal expanded from 16384 to 32768 entries
2019-11-01T23:40:01.163255Z broker.debug[0]: content purge: 692 entries
2019-11-01T23:40:00.566211Z kvs.debug[0]: aggregated 207 transactions (207 ops)
2019-11-01T23:40:03.756036Z broker.debug[0]: content purge: 149 entries
2019-11-01T23:40:05.172796Z broker.info[0]: tagpool-normal expanded from 8192 to 16384 entries
2019-11-01T23:40:05.575580Z broker.debug[0]: content purge: 1262 entries
2019-11-01T23:40:03.705158Z kvs.info[0]: tagpool-normal expanded from 32768 to 65536 entries
2019-11-01T23:40:08.664350Z broker.debug[0]: content purge: 280 entries
2019-11-01T23:40:03.842327Z kvs.debug[0]: aggregated 212 transactions (212 ops)
Perhaps it's ok, but it seems like a huge exponential growth in tags over about a 0.6 second span.
@dongahn So I am sort of at the limit of what I can debug with gdb and the current core. Basically, there is a data structure inside the zlist which holds a pointer to a json KVS directory object. Inside this json object are json objects for the entries within the directory. It is one of these directory entries that is the corrupted json object in our trace above. The json KVS directory object is in turn owned by yet another data structure in the KVS (a cache object). It's hard to get a clue about what's going on without seeing which "parent object" might be corrupted.
Two ideas:
zlist. Of course, with changes, who knows if it'll reproduce. Of the two above, I would maybe prefer the latter, b/c I'm not 100% sure how I could get to some pointers even if I had access to the zlist (the sha hash is against the KVS data, and I'm not sure what the data is b/c the json object is corrupted, so I can't get the cache entry for that KVS data).
@chu11: Let's discuss this a bit offline. We need to involve @jameshcorbett as well, since until we find a simpler reproducer we will need his support. Maybe some time tomorrow?
I'm happy to continue running problems--it doesn't take that much of my time. I can also capture my environment from that old rztopaz session if that would be helpful, @dongahn.
@chu11: Oh btw, do we want to run this with valgrind to see if there is anything obvious? I will see if I can run a flux instance under memcheck_all. We will probably need to use a suppression file if Flux already has one.
> We will probably need to use a suppression file if Flux already has one.
Yep. We have one as part of our CI tests: https://github.com/flux-framework/flux-core/blob/008ad823bc21683b32918e6bccbc5c5a5ecdc853/t/valgrind/valgrind.supp
Nice!
> Oh btw, do we want to run this with valgrind to see if there is anything obvious?
I suppose that couldn't hurt to try.
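For anyone new to these files: a valgrind suppression is just a named stack-pattern block. The entry below is illustrative only (made-up name, guessed frames), not copied from the linked valgrind.supp:

```
# Lines starting with '#' are comments.
{
   illustrative_czmq_startup_leak
   Memcheck:Leak
   ...
   fun:zsys_init
}
```

The name on the first line is arbitrary, `Memcheck:Leak` selects the error kind, `...` matches any run of frames, and `fun:`/`obj:` lines match the offending stack.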
@jameshcorbett: I tried valgrind memcheck_all on my flux instance on rzgenie in the following way, and it worked.
$ srun -N 2 -n 2 --pty /usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg/bin/flux start --wrap=/usr/tce/bin/memcheck_all
rzgenie22{dahn}21: flux jobspec srun -N 2 -n 2 ./virtual_ring_mpi | flux job submit -
510430019584
rzgenie22{dahn}22: flux jobspec srun -N 2 -n 2 ./virtual_ring_mpi | flux job submit -
619163156480
rzgenie22{dahn}23: flux mini run -N 2 -n 2 ./virtual_ring_mpi
memcheckview flux-broker.1640811_3.*.mc
Unfortunately memcheck_all does not take additional valgrind options like --suppressions, so I was unable to suppress all the false positives. But it should still be worth trying it this way.
Notice that I didn't use --mpi=none in the srun options, because the memcheck_all tool requires the MPI_RANK environment variables.
Could you run your crashing program with memcheck_all to see if any apparent memory corruption can be detected?
Note that we have to expect this will slow down the execution by a factor of at least 10x. So please adjust your walltime limit.
Thanks.
@dongahn it's running now, and I gave it a fifteen-hour time limit.
Thanks @jameshcorbett.
I suspect valgrind spotted the root cause. This looks to me like a case of read-after-free. There are many instances of "Invalid read of size N" for varying sizes.
Invalid read of size 4 x1 (in /usr/lib64/libjansson.so.4.10.0)
[MPI Rank 0 only]
at 0x50C63AA: (in /usr/lib64/libjansson.so.4.10.0)
by 0x50C7742: json_vunpack_ex (in /usr/lib64/libjansson.so.4.10.0)
by 0x50C793A: json_unpack (in /usr/lib64/libjansson.so.4.10.0)
by 0xA91B682: treeobj_peek (treeobj.c:57)
by 0xA91B992: treeobj_get_type (treeobj.c:134)
by 0xA91BAE7: treeobj_is_dirref (treeobj.c:165)
by 0xA914487: lookup (lookup.c:1057)
by 0xA90D389: lookup_common (kvs.c:1388)
by 0xA90D785: lookup_plus_request_cb (kvs.c:1503)
by 0xA9124FA: wait_runone (waitqueue.c:173)
by 0xA9125AA: wait_runqueue (waitqueue.c:201)
by 0xA91160F: cache_entry_set_raw (cache.c:179)
by 0xA90B482: content_load_completion (kvs.c:531)
by 0x4E5F9A2: check_cb (future.c:796)
by 0x4E53FBE: check_cb (reactor.c:853)
by 0x4E90D92: ev_invoke_pending (ev.c:3322)
by 0x4E91C84: ev_run (ev.c:3726)
by 0x4E52716: flux_reactor_run (reactor.c:126)
by 0xA911169: mod_main (kvs.c:2984)
by 0x40BF75: module_thread (module.c:162)
Address 0xea90d60 is 0 bytes inside a block of size 80 free'd
at 0x4C2AF7D: free (vg_replace_malloc.c:530)
by 0x50C38B3: (in /usr/lib64/libjansson.so.4.10.0)
by 0x50C3958: (in /usr/lib64/libjansson.so.4.10.0)
by 0x50C8990: json_delete (in /usr/lib64/libjansson.so.4.10.0)
by 0x50C38B3: (in /usr/lib64/libjansson.so.4.10.0)
by 0x50C3958: (in /usr/lib64/libjansson.so.4.10.0)
by 0x50C8990: json_delete (in /usr/lib64/libjansson.so.4.10.0)
by 0xA9111ED: json_decref (jansson.h:112)
by 0xA91185F: cache_entry_destroy (cache.c:253)
by 0xA911EBF: cache_entry_destroy_wrapper (cache.c:433)
by 0x52F3F11: (in /usr/lib64/libczmq.so.3.0.0)
by 0xA911C1C: cache_expire_entries (cache.c:359)
by 0xA90CF6C: heartbeat_cb (kvs.c:1273)
by 0x4E54F57: call_handler (msg_handler.c:236)
by 0x4E550E4: dispatch_message (msg_handler.c:282)
by 0x4E5535F: handle_cb (msg_handler.c:374)
by 0x4E52C45: handle_cb (reactor.c:299)
by 0x4E5D1CD: check_cb (ev_flux.c:58)
by 0x4E90D92: ev_invoke_pending (ev.c:3322)
by 0x4E91C84: ev_run (ev.c:3726)
Block was alloc'd at
at 0x4C29E83: malloc (vg_replace_malloc.c:299)
by 0x50C819A: json_object (in /usr/lib64/libjansson.so.4.10.0)
by 0x50C50BC: (in /usr/lib64/libjansson.so.4.10.0)
by 0x50C5169: (in /usr/lib64/libjansson.so.4.10.0)
by 0x50C5169: (in /usr/lib64/libjansson.so.4.10.0)
by 0x50C54F5: (in /usr/lib64/libjansson.so.4.10.0)
by 0x50C57AD: json_loadb (in /usr/lib64/libjansson.so.4.10.0)
by 0xA91C8D1: treeobj_decodeb (treeobj.c:590)
by 0xA9117FE: cache_entry_get_treeobj (cache.c:242)
by 0xA90BF28: setroot_event_send (kvs.c:837)
by 0xA90C7CC: kvstxn_apply (kvs.c:1071)
by 0xA9124C1: wait_runone (waitqueue.c:171)
by 0xA9125AA: wait_runqueue (waitqueue.c:201)
by 0xA9113A4: cache_entry_set_dirty (cache.c:101)
by 0xA90B98F: content_store_completion (kvs.c:686)
by 0x4E5F9A2: check_cb (future.c:796)
by 0x4E53FBE: check_cb (reactor.c:853)
by 0x4E90D92: ev_invoke_pending (ev.c:3322)
by 0x4E91C84: ev_run (ev.c:3726)
by 0x4E52716: flux_reactor_run (reactor.c:126)
While there are various accesses at different points of execution that are marked as "invalid read", many seem to point to a premature "free". Maybe the conditional statement that leads to zlistx_delete has a logic bug?
https://github.com/flux-framework/flux-core/blob/master/src/modules/kvs/cache.c#L359
It seems that there are also various leaks that memcheck_all found. These are probably the ones that our valgrind unit test couldn't see. I gave these files to @chu11.
rztopaz578{dahn}36: give -l
achu has been given:
8 MB Nov 05 18:49 flux-broker.2963686_0.0.mc
1 KB Nov 05 18:49 flux-broker.2963686_0.1.mc
1 KB Nov 05 18:49 flux-broker.2963686_0.2.mc
1 KB Nov 05 18:49 flux-broker.2963686_0.3.mc
345 MB Nov 01 10:23 rztopaz290-flux-broker-0-22454.core
363 MB Nov 01 10:23 rztopaz484-flux-broker-0-48886.core
208 MB Nov 05 18:49 vgcore.69924
7 file(s)
Nice find, @dongahn!
Thanks @grondo. It was actually James who did the bulk of the lifting, and @chu11 pointed me to the --wrap option. Thanks James and Al.
Hopefully this can lead to a quick fix...(don't know how deep the bug is though).
@dongahn great find, that definitely gets me looking in a specific area.
Yeah, it appears that a cache entry is getting expired before it is populated, which should be impossible. Will have to look into this.
Ok, I think I see a path where a lookup assumes that a cache entry has not been expired, but it could have been. Couldn't run the reproducer tonight; will try tomorrow.
diff --git a/src/modules/kvs/lookup.c b/src/modules/kvs/lookup.c
index 4119e96..ed82787 100644
--- a/src/modules/kvs/lookup.c
+++ b/src/modules/kvs/lookup.c
@@ -85,7 +85,7 @@ struct lookup {
/* API internal */
zlist_t *levels;
- const json_t *wdirent; /* result after walk() */
+ json_t *wdirent; /* result after walk() */
enum {
LOOKUP_STATE_INIT,
LOOKUP_STATE_CHECK_NAMESPACE,
@@ -487,7 +487,7 @@ static lookup_process_t walk (lookup_t *lh)
zlist_remove (wl->pathcomps, pathcomp);
}
- lh->wdirent = wl->dirent;
+ lh->wdirent = json_incref ((json_t *)wl->dirent);
done:
return LOOKUP_PROCESS_FINISHED;
@@ -590,6 +590,7 @@ void lookup_destroy (lookup_t *lh)
json_decref (lh->val);
free (lh->missing_namespace);
zlist_destroy (&lh->levels);
+ json_decref (lh->wdirent);
free (lh);
}
}
Am obviously breaking const-ness, gotta figure out what would be better than this later.
Tested with the above patch and got a different crash than before. I think it is the same logical problem, but in a different place. Just got to think about how to deal with the problem more generically.
#0 0x00002aaaaaf5f8b7 in unpack () from /lib64/libjansson.so.4
#1 0x00002aaaaaf60743 in json_vunpack_ex () from /lib64/libjansson.so.4
#2 0x00002aaaaaf6093b in json_unpack () from /lib64/libjansson.so.4
#3 0x00002aaac094178e in treeobj_peek (obj=0x2aaaf8914d00, typep=0x2aaac0d894f8, datap=0x0) at treeobj.c:57
#4 0x00002aaac0941a9e in treeobj_get_type (obj=0x2aaaf8914d00) at treeobj.c:134
#5 0x00002aaac0941bf3 in treeobj_is_dirref (obj=0x2aaaf8914d00) at treeobj.c:165
#6 0x00002aaac0939040 in walk (lh=0x2aaaf31ad9c0) at lookup.c:362
#7 0x00002aaac093a4c6 in lookup (lh=0x2aaaf31ad9c0) at lookup.c:1028
#8 0x00002aaac093347a in lookup_common (h=0x2aaac8001160, mh=0x2aaac800b9e0, msg=0x2aaaf9e95470, arg=0x2aaac800b8a0,
replay_cb=0x2aaac093382c <lookup_plus_request_cb>, stall=0x2aaac0d89793) at kvs.c:1388
#9 0x00002aaac0933876 in lookup_plus_request_cb (h=0x2aaac8001160, mh=0x2aaac800b9e0, msg=0x2aaaf9e95470, arg=0x2aaac800b8a0) at kvs.c:1503
#10 0x00002aaac09385eb in wait_runone (w=0x2aaaf0880180) at waitqueue.c:173
#11 0x00002aaac093869b in wait_runqueue (q=0x2aaaf6c01e40) at waitqueue.c:201
#12 0x00002aaac0937700 in cache_entry_set_raw (entry=0x2aaaf9f08d80, data=0x2aaaf955b708, len=365) at cache.c:179
#13 0x00002aaac0931573 in content_load_completion (f=0x2aaaf7ce7eb0, arg=0x2aaac800b8a0) at kvs.c:531
#14 0x00002aaaaacf88cf in check_cb (r=0x2aaac8009a50, w=0x2aaaf9e954a0, revents=0, arg=0x2aaaf7ce7eb0) at future.c:796
#15 0x00002aaaaacec3d1 in check_cb (loop=0x2aaac800a4f0, cw=0x2aaaf9e954c8, revents=32768) at reactor.c:853
#16 0x00002aaaaad29d59 in ev_invoke_pending (loop=0x2aaac800a4f0) at ev.c:3322
#17 0x00002aaaaad2ac4b in ev_run (loop=0x2aaac800a4f0, flags=0) at ev.c:3726
#18 0x00002aaaaaceab29 in flux_reactor_run (r=0x2aaac8009a50, flags=0) at reactor.c:126
#19 0x00002aaac093725a in mod_main (h=0x2aaac8001160, argc=0, argv=0x2aaac800b3c0) at kvs.c:2984
#20 0x000000000040bfe1 in module_thread (arg=0x679ca0) at module.c:162
#21 0x00002aaaab68aea5 in start_thread (arg=0x2aaac0d8a700) at pthread_create.c:307
#22 0x00002aaaac2fe8cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Closing this. If a problem pops out again, we can reopen.
@jameshcorbett integrated Flux into the UQP and ran it w/ a large LLNL application. An ensemble ran for a while (~10 mins) and failed because a Flux broker took SIGSEGV and crashed. The backtrace is:
The core file size is 435MB.