Here is the version info:
rztopaz572{dahn}38: /usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017/bin/flux --version
commands: 0.13.0-83-g30ccef9
libflux-core: 0.13.0-83-g30ccef9
Looking at the corefile using TotalView, it is the KVS module that's crashing.
TotalView's backtrace is slightly different in that the leaf function that got the segmentation violation is json_get_alloc_funcs():
Too bad we don't have the jansson debuginfo package installed. Not sure how we can investigate what might actually be in that treeobj json_t *, or whether obj is even valid memory. Maybe @chu11 will have some ideas.
@grondo: I can build flux with -O0 -g and have @jameshcorbett generate another corefile. That way, we should be able to look at the full state before getting into Jansson. If worse comes to worst, we may need to build our own Jansson in full debug mode and look into Jansson's state.
@chu11: Let me know if you have any insight about this. Otherwise, I will go ahead and try this.
Unlikely that json_get_alloc_funcs is where the segv is really occurring. This function just returns the internally stored malloc and free (allowing users to override the allocator used by jansson):
void json_get_alloc_funcs(json_malloc_t *malloc_fn, json_free_t *free_fn)
{
    if (malloc_fn)
        *malloc_fn = do_malloc;
    if (free_fn)
        *free_fn = do_free;
}
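(Aside: since the corruption seems to be hitting jansson-owned objects, one debugging aid worth considering, not something anyone has done here, is jansson's public json_set_alloc_funcs() hook: install wrapper allocators that record the allocation size and poison the block on free, so a use-after-free reads back an obvious pattern instead of random garbage. A minimal sketch; the debug_malloc/debug_free names are hypothetical:)

#include <jansson.h>
#include <stdlib.h>
#include <string.h>

/* Stash the allocation size in an 8-byte header so the free wrapper can
   poison the whole block before releasing it. */
static void *debug_malloc (size_t size)
{
    char *block = malloc (size + 8);
    if (!block)
        return NULL;
    memcpy (block, &size, sizeof (size));
    return block + 8;
}

static void debug_free (void *ptr)
{
    size_t size;
    char *block;
    if (!ptr)
        return;
    block = (char *)ptr - 8;
    memcpy (&size, block, sizeof (size));
    memset (block, 0xdb, size + 8);  /* stale json_t fields then read back as 0xdb patterns */
    free (block);
}

/* Call once at startup, before any json objects are created:
   json_set_alloc_funcs (debug_malloc, debug_free); */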
The other thing we need to look at is the memory footprint of the system when this happens. This was with a large app, and depending on the input deck used, out of memory is a possibility.
@jameshcorbett: do you know what the memory usage is on a node? Typing top on one of the allocated nodes should give you some info.
I can build flux with (-O0 -g) and have @jameshcorbett generate another corefile. That way, we should be able to look at the full state before getting into Jansson.
Great if the problem is reproducible!
@grondo: somehow I can't locate the unpack() function in Jansson...
somehow I can't locate the unpack() function in Jansson...
In 2.8 it is in src/pack_unpack.c
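(For anyone following along, the unpack() in the backtrace is the internal worker behind jansson's format-string API. Below is a minimal standalone illustration of the kind of call the kvs treeobj code makes; the "ver"/"type"/"data" keys are my recollection of the RFC 11 treeobj layout, so treat them as illustrative rather than authoritative:)

#include <jansson.h>
#include <stdio.h>

int main (void)
{
    /* Build a treeobj-like object (keys assumed: ver/type/data) */
    json_t *obj = json_pack ("{s:i s:s s:o}",
                             "ver", 1,
                             "type", "dirref",
                             "data", json_pack ("[s]", "sha1-aaaa"));
    int ver;
    const char *type;
    json_t *data;

    /* The trailing '!' rejects any keys not listed in the format string */
    if (json_unpack (obj, "{s:i s:s s:o !}",
                     "ver", &ver, "type", &type, "data", &data) < 0) {
        fprintf (stderr, "unpack failed\n");
        return 1;
    }
    printf ("ver=%d type=%s\n", ver, type);
    json_decref (obj);
    return 0;
}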
@jameshcorbett: could you use the following version to see if this also crashes? If so, could you "give" the corefile to me again?
/usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg
Sure @dongahn, do you want me to do anything with top? Insert it into the script that launches flux?
FYI @dongahn, RedHat has a new project, debuginfod, which might be timely. If it works as advertised, we could have one debuginfo server per network zone with all debuginfos readily available:
https://developers.redhat.com/blog/2019/10/14/introducing-debuginfod-the-elfutils-debuginfo-server/
I'll see if we have any hope of getting this set up on site. We may want to talk to Ben about STAT and TV support for remote debuginfos.
@dongahn the ensemble has been running for thirty minutes now, which is usually long enough for it to finish or crash. I was running ten simulations, five at a time. Of the first five running, two crashed with what I believe were internal errors. The other three ran to completion, but towards the end of the simulation's log file there was the line "flux-job: flux_job_event_watch_get: State not recoverable". The UQP launched jobs 6, 7, and 8 to replace the finished jobs, but those seem to be running indefinitely with no output.
When I ran the last ensemble (the ensemble that generated that core file I gave you) all the log files instead had the line "flux-job: flux_reactor_run: Success" towards the bottom, shortly before the log file cut off. I hope that helps in some way.
the ensemble has been running for thirty minutes now, which is usually long enough for it to finish or crash.
This is an unoptimized build of Flux and I wonder if that slowed things down further.
You also mentioned that running this ensemble under Flux was pretty slow compared to the old method. We suspected that the issue could be something wrong with binding/affinity. How does the performance compare between Flux and your old way on this particular ensemble?
Of the first five running, two crashed with what I believe were internal errors
These are internal errors of the application itself, correct?
The other three ran to completion, but towards the end of the simulation's log file there was the line "flux-job: flux_job_event_watch_get: State not recoverable".
@grondo: do you know when such an output will be printed out? Is this typically harmless?
The UQP launched jobs 6, 7, and 8 to replace the finished jobs, but those seem to be running indefinitely with no output.
Hmmm. This is bad... I wonder if this is just a follow-up problem from the cause of "flux-job: flux_job_event_watch_get: State not recoverable".
When I ran the last ensemble (the ensemble that generated that core file I gave you) all the log files instead had the line "flux-job: flux_reactor_run: Success" towards the bottom, shortly before the log file cut off. I hope that helps in some way.
This is helpful. I am pretty baffled because the only difference between the version you had success with and this version is the permission patch @grondo added: https://github.com/flux-framework/flux-core/pull/2468
I am pretty baffled because the only difference between the version you had success with and this version is the permission patch @grondo added: #2468
@jameshcorbett: Let's try one more thing which should be very useful given that you had a previous success.
Could you try flux from in-tree build directory?
/usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/src/cmd/flux
This was essentially how you ran your Flux before #2468. If this still gives you the errors, I will build one more Flux version with exactly the same configuration as before the #2468 patch.
This is an unoptimized build of Flux and I wonder if that slowed things down further.
You also mentioned that running this ensemble under Flux was pretty slow compared to the old method. We suspected that the issue could be something wrong with binding/affinity. How does the performance compare between Flux and your old way on this particular ensemble?
Actually, the issue I had where Flux was running slowly was a completely separate issue. That only came up when I started to create nested flux instances. Otherwise, I haven't noticed Flux being any slower than srun.
These are internal errors of the application itself, correct?
Yeah, they pop up every once in a while even when I don't use flux at all, and just srun the application.
@dongahn I started to run the in-tree flux; it should be done in half an hour.
flux-job: flux_job_event_watch_get: State not recoverable.
@grondo: do you know when such an output will be printed out? Is this typically harmless?
No, AFAIK, we should never see ENOTRECOVERABLE errors. Maybe more information would be in the flux dmesg output, but all around it seems like something bad happened to this instance.
@dongahn the in-tree build flux worked. However, I did notice that whenever the application crashed (for what I still believe are internal reasons) flux mini run didn't return, so the UQP was never able to mark them as complete. Three applications crashed, so the UQP was left running the rest of the applications just two at a time (instead of five at a time).
@jameshcorbett, the 0.13+ version of flux has a deficiency where it will wait for all tasks to exit before considering the job complete (and flux-mini run exits). If only one task in your job was crashing, this might be what you are experiencing.
Sounds like we need to address this issue sooner rather than later if that is the case.
If all your tasks were exited at the time of the crash, then there is an unknown bug we'll have to investigate further.
@grondo I think that's it. While running the application with sixteen tasks under both srun and flux mini run, here's the tail of the log file for each:
Srun:
0: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
0: slurmstepd: error: *** STEP 2943859.2 ON rztopaz609 CANCELLED AT 2019-10-29T12:50:09 ***
srun: error: rztopaz609: tasks 0-15: Killed
Flux:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
PMI2_Abort: (15) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 11
PMI2_Abort: (1) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
PMI2_Abort: (11) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 11
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
PMI2_Abort: (7) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
PMI2_Abort: (2) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 9
PMI2_Abort: (9) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 9
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
PMI2_Abort: (5) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
PMI2_Abort: (0) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
PMI2_Abort: (3) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
(hang)
In this particular case, it is strange to me that the rest of the MPI ranks don't abort, though.
In this case I think we could wire our PMI to generate a job exception on abort, which would terminate the job. However, if a task died for some other reason we'd still have the problem described above.
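(A rough sketch of what "generate a job exception on abort" could look like. This assumes the abort handler has a flux_t handle and the jobid available and uses flux_job_raise() from libjob with a fatal severity of 0; the pmi_abort_cb name and the surrounding plumbing are hypothetical, not the actual shell code:)

#include <flux/core.h>

/* Hypothetical hook invoked when a task calls PMI2_Abort */
static void pmi_abort_cb (flux_t *h, flux_jobid_t id, const char *msg)
{
    flux_future_t *f;

    /* Raise a fatal (severity 0) exception so the whole job is killed
       instead of waiting for the remaining tasks to exit on their own. */
    if (!(f = flux_job_raise (h, id, "exec", 0, msg)))
        flux_log_error (h, "flux_job_raise");
    flux_future_destroy (f);
}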
Sorry about the trouble! We'll work on improving it.
Oh, I see. I just expected the MPI_Abort call would have signaled all other MPI ranks to abort too, and that then Flux would have marked the job as complete.
I'm just happy that Flux isn't mysteriously crashing any more.
I'm just happy that Flux isn't mysteriously crashing any more.
Glad this worked, but we really need to track down this issue further while you keep making progress using the in-tree flux.
I think some of flux's dependencies are missing when you use the installed flux. I have no idea yet how to track this down though.
I could be wrong but I think it is up to PMI to do something with the abort (though maybe some MPIs are wired up to be able to do something global with an MPI_Abort)
I really wish we understood what was crashing Flux in your reproducer. That issue is concerning!
Let me know if there's anything I can do to help. I'll re-run the simulations with the various flux versions just to confirm the results.
Yes. If you can run the installed version a couple more times to see if this can produce a core file (full debug), this should be extremely helpful. Thanks James!
You mean /usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg? Will do.
Yes! Thanks.
@dongahn I just gave you two new core files. I'm not sure why the first debug run didn't crash, because these two new runs I tried overnight crashed around fifteen minutes in. But anyway, I hope they help.
@jameshcorbett: oh great. I will take a look right away.
Seems Flux crashes at the same spot:
#0 0x00002aaaaaf5e8b7 in unpack () from /lib64/libjansson.so.4
#1 0x00002aaaaaf5f743 in json_vunpack_ex () from /lib64/libjansson.so.4
#2 0x00002aaaaaf5f93b in json_unpack () from /lib64/libjansson.so.4
#3 0x00002aaac0941683 in treeobj_peek (obj=0x2aaac91d9610, typep=0x2aaac0d89588, datap=0x0) at treeobj.c:57
#4 0x00002aaac0941993 in treeobj_get_type (obj=0x2aaac91d9610) at treeobj.c:134
#5 0x00002aaac0941ae8 in treeobj_is_dirref (obj=0x2aaac91d9610) at treeobj.c:165
#6 0x00002aaac093a488 in lookup (lh=0x2aaac8044260) at lookup.c:1057
#7 0x00002aaac093338a in lookup_common (h=0x2aaac8001220, mh=0x2aaac800db00, msg=0x2aaaf39df7e0, arg=0x2aaac800b890,
replay_cb=0x2aaac093373c <lookup_plus_request_cb>, stall=0x2aaac0d89793) at kvs.c:1388
#8 0x00002aaac0933786 in lookup_plus_request_cb (h=0x2aaac8001220, mh=0x2aaac800db00, msg=0x2aaaf39df7e0, arg=0x2aaac800b890) at kvs.c:1503
#9 0x00002aaac09384fb in wait_runone (w=0x2aaac9edd130) at waitqueue.c:173
#10 0x00002aaac09385ab in wait_runqueue (q=0x2aaac9829220) at waitqueue.c:201
#11 0x00002aaac0937610 in cache_entry_set_raw (entry=0x2aaaf24e6f80, data=0x2aaacb69cb58, len=286) at cache.c:179
#12 0x00002aaac0931483 in content_load_completion (f=0x2aaac9829240, arg=0x2aaac800b890) at kvs.c:531
#13 0x00002aaaaacf79a3 in check_cb (r=0x2aaac8009a50, w=0x2aaac9829040, revents=0, arg=0x2aaac9829240) at future.c:796
#14 0x00002aaaaacebfbf in check_cb (loop=0x2aaac800a540, cw=0x2aaac9829068, revents=32768) at reactor.c:853
#15 0x00002aaaaad28d93 in ev_invoke_pending (loop=0x2aaac800a540) at ev.c:3322
#16 0x00002aaaaad29c85 in ev_run (loop=0x2aaac800a540, flags=0) at ev.c:3726
#17 0x00002aaaaacea717 in flux_reactor_run (r=0x2aaac8009a50, flags=0) at reactor.c:126
#18 0x00002aaac093716a in mod_main (h=0x2aaac8001220, argc=0, argv=0x2aaac800b3b0) at kvs.c:2984
#19 0x000000000040bf76 in module_thread (arg=0x67cf30) at module.c:162
#20 0x00002aaaab689ea5 in start_thread (arg=0x2aaac0d8a700) at pthread_create.c:307
#21 0x00002aaaac2fd8cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
The json_t * is first passed into treeobj_is_dirref, and its type field is 4056313424 and its refcount is 46913661567584. Both seem pretty large, which makes me believe the json object being passed to json_unpack is invalid.
#5 0x00002aaac0941ae8 in treeobj_is_dirref (obj=0x2aaac91d9610) at treeobj.c:165
165 const char *type = treeobj_get_type (obj);
(gdb) print obj
$1 = (const json_t *) 0x2aaac91d9610
(gdb) print *obj
$2 = {type = 4056313424, refcount = 46913661567584}
I think we need to get @chu11 or @garlick to take a look. I don't know the kvs well enough to efficiently determine where to look next.
This could also be memory corruption, and since all modules run in the same address space the corruption could come from anywhere... :-(
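(Given that the garbage type value 4056313424 is far outside jansson's json_type enum, one cheap piece of temporary instrumentation, purely a debugging sketch and not something in the tree, would be a sanity check in treeobj_peek() before the object is handed to json_unpack(), so the first bad caller aborts with a clear message:)

#include <jansson.h>
#include <assert.h>

/* Debug-only: a valid json_type lies in [JSON_OBJECT, JSON_NULL].
   A freed or corrupted json_t typically shows a huge value instead. */
static inline void treeobj_sanity_check (const json_t *obj)
{
    assert (obj != NULL);
    assert (json_typeof (obj) <= JSON_NULL);
}

One caveat: a huge refcount on its own is not conclusive, since jansson's singleton true/false/null objects carry a refcount of (size_t)-1, but a type outside the enum range is always bogus.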
I looked at the other corefile. The type and refcount have different values from the previous corefile. Likely this object is corrupted.
#3 0x00002aaac0941683 in treeobj_peek (obj=0x2aaac8e9b8d0, typep=0x2aaac0d89588, datap=0x0) at treeobj.c:57
57 if (!obj || json_unpack ((json_t *)obj, "{s:i s:s s:o !}",
(gdb) print obj
$1 = (const json_t *) 0x2aaac8e9b8d0
(gdb) print *obj
$2 = {type = 828467315, refcount = 7161066662214840674}
I gave core files to @chu11 and @garlick on quartz:
quartz1916{dahn}24: give -l
achu has been given:
345 MB Nov 01 10:25 rztopaz290-flux-broker-0-22454.core
363 MB Nov 01 10:25 rztopaz484-flux-broker-0-48886.core
2 file(s)
garlick has been given:
345 MB Nov 01 10:25 rztopaz290-flux-broker-0-22454.core
363 MB Nov 01 10:25 rztopaz484-flux-broker-0-48886.core
2 file(s)
You have given a total of 4 file(s)
I wonder if there is any way to reproduce under valgrind or asan.
So, the debugging plan so far: once @chu11 and @garlick have taken a look at this, if they have no clue, maybe we should use memory checkers to see if they can spot the root cause.
Because Flux is being launched with srun, we should try memcheck_all and TotalView's MemoryScape.
@grondo: Yes, memcheck_all is valgrind.
The installed vs in-tree angle is interesting. @dongahn, have we tried just running a simple workload under the installed version to see if this issue reproduces?
I have not but we could.
I gave core files to @chu11 and @garlick on quartz:
Can you point me to which flux-broker binary was run?
file on that core should give you the path.
Now I'm in my office: it is /collab/usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg/libexec/flux/cmd/flux-broker
Next time you can get this info by typing file corefile.
Assuming the json_t object is corrupted.
(gdb) p *(json_t *)0x2aaac91d9610
$1 = {type = 4056313424, refcount = 46913661567584}
This object is stored in the internal kvs lookup handle, which looks ok
(gdb) p *(lookup_t *)0x2aaac8044260
$2 = {cache = 0x2aaac8009b10, krm = 0x2aaac800b570, current_epoch = 390, ns_name = 0x0, root_ref = 0x2aaaca76cda0 "sha1-4671d4ec87a7c5d61cbc7b111e42c30b90e08899", root_seq = 2293, root_ref_set_by_user = true, path = 0x2aaaca0ba270 "output", h = 0x2aaac8001220, rolemask = 1, userid = 31193, flags = 260, aux = 0x0, val = 0x0, valref_missing_refs = 0x2aaac91d9610, missing_ref = 0x0, missing_namespace = 0x0, errnum = 0, aux_errnum = 0, levels = 0x2aaac9f059d0, wdirent = 0x2aaac91d9610, state = LOOKUP_STATE_VALUE}
and some structs inside this appear to be ok
(gdb) p *(kvsroot_mgr_t *)0x2aaac800b570
$3 = {roothash = 0x2aaac800ccf0, removelist = 0x2aaac800cde0, iterating_roots = false, h = 0x0, arg = 0x2aaac800b890}
(gdb) p *(kvs_ctx_t *)0x2aaac800b890
$4 = {cache = 0x2aaac8009b10, krm = 0x2aaac800b570, faults = 49153, h = 0x2aaac8001220, rank = 0, epoch = 390, prep_w = 0x2aaac8009ac0, idle_w = 0x2aaac800b940, check_w = 0x2aaac800b8f0, transaction_merge = 1, events_init = true, hash_name = 0x2aaac800bf20 "sha1", seq = 25228}
So I've been concentrating on whether it's possible just this one json_t is corrupted by use-after-free or something like that. The json_t that's corrupted is the wdirent field in the lookup handle, and note the key the user looked up is "output", which is probably from "guest.output" (maybe this isn't relevant). This is also coming from a replay of a lookup; it wasn't in the content store to begin with (again possibly not relevant). (Edit: one other aside, this is a lookup from a kvs-watch b/c we're going through lookup-plus. Possibly not relevant, but interesting.)
The "wdirent" is a convenience pointer to a json_t stored in a data structure inside the levels zlist. Unfortunately I can't walk the "levels" zlist_t b/c debug symbols for czmq aren't on quartz. Anyone know the easiest way to go about debugging with those symbols? In the past on test systems, I would install the debug rpms.
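(To make the use-after-free theory concrete, here is a tiny standalone example, hypothetical code rather than anything from flux, of how a borrowed "convenience pointer" like wdirent can end up dangling if whatever owns the json_t drops its last reference while the pointer is still held:)

#include <jansson.h>

int main (void)
{
    json_t *dir = json_pack ("{s:{s:s}}", "output", "type", "dirref");

    /* json_object_get() returns a borrowed reference; no incref is taken */
    json_t *wdirent = json_object_get (dir, "output");

    json_decref (dir);   /* last reference dropped; dir and its members are freed */

    /* wdirent now dangles; reading ->type here is exactly the kind of access
       that shows up as a garbage value like 4056313424 in the corefile */
    return json_typeof (wdirent);
}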
Oh yeah, since this is a lookup replay, wondering if it's possible to get the broker log of this particular failure. Make sure there's nothing specific to that that shows up in the logs.
Since the broker crashes, I doubt there are logs available from this corefile, but if we have a reproducer we could have @jameshcorbett re-run, this time capturing broker logs to a logfile with the following options to flux-start:
srun ... flux start -o -Slog-filename=issue2500.log,-Slog-forward-level=7 [existing options]...
Unfortunately I can't walk the "levels" zlist_t b/c debug symbols for czmq aren't on quartz. Anyone know the easiest way to go about debugging with those symbols? In the past on test systems, I would install the debug rpms.
Even if you have the debug rpms, your mileage may vary, because zlist is built optimized and in many cases the variables can be optimized out for debugging.
Should I go ahead and run that, @grondo ?
@jameshcorbett integrated Flux into the UQP and ran it w/ a large LLNL application. An ensemble ran for a while (~10 mins) and failed because a Flux broker took SIGSEGV and crashed. The backtrace is:
The core file size is 435MB.