Here is the version info:
rztopaz572{dahn}38: /usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017/bin/flux --version
commands: 0.13.0-83-g30ccef9
libflux-core: 0.13.0-83-g30ccef9
Looking at the corefile using TotalView, it is the KVS module that's crashing.
TotalView's backtrace is slightly different in that the leaf function that got the segmentation violation is json_get_alloc_funcs():
Too bad we don't have the jansson debuginfo package installed. Not sure how we can investigate what might actually be in that treeobj json_t *, or whether obj is even valid memory. Maybe @chu11 will have some ideas.
@grondo: I can build flux with -O0 -g and have @jameshcorbett generate another corefile. That way, we should be able to look at the full state before getting into Jansson. If worse comes to worst, we may need to build our own Jansson in full debug mode and look into Jansson's state.
@chu11: Let me know if you have any insight about this. Otherwise, I will go ahead and try this.
Unlikely that json_get_alloc_funcs is where the segv is really occurring. This function just returns the internally stored malloc and free (allowing users to override the allocator used by jansson):
void json_get_alloc_funcs(json_malloc_t *malloc_fn, json_free_t *free_fn)
{
    if (malloc_fn)
        *malloc_fn = do_malloc;
    if (free_fn)
        *free_fn = do_free;
}
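(Aside: since the corruption seems to be hitting jansson-owned objects, one debugging aid worth considering, not something anyone has done here, is jansson's public json_set_alloc_funcs() hook: install wrapper allocators that record the allocation size and poison the block on free, so a use-after-free reads back an obvious pattern instead of random garbage. A minimal sketch; the debug_malloc/debug_free names are hypothetical:)

#include <jansson.h>
#include <stdlib.h>
#include <string.h>

/* Stash the allocation size in an 8-byte header so the free wrapper can
   poison the whole block before releasing it. */
static void *debug_malloc (size_t size)
{
    char *block = malloc (size + 8);
    if (!block)
        return NULL;
    memcpy (block, &size, sizeof (size));
    return block + 8;
}

static void debug_free (void *ptr)
{
    size_t size;
    char *block;
    if (!ptr)
        return;
    block = (char *)ptr - 8;
    memcpy (&size, block, sizeof (size));
    memset (block, 0xdb, size + 8);  /* stale json_t fields then read back as 0xdb patterns */
    free (block);
}

/* Call once at startup, before any json objects are created:
   json_set_alloc_funcs (debug_malloc, debug_free); */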
The other thing we need to look at is the memory footprint of the system when this happens. This was with a large app, and depending on the input deck used, out of memory is a possibility.
@jameshcorbett: do you know what the memory usage is on a node? Typing top on one of the allocated nodes should give you some info.
I can build flux with (-O0 -g) and have @jameshcorbett generate another corefile. That way, we should be able to look at the full state before getting into Jansson.
Great if the problem is reproducible!
@grondo: somehow I can't locate the unpack() function in Jansson...
somehow I can't locate the unpack() function in Jansson...
In 2.8 it is in src/pack_unpack.c
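(For anyone following along, the unpack() in the backtrace is the internal worker behind jansson's format-string API. Below is a minimal standalone illustration of the kind of call the kvs treeobj code makes; the "ver"/"type"/"data" keys are my recollection of the RFC 11 treeobj layout, so treat them as illustrative rather than authoritative:)

#include <jansson.h>
#include <stdio.h>

int main (void)
{
    /* Build a treeobj-like object (keys assumed: ver/type/data) */
    json_t *obj = json_pack ("{s:i s:s s:o}",
                             "ver", 1,
                             "type", "dirref",
                             "data", json_pack ("[s]", "sha1-aaaa"));
    int ver;
    const char *type;
    json_t *data;

    /* The trailing '!' rejects any keys not listed in the format string */
    if (json_unpack (obj, "{s:i s:s s:o !}",
                     "ver", &ver, "type", &type, "data", &data) < 0) {
        fprintf (stderr, "unpack failed\n");
        return 1;
    }
    printf ("ver=%d type=%s\n", ver, type);
    json_decref (obj);
    return 0;
}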
@jameshcorbett: could you use the following version to see if this also crashes? If so, could you "give" the corefile to me again?
/usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg
Sure @dongahn, do you want me to do anything with top? Insert it into the script that launches flux?
FYI @dongahn, RedHat has a new project, debuginfod, which might be timely. If it works as advertised, we could have one debuginfo server per network zone with all debuginfos readily available:
https://developers.redhat.com/blog/2019/10/14/introducing-debuginfod-the-elfutils-debuginfo-server/
I'll see if we have any hope of getting this set up on site. We may want to talk to Ben about STAT and TV support for remote debuginfos.
@dongahn the ensemble has been running for thirty minutes now, which is usually long enough for it to finish or crash. I was running ten simulations, five at a time. Of the first five running, two crashed with what I believe were internal errors. The other three ran to completion, but towards the end of the simulation's log file there was the line "flux-job: flux_job_event_watch_get: State not recoverable". The UQP launched jobs 6, 7, and 8 to replace the finished jobs, but those seem to be running indefinitely with no output.
When I ran the last ensemble (the ensemble that generated that core file I gave you) all the log files instead had the line "flux-job: flux_reactor_run: Success" towards the bottom, shortly before the log file cut off. I hope that helps in some way.
the ensemble has been running for thirty minutes now, which is usually long enough for it to finish or crash.
This is an unoptimized build of Flux and I wonder if that slowed things down further.
You also mentioned that running this ensemble under Flux was pretty slow compared to the old method. We suspected that the issue could be something wrong with binding/affinity. How does the performance compare between Flux and your old way on this particular ensemble?
Of the first five running, two crashed with what I believe were internal errors
These are internal errors of the application itself, correct?
The other three ran to completion, but towards the end of the simulation's log file there was the line "flux-job: flux_job_event_watch_get: State not recoverable".
@grondo: do you know when such an output will be printed out? Is this typically harmless?
The UQP launched jobs 6, 7, and 8 to replace the finished jobs, but those seem to be running indefinitely with no output.
Hmmm. This is bad... I wonder if this is just a follow-up problem from the cause of "flux-job: flux_job_event_watch_get: State not recoverable".
When I ran the last ensemble (the ensemble that generated that core file I gave you) all the log files instead had the line "flux-job: flux_reactor_run: Success" towards the bottom, shortly before the log file cut off. I hope that helps in some way.
This is helpful. I am pretty baffled because the only difference between the version you had success with and this version is the permission patch @grondo added: https://github.com/flux-framework/flux-core/pull/2468
I am pretty baffled because the only difference between the version you had success with and this version is the permission patch @grondo added: #2468
@jameshcorbett: Let's try one more thing which should be very useful given that you had a previous success.
Could you try flux from in-tree build directory?
/usr/global/tools/flux/toss_3_x86_64_ib/build/2019-10-17/flux-core/src/cmd/flux
This was essentially how you ran your Flux before #2468. If this still gives you the errors, I will build one more Flux version with exactly the same configuration as before the #2468 patch.
This is an unoptimized build of Flux and I wonder if that slowed things down further.
You also mentioned that running this ensemble under Flux was pretty slow compared to the old method. We suspected that the issue could be something wrong with binding/affinity. How does the performance compare between Flux and your old way on this particular ensemble?
Actually, the issue I had where Flux was running slowly was a completely separate issue. That only came up when I started to create nested flux instances. Otherwise, I haven't noticed Flux being any slower than srun.
These are internal errors of the application itself, correct?
Yeah, they pop up every once in a while even when I don't use flux at all, and just srun the application.
@dongahn I started to run the in-tree flux; it should be done in half an hour.
flux-job: flux_job_event_watch_get: State not recoverable.
@grondo: do you know when such an output will be printed out? Is this typically harmless?
No, AFAIK, we should never see ENOTRECOVERABLE errors. Maybe more information would be in the flux dmesg output, but all around it seems like something bad happened to this instance.
@dongahn the in-tree build flux worked. However, I did notice that whenever the application crashed (for what I still believe are internal reasons) flux mini run didn't return, so the UQP was never able to mark them as complete. Three applications crashed, so the UQP was left running the rest of the applications just two at a time (instead of five at a time).
@jameshcorbett, the 0.13+ version of flux has a deficiency where it will wait for all tasks to exit before considering the job complete (and flux-mini run exits). If only one task in your job was crashing, this might be what you are experiencing.
Sounds like we need to address this issue sooner rather than later if that is the case.
If all your tasks were exited at the time of the crash, then there is an unknown bug we'll have to investigate further.
@grondo I think that's it. While running the application with sixteen tasks under both srun and flux mini run, here's the tail of the log file for each:
Srun:
0: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
0: slurmstepd: error: *** STEP 2943859.2 ON rztopaz609 CANCELLED AT 2019-10-29T12:50:09 ***
srun: error: rztopaz609: tasks 0-15: Killed
Flux:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
PMI2_Abort: (15) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 11
PMI2_Abort: (1) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
PMI2_Abort: (11) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 11
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
PMI2_Abort: (7) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
PMI2_Abort: (2) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 9
PMI2_Abort: (9) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 9
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
PMI2_Abort: (5) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
PMI2_Abort: (0) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
PMI2_Abort: (3) application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
(hang)
In this particular case, it is strange to me that the rest of the MPI ranks don't abort, though.
In this case I think we could wire our PMI to generate a job exception on abort, which would terminate the job. However, if a task died for some other reason we'd still have the problem described above.
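(A rough sketch of what "generate a job exception on abort" could look like. This assumes the abort handler has a flux_t handle and the jobid available and uses flux_job_raise() from libjob with a fatal severity of 0; the pmi_abort_cb name and the surrounding plumbing are hypothetical, not the actual shell code:)

#include <flux/core.h>

/* Hypothetical hook invoked when a task calls PMI2_Abort */
static void pmi_abort_cb (flux_t *h, flux_jobid_t id, const char *msg)
{
    flux_future_t *f;

    /* Raise a fatal (severity 0) exception so the whole job is killed
       instead of waiting for the remaining tasks to exit on their own. */
    if (!(f = flux_job_raise (h, id, "exec", 0, msg)))
        flux_log_error (h, "flux_job_raise");
    flux_future_destroy (f);
}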
Sorry about the trouble! We'll work on improving it.
Oh, I see. I just expected the MPI_Abort call would have signaled all other MPI ranks to abort too, and that then Flux would have marked the job as complete.
I'm just happy that Flux isn't mysteriously crashing any more.
I'm just happy that Flux isn't mysteriously crashing any more.
Glad this worked, but we really need to track down this issue further while you keep making progress using the in-tree flux.
I think some of flux's dependencies are missing when you use the installed flux. I have no idea yet how to track this down though.
I could be wrong but I think it is up to PMI to do something with the abort (though maybe some MPIs are wired up to be able to do something global with an MPI_Abort)
I really wish we understood what was crashing Flux in your reproducer. That issue is concerning!
Let me know if there's anything I can do to help. I'll re-run the simulations with the various flux versions just to confirm the results.
Yes. If you can run the installed version a couple more times to see if this can produce a core file (full debug), this should be extremely helpful. Thanks James!
You mean /usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg? Will do.
Yes! Thanks.
@dongahn I just gave you two new core files. I'm not sure why the first debug run didn't crash, because these two new runs I tried overnight crashed around fifteen minutes in. But anyway, I hope they help.
@jameshcorbett: oh great. I will take a look right away.
Seems Flux crashes at the same spot:
#0 0x00002aaaaaf5e8b7 in unpack () from /lib64/libjansson.so.4
#1 0x00002aaaaaf5f743 in json_vunpack_ex () from /lib64/libjansson.so.4
#2 0x00002aaaaaf5f93b in json_unpack () from /lib64/libjansson.so.4
#3 0x00002aaac0941683 in treeobj_peek (obj=0x2aaac91d9610, typep=0x2aaac0d89588, datap=0x0) at treeobj.c:57
#4 0x00002aaac0941993 in treeobj_get_type (obj=0x2aaac91d9610) at treeobj.c:134
#5 0x00002aaac0941ae8 in treeobj_is_dirref (obj=0x2aaac91d9610) at treeobj.c:165
#6 0x00002aaac093a488 in lookup (lh=0x2aaac8044260) at lookup.c:1057
#7 0x00002aaac093338a in lookup_common (h=0x2aaac8001220, mh=0x2aaac800db00, msg=0x2aaaf39df7e0, arg=0x2aaac800b890,
replay_cb=0x2aaac093373c <lookup_plus_request_cb>, stall=0x2aaac0d89793) at kvs.c:1388
#8 0x00002aaac0933786 in lookup_plus_request_cb (h=0x2aaac8001220, mh=0x2aaac800db00, msg=0x2aaaf39df7e0, arg=0x2aaac800b890) at kvs.c:1503
#9 0x00002aaac09384fb in wait_runone (w=0x2aaac9edd130) at waitqueue.c:173
#10 0x00002aaac09385ab in wait_runqueue (q=0x2aaac9829220) at waitqueue.c:201
#11 0x00002aaac0937610 in cache_entry_set_raw (entry=0x2aaaf24e6f80, data=0x2aaacb69cb58, len=286) at cache.c:179
#12 0x00002aaac0931483 in content_load_completion (f=0x2aaac9829240, arg=0x2aaac800b890) at kvs.c:531
#13 0x00002aaaaacf79a3 in check_cb (r=0x2aaac8009a50, w=0x2aaac9829040, revents=0, arg=0x2aaac9829240) at future.c:796
#14 0x00002aaaaacebfbf in check_cb (loop=0x2aaac800a540, cw=0x2aaac9829068, revents=32768) at reactor.c:853
#15 0x00002aaaaad28d93 in ev_invoke_pending (loop=0x2aaac800a540) at ev.c:3322
#16 0x00002aaaaad29c85 in ev_run (loop=0x2aaac800a540, flags=0) at ev.c:3726
#17 0x00002aaaaacea717 in flux_reactor_run (r=0x2aaac8009a50, flags=0) at reactor.c:126
#18 0x00002aaac093716a in mod_main (h=0x2aaac8001220, argc=0, argv=0x2aaac800b3b0) at kvs.c:2984
#19 0x000000000040bf76 in module_thread (arg=0x67cf30) at module.c:162
#20 0x00002aaaab689ea5 in start_thread (arg=0x2aaac0d8a700) at pthread_create.c:307
#21 0x00002aaaac2fd8cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
The json_t * is first passed into treeobj_is_dirref, and its type field is 4056313424 and its refcount is 46913661567584. Both seem pretty large, which makes me believe the json object being passed to json_unpack is invalid.
#5 0x00002aaac0941ae8 in treeobj_is_dirref (obj=0x2aaac91d9610) at treeobj.c:165
165 const char *type = treeobj_get_type (obj);
(gdb) print obj
$1 = (const json_t *) 0x2aaac91d9610
(gdb) print *obj
$2 = {type = 4056313424, refcount = 46913661567584}
I think we need to get @chu11 or @garlick to take a look. I don't know the kvs well enough to efficiently determine where to look next.
This could also be memory corruption, and since all modules run in the same address space the corruption could come from anywhere... :-(
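(Given that the garbage type value 4056313424 is far outside jansson's json_type enum, one cheap piece of temporary instrumentation, purely a debugging sketch and not something in the tree, would be a sanity check in treeobj_peek() before the object is handed to json_unpack(), so the first bad caller aborts with a clear message:)

#include <jansson.h>
#include <assert.h>

/* Debug-only: a valid json_type lies in [JSON_OBJECT, JSON_NULL].
   A freed or corrupted json_t typically shows a huge value instead. */
static inline void treeobj_sanity_check (const json_t *obj)
{
    assert (obj != NULL);
    assert (json_typeof (obj) <= JSON_NULL);
}

One caveat: a huge refcount on its own is not conclusive, since jansson's singleton true/false/null objects carry a refcount of (size_t)-1, but a type outside the enum range is always bogus.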
I looked at the other corefile. The type and refcount have different values from the previous corefile. Likely this object is corrupted.
#3 0x00002aaac0941683 in treeobj_peek (obj=0x2aaac8e9b8d0, typep=0x2aaac0d89588, datap=0x0) at treeobj.c:57
57 if (!obj || json_unpack ((json_t *)obj, "{s:i s:s s:o !}",
(gdb) print obj
$1 = (const json_t *) 0x2aaac8e9b8d0
(gdb) print *obj
$2 = {type = 828467315, refcount = 7161066662214840674}
I gave core files to @chu11 and @garlick on quartz:
quartz1916{dahn}24: give -l
achu has been given:
345 MB Nov 01 10:25 rztopaz290-flux-broker-0-22454.core
363 MB Nov 01 10:25 rztopaz484-flux-broker-0-48886.core
2 file(s)
garlick has been given:
345 MB Nov 01 10:25 rztopaz290-flux-broker-0-22454.core
363 MB Nov 01 10:25 rztopaz484-flux-broker-0-48886.core
2 file(s)
You have given a total of 4 file(s)
I wonder if there is any way to reproduce under valgrind or asan.
So, the debugging plan so far: once @chu11 and @garlick have taken a look at this, if they have no clue, maybe we should use memory checkers to see if they can spot the root cause.
Because Flux is being launched with srun, we should try memcheck_all and TotalView's MemoryScape.
@grondo: Yes, memcheck_all is valgrind.
The installed vs in-tree angle is interesting. @dongahn, have we tried just running a simple workload under the installed version to see if this issue reproduces?
I have not but we could.
I gave core files to @chu11 and @garlick on quartz:
Can you point me to which flux-broker binary was run?
file on that core should give you the path.
Now I'm in my office: it is /collab/usr/global/tools/flux/toss_3_x86_64_ib/flux-0.13.x-20191017-dbg/libexec/flux/cmd/flux-broker
Next time you can get this info by typing file corefile.
Assuming the json_t object is corrupted.
(gdb) p *(json_t *)0x2aaac91d9610
$1 = {type = 4056313424, refcount = 46913661567584}
This object is stored in the internal kvs lookup handle, which looks ok
(gdb) p *(lookup_t *)0x2aaac8044260
$2 = {cache = 0x2aaac8009b10, krm = 0x2aaac800b570, current_epoch = 390, ns_name = 0x0, root_ref = 0x2aaaca76cda0 "sha1-4671d4ec87a7c5d61cbc7b111e42c30b90e08899", root_seq = 2293, root_ref_set_by_user = true, path = 0x2aaaca0ba270 "output", h = 0x2aaac8001220, rolemask = 1, userid = 31193, flags = 260, aux = 0x0, val = 0x0, valref_missing_refs = 0x2aaac91d9610, missing_ref = 0x0, missing_namespace = 0x0, errnum = 0, aux_errnum = 0, levels = 0x2aaac9f059d0, wdirent = 0x2aaac91d9610, state = LOOKUP_STATE_VALUE}
and some structs inside this appear to be ok
(gdb) p *(kvsroot_mgr_t *)0x2aaac800b570
$3 = {roothash = 0x2aaac800ccf0, removelist = 0x2aaac800cde0, iterating_roots = false, h = 0x0, arg = 0x2aaac800b890}
(gdb) p *(kvs_ctx_t *)0x2aaac800b890
$4 = {cache = 0x2aaac8009b10, krm = 0x2aaac800b570, faults = 49153, h = 0x2aaac8001220, rank = 0, epoch = 390, prep_w = 0x2aaac8009ac0, idle_w = 0x2aaac800b940, check_w = 0x2aaac800b8f0, transaction_merge = 1, events_init = true, hash_name = 0x2aaac800bf20 "sha1", seq = 25228}
So I've been concentrating on whether it's possible just this one json_t is corrupted by use-after-free or something like that. The json_t that's corrupted is the wdirent field in the lookup handle, and note the key the user looked up is "output", which is probably from "guest.output" (maybe this isn't relevant). This is also coming from a replay of a lookup; it wasn't in the content store to begin with (again possibly not relevant). (Edit: one other aside, this is a lookup from a kvs-watch b/c we're going through lookup-plus. Possibly not relevant, but interesting.)
The "wdirent" is a convenience pointer to a json_t stored in a data structure inside the levels zlist. Unfortunately I can't walk the "levels" zlist_t b/c debug symbols for czmq aren't on quartz. Anyone know the easiest way to go about debugging with those symbols? In the past on test systems, I would install the debug rpms.
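(To make the use-after-free theory concrete, here is a tiny standalone example, hypothetical code rather than anything from flux, of how a borrowed "convenience pointer" like wdirent can end up dangling if whatever owns the json_t drops its last reference while the pointer is still held:)

#include <jansson.h>

int main (void)
{
    json_t *dir = json_pack ("{s:{s:s}}", "output", "type", "dirref");

    /* json_object_get() returns a borrowed reference; no incref is taken */
    json_t *wdirent = json_object_get (dir, "output");

    json_decref (dir);   /* last reference dropped; dir and its members are freed */

    /* wdirent now dangles; reading ->type here is exactly the kind of access
       that shows up as a garbage value like 4056313424 in the corefile */
    return json_typeof (wdirent);
}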
Oh yeah, since this is a lookup replay, wondering if it's possible to get the broker log of this particular failure. Make sure there's nothing specific to that that shows up in the logs.
Since the broker crashes, I doubt there are logs available from this corefile, but if we have a reproducer we could have @jameshcorbett re-run, this time capturing broker logs to a logfile with the following options to flux-start:
srun ... flux start -o -Slog-filename=issue2500.log,-Slog-forward-level=7 [existing options]...
Unfortunately I can't walk the "levels" zlist_t b/c debug symbols for czmq aren't on quartz. Anyone know the easiest way to go about debugging with those symbols? In the past on test systems, I would install the debug rpms.
Even if you have the debug rpms, your mileage may vary, because zlist is built optimized and in many cases the variables can be optimized out for debugging.
Should I go ahead and run that, @grondo ?
@jameshcorbett integrated Flux into the UQP and ran it w/ a large LLNL application. An ensemble ran for a while (~10 mins) and failed because a Flux broker took SIGSEGV and crashed. The backtrace is:
The core file size is 435MB.