cloudius-systems / osv

OSv, a new operating system for the cloud.

hang (deadlock) on ZFS unmount during shutdown #1007

Open · nyh opened this issue 5 years ago

nyh commented 5 years ago

I ran the "gccgo-example" application (scripts/build image=gccgo-example; scripts/run.py) in a long loop, and after 203 successful runs, one run hung during exit (after having run the example correctly).

The relevant threads are:

(gdb) osv info threads
...
 253 (0xffff8000051a3040) /hello.so       cpu1 status::waiting ?? at core/condvar.cc:43 vruntime  1.62882e-20
 254 (0xffff80000542a040) >/hello.so      cpu1 status::terminated ?? at libc/time.cc:38 vruntime  4.46936e-22
 255 (0xffff800005f9c040) >/hello.so      cpu1 status::terminated ?? at core/condvar.cc:43 vruntime  1.74049e-21
...
(gdb) osv thread 253
(gdb) bt
...
#7  condvar::wait(lockfree::mutex*, sched::timer*) () at core/condvar.cc:43
#8  0x00000000003d4d0f in condvar_wait () at core/condvar.cc:171
#9  0x00000000002f4b04 in rrw_enter_write (rrl=0xffff9000015d2120)
    at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:197
#10 rrw_enter (rrl=0xffff9000015d2120, rw=<optimized out>, tag=<optimized out>)
    at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/rrwlock.c:210
#11 0x000000000032b6e6 in zfs_umount ()
    at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1691
#12 0x000000000042ccf5 in sys_umount2 () at fs/vfs/vfs_mount.cc:270
#13 0x0000000000428b1b in unmount_rootfs () at fs/vfs/main.cc:2352
#14 0x0000000000428ba7 in vfs_exit() () at fs/vfs/main.cc:2404
#15 0x0000000000417767 in osv::shutdown() () at core/shutdown.cc:36
#16 0x0000000000240323 in exit () at runtime.cc:406
#17 0x00001000026bf94f in runtime.main () at ../../../libgo/go/runtime/proc.go:247
#18 0x00001000026bca89 in runtime.kickoff () at ../../../libgo/go/runtime/proc.go:1167
#19 0x0000000000440e50 in setcontext () at libc/arch/x64/ucontext/setcontext.s:69
(gdb) osv thread 255
(gdb) bt
#8  0x00000000003d4d0f in condvar_wait () at core/condvar.cc:171
#9  0x000000000033b965 in zio_wait (zio=zio@entry=0xffffa00005dd6400)
    at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1288
#10 0x00000000002d1518 in dmu_buf_hold_array_by_dnode_sparse (dn=0xffff9000029a7000, 
    offset=offset@entry=44171264, length=length@entry=4096, read=read@entry=1, 
    tag=tag@entry=0x63ad18 <__func__.13132>, numbufsp=numbufsp@entry=0xffff800005fa0804, 
    dbpp=0xffff800005fa0808, flags=0, fail_sparse=1)
    at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:439
#11 0x00000000002d1823 in dmu_buf_hold_array_sparse (os=os@entry=0xffff9000021db000, 
    object=<optimized out>, offset=44171264, length=4096, read=read@entry=1, 
    tag=tag@entry=0x63ad18 <__func__.13132>, numbufsp=0xffff800005fa0804, dbpp=0xffff800005fa0808, 
    fail_sparce=1) at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:489
#12 0x00000000002d2976 in dmu_map_uio (os=os@entry=0xffff9000021db000, object=<optimized out>, 
    uio=uio@entry=0xffff800005fa0890, size=<optimized out>)
    at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:1007
#13 0x000000000032d3c6 in zfs_arc (vp=<optimized out>, fp=<optimized out>, uio=0xffff800005fa0890)
    at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:753
#14 0x000000000043132d in vfs_file::get_arcbuf (this=this@entry=0xffffa00005dd8280, 
    key=key@entry=0xffff800005fa0920, offset=<optimized out>) at fs/vfs/vfs_fops.cc:175
#15 0x00000000003d8c47 in pagecache::create_read_cached_page (key=..., fp=0xffffa00005dd8280)
    at core/pagecache.cc:470
#16 pagecache::get(vfs_file*, long, mmu::hw_ptep<0>, mmu::pt_element<0>, bool, bool) ()
    at core/pagecache.cc:470
#17 0x0000000000430f30 in vfs_file::map_page (this=<optimized out>, off=<optimized out>, ptep=..., 
    pte=..., write=<optimized out>, shared=<optimized out>) at fs/vfs/vfs_fops.cc:143
#18 0x00000000003453f1 in mmu::map_file_page_mmap::map (this=<optimized out>, offset=<optimized out>, 
    ptep=..., pte=..., write=<optimized out>) at core/mmu.cc:1095
#19 0x0000000000347d30 in mmu::populate<(mmu::account_opt)1>::page<0> (offset=<optimized out>, ptep=..., 
    this=0xffff800005fa0ba0) at arch/x64/arch-mmu.hh:23
#20 mmu::populate_small<(mmu::account_opt)1>::page<0> (ptep=..., offset=<optimized out>, 
    this=0xffff800005fa0ba0) at core/mmu.cc:551
#21 mmu::page<mmu::populate_small<(mmu::account_opt)1>, 0> (ptep=..., offset=<optimized out>, pops=...)
    at core/mmu.cc:312
#22 mmu::map_level<mmu::populate_small<(mmu::account_opt)1>, 1>::operator() (base_virt=35184401543168, 
    parent=..., this=<synthetic pointer>) at core/mmu.cc:446
#23 mmu::map_level<mmu::populate_small<(mmu::account_opt)1>, 2>::map_range<1> (this=<synthetic pointer>, 
    ptep=..., base_virt=35184401448960, slop=4096, page_mapper=..., size=4096, vcur=<optimized out>)
    at core/mmu.cc:400
#24 mmu::map_level<mmu::populate_small<(mmu::account_opt)1>, 2>::operator() (base_virt=35184401448960, 
    parent=..., this=<synthetic pointer>) at core/mmu.cc:450
#25 mmu::map_level<mmu::populate_small<(mmu::account_opt)1>, 3>::map_range<2> (this=<synthetic pointer>, 
    ptep=..., base_virt=35184372088832, slop=4096, page_mapper=..., size=4096, vcur=<optimized out>)
    at core/mmu.cc:400
#26 mmu::map_level<mmu::populate_small<(mmu::account_opt)1>, 3>::operator() (base_virt=35184372088832, 
    parent=..., this=<synthetic pointer>) at core/mmu.cc:450
#27 mmu::map_level<mmu::populate_small<(mmu::account_opt)1>, 4>::map_range<3> (this=0xffff800005fa0bc0, 
    ptep=..., base_virt=35184372088832, slop=4096, page_mapper=..., size=<optimized out>, 
    vcur=<optimized out>) at core/mmu.cc:400
#28 mmu::map_level<mmu::populate_small<(mmu::account_opt)1>, 4>::operator() (
    this=this@entry=0xffff800005fa0bc0, parent=..., base_virt=35184372088832, base_virt@entry=0)
    at core/mmu.cc:450
#29 0x000000000034a56b in mmu::map_range<mmu::populate_small<(mmu::account_opt)1> > (slop=4096, 
    page_mapper=..., size=<optimized out>, vstart=35184401543168, vma_start=<optimized out>)
    at include/osv/addr_range.hh:15
#30 mmu::operate_range<mmu::populate_small<(mmu::account_opt)1> > (size=<optimized out>, 
    start=0x200001c17000, vma_start=<optimized out>, mapper=...) at core/mmu.cc:807
#31 mmu::vma::operate_range<mmu::populate_small<(mmu::account_opt)1> > (size=1540804707, 
    addr=0x200001c17000, mapper=..., this=0xffffa00005dd8400) at core/mmu.cc:1413
#32 mmu::populate_vma<(mmu::account_opt)1> (vma=vma@entry=0xffffa00005dd8400, v=v@entry=0x200001c17000, 
    size=size@entry=4096, write=<optimized out>) at core/mmu.cc:1207
#33 0x0000000000343fcb in mmu::file_vma::fault(unsigned long, exception_frame*) ()
    at arch/x64/exceptions.hh:46
#34 0x0000000000341079 in mmu::vm_fault(unsigned long, exception_frame*) () at core/mmu.cc:1342
#35 0x0000000000399774 in page_fault () at arch/x64/mmu.cc:38
#36 <signal handler called>
#37 elf_initialize_syminfo (opd=0x0, sdata=0x2000003c52b8, data=0x2000009003e0, 
    error_callback=0x10000229bb40 <error_callback>, strtab_size=<optimized out>, strtab=<optimized out>, 
    symtab_size=3445536, symtab_data=0x200001c003d0 "", base_address=17592211210240, state=0x2000003c4000)
    at ../../../libbacktrace/elf.c:614
#38 elf_add (state=0x2000003c4000, filename=filename@entry=0x2000008ffe90 "/libgo.so.13", descriptor=3, 
    base_address=17592211210240, error_callback=0x10000229bb40 <error_callback>, data=0x2000009003e0, 
    fileline_fn=0x2000008ffe20, found_sym=0x2000008fff40, found_dwarf=0x2000008ffe1c, exe=0, debuginfo=0)
    at ../../../libbacktrace/elf.c:2956
#39 0x00001000027b14a8 in phdr_callback (info=0x2000008ffea0, size=<optimized out>, pdata=0x2000008fff50)
    at ../../../libbacktrace/elf.c:3270
#40 0x000000000044eab9 in <lambda(const elf::program::modules_list&)>::operator() (ml=..., 
    __closure=<synthetic pointer>) at libc/dlfcn.cc:118
#41 elf::program::with_modules<dl_iterate_phdr(int (*)(dl_phdr_info*, size_t, void*), void*)::<lambda(const elf::program::modules_list&)> > (f=..., this=0xffff900001154000) at include/osv/elf.hh:668
#42 dl_iterate_phdr () at libc/dlfcn.cc:99
#43 0x00001000027b16d1 in backtrace_initialize (state=state@entry=0x2000003c4000, 
    filename=filename@entry=0xffff9000012f5000 "/hello.so", descriptor=<optimized out>, 
    error_callback=error_callback@entry=0x10000229bb40 <error_callback>, data=data@entry=0x2000009003e0, 
    fileline_fn=fileline_fn@entry=0x2000008ffff8) at ../../../libbacktrace/elf.c:3313
#44 0x00001000027adb97 in fileline_initialize (state=state@entry=0x2000003c4000, 
    error_callback=error_callback@entry=0x10000229bb40 <error_callback>, data=data@entry=0x2000009003e0)
    at ../../../libbacktrace/fileline.c:143
#45 0x00001000027adca6 in backtrace_pcinfo (state=0x2000003c4000, pc=17592222333978, 
    callback=0x10000229b7e0 <callback>, error_callback=0x10000229bb40 <error_callback>, 
    data=0x2000009003e0) at ../../../libbacktrace/fileline.c:177
#46 0x00001000027ae286 in unwind (context=<optimized out>, vdata=0x200000900390)
    at ../../../libbacktrace/backtrace.c:91
#47 0x000010000320ff59 in _Unwind_Backtrace ()
#48 0x00001000027ae30d in backtrace_full (state=state@entry=0x2000003c4000, skip=skip@entry=0, 
    callback=callback@entry=0x10000229b7e0 <callback>, 
    error_callback=error_callback@entry=0x10000229bb40 <error_callback>, data=data@entry=0x2000009003e0)
    at ../../../libbacktrace/backtrace.c:127
#49 0x000010000229bc1b in runtime_callers (skip=<optimized out>, locbuf=<optimized out>, 
    m=<optimized out>, keep_thunks=keep_thunks@entry=false) at ../../../libgo/runtime/go-callers.c:207
#50 0x000010000269927e in runtime.callers (skip=skip@entry=4, locbuf=...)
    at ../../../libgo/go/runtime/traceback_gccgo.go:56
#51 0x00001000026cbd83 in runtime.mProf_Malloc (size=2688, p=<optimized out>)
    at ../../../libgo/go/runtime/mprof.go:342
#52 runtime.profilealloc (mp=0xc420016800, size=2688, x=0xc42006aa80)
    at ../../../libgo/go/runtime/malloc.go:899
#53 runtime.mallocgc (size=<optimized out>, typ=<optimized out>, needzero=needzero@entry=true)
    at ../../../libgo/go/runtime/malloc.go:815
#54 0x00001000026cd80f in runtime.newobject (typ=typ@entry=0x100002f56360 <runtime..runtime.g..d>)
    at ../../../libgo/go/runtime/malloc.go:873
#55 0x00001000026c978b in runtime.allocg () at ../../../libgo/go/runtime/stubs.go:362
#56 0x000010000229f360 in runtime.malg (allocatestack=allocatestack@entry=false, 
    signalstack=signalstack@entry=false, ret_stack=ret_stack@entry=0x200000900d58, 
    ret_stacksize=ret_stacksize@entry=0x200000900d60) at ../../../libgo/runtime/proc.c:730
#57 0x00001000026b8ca8 in runtime.allocm (_p_=_p_@entry=0xc420010000, fn=<optimized out>, 
    allocatestack=allocatestack@entry=false) at ../../../libgo/go/runtime/proc.go:1468
#58 0x00001000026bd8a7 in runtime.newm (fn=fn@entry=0x100002d53828 <runtime.mspinning..f>, 
    _p_=_p_@entry=0xc420010000) at ../../../libgo/go/runtime/proc.go:1769
#59 0x00001000026bde96 in runtime.startm (_p_=0xc420010000, _p_@entry=0x0, spinning=spinning@entry=true)
    at ../../../libgo/go/runtime/proc.go:1925
#60 0x00001000026be352 in runtime.wakep () at ../../../libgo/go/runtime/proc.go:2006
#61 0x00001000026c0320 in runtime.resetspinning () at ../../../libgo/go/runtime/proc.go:2376
#62 runtime.schedule () at ../../../libgo/go/runtime/proc.go:2485
#63 0x00001000026c8171 in runtime.mstart1 (dummy=dummy@entry=0) at ../../../libgo/go/runtime/proc.go:1198
#64 0x000010000229f08e in runtime_mstart (arg=<optimized out>) at ../../../libgo/runtime/proc.c:545
#65 0x000000000044b5c6 in pthread_private::pthread::<lambda()>::operator() (__closure=0xffffa00003d56300)
    at libc/pthread.cc:114
#66 std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/8/bits/std_function.h:297
#67 0x00000000003f1e67 in sched::thread_main_c (t=0xffff800005f9c040) at arch/x64/arch-switch.hh:268
#68 0x0000000000399573 in thread_main () at arch/x64/entry.S:113

I think the deadlock that the comment in core/shutdown.cc warns about has materialized: we killed thread 255 with unsafe_stop() while it was in the middle of a ZFS operation. Then, when thread 253 tried to shut down ZFS, it hung, waiting forever for that now-dead ZFS operation to complete.
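
For illustration, a minimal standalone C++ sketch of the suspected pattern (not OSv's actual rrwlock code; rrw_like_lock and its methods are made up). Thread 255 was presumably still holding the ZFS teardown lock as a reader (taken on the read/page-fault path) when it was terminated, so rrw_enter_write() in zfs_umount() waits on its condvar for a reader count that can never drop to zero:

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

// Toy reader/writer lock: the writer waits on a condvar until no readers remain.
struct rrw_like_lock {
    std::mutex m;
    std::condition_variable cv;
    int readers = 0;

    void enter_read()  { std::lock_guard<std::mutex> l(m); ++readers; }
    void exit_read()   { { std::lock_guard<std::mutex> l(m); --readers; } cv.notify_all(); }
    void enter_write() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [this] { return readers == 0; });   // wait for readers to drain
    }
};

int main() {
    rrw_like_lock rrl;

    // "Thread 255": takes a read hold (as ZFS_ENTER does on the read path) and
    // ends without ever releasing it -- modeling a thread killed by unsafe_stop().
    std::thread zfs_op([&] {
        rrl.enter_read();
        // killed here: exit_read() (i.e. ZFS_EXIT) never runs
    });
    zfs_op.join();

    // "Thread 253": zfs_umount() -> rrw_enter_write() now blocks forever.
    std::cout << "unmounting..." << std::endl;
    rrl.enter_write();   // hangs: readers never returns to zero
    std::cout << "never reached" << std::endl;
}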

I wonder if we could modify unsafe_stop() to stop (not terminate) a thread, inspect its PC to check whether it is inside OSv code, and if it is, resume the thread and try again later.
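
Purely as a sketch of that idea -- every helper below (stop_for_inspection(), saved_pc(), pc_in_kernel(), terminate_stopped(), resume(), brief_backoff()) is hypothetical and does not exist in OSv today -- the retry loop could look roughly like this:

// Hypothetical pseudocode only; these thread methods are invented for illustration.
void stop_when_outside_kernel(sched::thread& t)
{
    for (;;) {
        t.stop_for_inspection();          // pause the thread instead of terminating it
        if (!pc_in_kernel(t.saved_pc())) {
            t.terminate_stopped();        // safe: not in the middle of OSv/ZFS code
            return;
        }
        t.resume();                       // still inside the kernel: let it make progress,
        brief_backoff();                  // wait a little and try again later
    }
}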

wkozaczuk commented 5 years ago

I am not sure if this is related, but I see a similar issue when I run ffmpeg with httpserver, as reported in #1010.

It hangs as well with this stacktrace:

#0  sched::thread::switch_to (this=0xffff8000012c9040, this@entry=0xffff800001238040) at arch/x64/arch-switch.hh:106
#1  0x00000000003f9704 in sched::cpu::reschedule_from_interrupt (this=0xffff800000c95040, 
    called_from_yield=called_from_yield@entry=false, preempt_after=..., preempt_after@entry=...) at core/sched.cc:339
#2  0x00000000003f9bfc in sched::cpu::schedule () at include/osv/sched.hh:1309
#3  0x00000000003fa322 in sched::thread::wait (this=this@entry=0xffff800003d1c040) at core/sched.cc:1214
#4  0x00000000003dd80f in sched::thread::do_wait_until<sched::noninterruptible, sched::thread::dummy_lock, waiter::wait(sched::timer*) const::{lambda()#1}>(sched::thread::dummy_lock&, waiter::wait(sched::timer*) const::{lambda()#1}) (pred=..., 
    mtx=<synthetic pointer>...) at include/osv/sched.hh:938
#5  sched::thread::wait_until<waiter::wait(sched::timer*) const::{lambda()#1}>(waiter::wait(sched::timer*) const::{lambda()#1}) (pred=...) at include/osv/sched.hh:1076
#6  waiter::wait (tmr=0x0, this=0x2000002ff1b0) at include/osv/wait_record.hh:46
#7  condvar::wait (this=0xffffa00002235f30, user_mutex=0xffffa00002235f08, tmr=<optimized out>) at core/condvar.cc:43
#8  0x00000000003ddc67 in condvar_wait (condvar=condvar@entry=0xffffa00002235f30, 
    user_mutex=user_mutex@entry=0xffffa00002235f08, expiration=expiration@entry=0) at core/condvar.cc:171
#9  0x0000000000342895 in zio_wait (zio=0xffffa00002235c00) at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1288
#10 0x000000000033e4ac in zil_commit_writer (zilog=0xffff90000218c000)
    at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:1498
#11 zil_commit (zilog=zilog@entry=0xffff90000218c000, foid=foid@entry=0)
    at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:1563
#12 0x000000000033f520 in zil_commit (foid=0, zilog=0xffff90000218c000)
    at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:1778
#13 zil_close (zilog=0xffff90000218c000) at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:1778
#14 0x0000000000331f23 in zfsvfs_teardown (unmounting=true, zfsvfs=0xffff9000016e8000)
    at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1566
#15 zfs_umount (vfsp=0xffff8000023ef040, fflag=<optimized out>)
    at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1734
#16 0x0000000000438419 in sys_umount2 (path=path@entry=0x6a4b01 "/", flags=flags@entry=1) at fs/vfs/vfs_mount.cc:270
#17 0x0000000000433fd1 in unmount_rootfs () at fs/vfs/main.cc:2354
#18 0x0000000000434071 in vfs_exit () at fs/vfs/main.cc:2406
#19 0x00000000004224dd in osv::shutdown () at core/shutdown.cc:36
#20 0x00000000002406c8 in exit (status=<optimized out>) at runtime.cc:406
#21 0x0000100003d3f458 in ?? ()
#22 0x0000100003d2d9aa in ?? ()
#23 0x000000000042af4d in osv::application::run_main (this=0xffffa000032bea10) at /usr/include/c++/8/bits/stl_vector.h:805

There is no unsafe_stop() in this case, I think?

What is also interesting is that it never hangs on the first run right after building the image (so there are none of the PNG files the app creates yet, if that is what triggers it). But on the second run with the same image it hangs.

nyh commented 5 years ago

@wkozaczuk I think what you saw is the same problem. You can't diagnose a deadlock by looking at just one thread, as you did (see my original bug report for an example of how to view the other threads). In other words, you saw here the one thread stuck forever trying to unmount the disk as part of the shutdown, but what you are not seeing is another thread that was in the middle of some ZFS operation when the earlier unsafe_stop() killed it; that thread will never complete its ZFS operation, so the waiting thread will wait on its condition variable forever.
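
For example, after reproducing the hang you can attach gdb and repeat what I did above (the thread numbers are from my run; yours will differ):

(gdb) osv info threads      # list all threads; look for waiting/terminated ZFS threads
(gdb) osv thread 253        # switch to the thread stuck in zfs_umount
(gdb) bt
(gdb) osv thread 255        # switch to the terminated thread that was mid-ZFS-operation
(gdb) bt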