nyh opened 5 years ago
I am not sure if it is related, but I see a similar issue when I run ffmpeg with httpserver, as reported in #1010.
It hangs as well, with this stack trace:
#0 sched::thread::switch_to (this=0xffff8000012c9040, this@entry=0xffff800001238040) at arch/x64/arch-switch.hh:106
#1 0x00000000003f9704 in sched::cpu::reschedule_from_interrupt (this=0xffff800000c95040,
called_from_yield=called_from_yield@entry=false, preempt_after=..., preempt_after@entry=...) at core/sched.cc:339
#2 0x00000000003f9bfc in sched::cpu::schedule () at include/osv/sched.hh:1309
#3 0x00000000003fa322 in sched::thread::wait (this=this@entry=0xffff800003d1c040) at core/sched.cc:1214
#4 0x00000000003dd80f in sched::thread::do_wait_until<sched::noninterruptible, sched::thread::dummy_lock, waiter::wait(sched::timer*) const::{lambda()#1}>(sched::thread::dummy_lock&, waiter::wait(sched::timer*) const::{lambda()#1}) (pred=...,
mtx=<synthetic pointer>...) at include/osv/sched.hh:938
#5 sched::thread::wait_until<waiter::wait(sched::timer*) const::{lambda()#1}>(waiter::wait(sched::timer*) const::{lambda()#1}) (pred=...) at include/osv/sched.hh:1076
#6 waiter::wait (tmr=0x0, this=0x2000002ff1b0) at include/osv/wait_record.hh:46
#7 condvar::wait (this=0xffffa00002235f30, user_mutex=0xffffa00002235f08, tmr=<optimized out>) at core/condvar.cc:43
#8 0x00000000003ddc67 in condvar_wait (condvar=condvar@entry=0xffffa00002235f30,
user_mutex=user_mutex@entry=0xffffa00002235f08, expiration=expiration@entry=0) at core/condvar.cc:171
#9 0x0000000000342895 in zio_wait (zio=0xffffa00002235c00) at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1288
#10 0x000000000033e4ac in zil_commit_writer (zilog=0xffff90000218c000)
at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:1498
#11 zil_commit (zilog=zilog@entry=0xffff90000218c000, foid=foid@entry=0)
at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:1563
#12 0x000000000033f520 in zil_commit (foid=0, zilog=0xffff90000218c000)
at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:1778
#13 zil_close (zilog=0xffff90000218c000) at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zil.c:1778
#14 0x0000000000331f23 in zfsvfs_teardown (unmounting=true, zfsvfs=0xffff9000016e8000)
at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1566
#15 zfs_umount (vfsp=0xffff8000023ef040, fflag=<optimized out>)
at bsd/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:1734
#16 0x0000000000438419 in sys_umount2 (path=path@entry=0x6a4b01 "/", flags=flags@entry=1) at fs/vfs/vfs_mount.cc:270
#17 0x0000000000433fd1 in unmount_rootfs () at fs/vfs/main.cc:2354
#18 0x0000000000434071 in vfs_exit () at fs/vfs/main.cc:2406
#19 0x00000000004224dd in osv::shutdown () at core/shutdown.cc:36
#20 0x00000000002406c8 in exit (status=<optimized out>) at runtime.cc:406
#21 0x0000100003d3f458 in ?? ()
#22 0x0000100003d2d9aa in ?? ()
#23 0x000000000042af4d in osv::application::run_main (this=0xffffa000032bea10) at /usr/include/c++/8/bits/stl_vector.h:805
I think there is no unsafe_stop() in this case?
What is also interesting is that it never hangs on the first run right after building the image (so there are no PNG files created by the app, if that is what triggers it). But on a second run with the same image, it hangs.
@wkozaczuk I think what you saw is the same problem. You can't inspect a deadlock by looking at just one thread, as you did (see my original bug report for an example of how to view the other threads). In other words, you saw here one thread stuck forever trying to unmount the disk as part of the shutdown; what you are not seeing is another thread which was in the middle of some ZFS operation, was stopped by the unsafe_stop() that we did earlier, and will never complete that ZFS operation, so the waiting thread will wait on its condition variable forever.
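For reference, inspecting the other threads looks roughly like this, assuming OSv's gdb helpers from scripts/loader.py are loaded (the thread number here is just an example, not from this run):

(gdb) connect              # attach to the running OSv guest's gdb stub
(gdb) osv info threads     # list all threads and where each one is waiting
(gdb) osv thread 255       # switch context to a suspect thread
(gdb) bt                   # its backtrace shows the operation it will never finish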
I ran the "gccgo-example" application (
scripts/build image=gccgo-example; scripts/run.py
) in a long loop, and after 203 successful runs, one run hung during exit (after having run the example correctly).The relevant threads are:
I think the deadlock warning in the comment in core/shutdown.cc materialized: we killed thread 255 with unsafe_stop() while it was in the middle of a ZFS operation. Then, when thread 253 tried to shut down ZFS, it hung, waiting forever for that now-dead ZFS operation to complete.
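To make the failure mode concrete, here is a minimal self-contained C++ sketch of the pattern (an illustration only, not OSv code): zio_wait() in the trace above is the moral equivalent of the cv.wait() below, and worker() stands for the ZFS operation that unsafe_stop() killed before it could signal completion.

#include <condition_variable>
#include <mutex>
#include <thread>

static std::mutex m;
static std::condition_variable cv;
static bool zio_done = false;

void worker() {
    // Simulate unsafe_stop(): the thread "dies" mid-operation,
    // before it can mark the zio done and wake the waiter.
    return;
    // -- never reached --
    // std::lock_guard<std::mutex> lk(m);
    // zio_done = true;
    // cv.notify_one();
}

int main() {
    std::thread t(worker);
    t.join();
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return zio_done; });  // hangs forever: nobody signals
}

Compiled and run, this program hangs forever in main() for the same reason thread 253 hangs in zio_wait() above.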
I wonder if we could modify unsafe_stop() to stop (not terminate) a thread, inspect its PC, and check whether it is inside OSv code; if it is, resume it and try again later.
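Very roughly, and with hypothetical names (suspend(), saved_pc(), resume() and is_kernel_pc() are not existing OSv APIs; only sched::thread and unsafe_stop() are real), the idea would look something like:

#include <osv/sched.hh>

// Hypothetical helper: would check whether pc lies inside the
// kernel's own text segment, i.e. inside OSv code such as ZFS.
static bool is_kernel_pc(const void* pc);

bool try_unsafe_stop(sched::thread* t)
{
    t->suspend();                  // stop the thread, but don't terminate it
    const void* pc = t->saved_pc(); // program counter where it was stopped
    if (is_kernel_pc(pc)) {
        t->resume();               // still inside OSv: let it finish,
        return false;              // and retry the stop later
    }
    t->unsafe_stop();              // it was in application code, so killing
    return true;                   // it cannot strand a kernel operation
}

The tricky part is that is_kernel_pc() alone may not be sufficient: a thread whose PC is in application code could still hold kernel state, so this is only a rough direction, not a vetted design.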