Open yuwang888 opened 6 years ago
does there have bug in SOCK_UNLOCK(so) and sodealloc? the mutex is zero
11 0x0000000000237646 in soabort (so=0xffffa00185a18400) at bsd/sys/kern/uipc_socket.cc:706
#12 0x0000000000237bc5 in soclose (so=0xffffa00185885a00) at bsd/sys/kern/uipc_socket.cc:656
#13 0x0000000000430269 in fdrop (fp=0xffffa001859fef00) at fs/vfs/kern_descrip.cc:211
#14 0x0000000000430470 in fdset (fd=fd@entry=189, fp=<optimized out>) at fs/vfs/kern_descrip.cc:119
#15 0x000000000042cdba in dup3 (oldfd=<optimized out>, newfd=189, flags=<optimized out>)
at fs/vfs/main.cc:1402
#16 0x000010000d00edb2 in closefd ()
#17 0x000020000061897b in ?? ()
#18 0x00002000205fe748 in ?? ()
#19 0x00002000205fe528 in ?? ()
#20 0x0000000000000000 in ?? ()
(gdb) p *(struct socket *)0xffffa00185a18400
$1 = {so_mtx = 0x0, so_freeing = 1, so_count = 0, so_type = 1, so_options = 4, so_linger = 0,
so_state = 8449, so_qstate = 0, so_pcb = 0x0, so_vnet = 0x0, so_proto = 0xb44150 <inetsw+208>,
so_head = 0x0, so_incomp = {tqh_first = 0x0, tqh_last = 0x0}, so_comp = {tqh_first = 0x0,
tqh_last = 0x0}, so_list = {tqe_next = 0x0, tqe_prev = 0xffffa00185885a50}, so_qlen = 0,
so_incqlen = 0, so_qlimit = 0, so_timeo = 0, so_error = 103, so_oobmark = 0, so_aiojobq = {
tqh_first = 0x0, tqh_last = 0xffffa00185a18488}, so_rcv = {sb_iolock = {_wq = {_waiters_fifo = {
oldest = 0x0, newest = 0x0}}, _owner = 0x0}, sb_state = 32, sb_mb = 0x0, sb_mbtail = 0x0,
sb_lastrecord = 0x0, sb_sndptr = 0x0, sb_sndptroff = 0, sb_cc = 0, sb_cc_wq = {_waiters_fifo = {
oldest = 0x0, newest = 0x0}}, sb_hiwat = 0, sb_mbcnt = 0, sb_mcnt = 0, sb_ccnt = 0, sb_mbmax = 0,
sb_ctl = 0, sb_lowat = 1, sb_timeo = 0, sb_flags = 2048, sb_upcall = 0x0, sb_upcallarg = 0x0},
so_snd = {sb_iolock = {_wq = {_waiters_fifo = {oldest = 0x0, newest = 0x0}}, _owner = 0x0},
sb_state = 16, sb_mb = 0x0, sb_mbtail = 0x0, sb_lastrecord = 0x0, sb_sndptr = 0x0, sb_sndptroff = 0,
sb_cc = 0, sb_cc_wq = {_waiters_fifo = {oldest = 0x0, newest = 0x0}}, sb_hiwat = 0, sb_mbcnt = 0,
sb_mcnt = 0, sb_ccnt = 0, sb_mbmax = 0, sb_ctl = 0, sb_lowat = 2048, sb_timeo = 0, sb_flags = 2048,
sb_upcall = 0x0, sb_upcallarg = 0x0}, so_gencnt = 46, so_emuldata = 0x0, so_accf = 0x0, so_fibnum = 0,
so_user_cookie = 0, so_nc = 0xffff9001859ed000, so_nc_busy = false, so_nc_wq = {_waiters_fifo = {
oldest = 0x0, newest = 0x0}}, fp = 0x0}
For future reference, please run in gdb "osv syms" to have the symbols in your own code (frames 16-18) also decoded, so you'll know what called "dup3" in your code. Refer to https://github.com/cloudius-systems/osv/wiki/Debugging-OSv which explains "osv syms" and other issues of debugging in OSv.
Anyway, this definitely looks like an OSv bug:
In frame 15, dup3() is closing an old file descriptor 189 which it found to be open. This ends up calling soclose() (frame 12) because the old file descriptor was a socket. soclose() notices (in frame 12) that this is a listening socket (listen() was called for it), and finds in so_comp a new connection which the kernel already established, but the application hasn't accept()ed yet, and wants to abort this connection (since after closing the listening socket, the application can never accept it). By calling (see bsd/sys/kern/uipc_socket.cc:655) soabort() on the new connection.
soabort(sp) starts by trying to lock new socket, using the socket's "mutex* so_mtx" filed. However, our new socket objects start with so_mtx set to a null pointer, and this is what we see in frame 10 (lock was called on a mutex in address 0). The mutex is only set in inpcb::inpcb() (which calls so->set_mutex()).
Here is where I stopped investigating, maybe you can help me continue - when is the inpcb supposed to be created, only later when the connection is accept()ed? Maybe we changed this order as part of the Van-Jacobson reorganization of the code, and forgot to fix this case?
In any case, I think as a quick workaround, you can modify soabort() in bsd/sys/kern/uipc_socket.cc:705, instead of SOCK_LOCK(so), do something like
if (SOCK_MTX(so))
SOCK_LOCK(so)
You'll also need to change sofree(), same file, from SOCK_LOCK_ASSERT(so) to
if (SOCK_MTX(so))
SOCK_LOCK_ASSERT(so)
and potentially in additional places in sofree() which locks and unlocks so.
Even if this works, the question becomes whether it is ok that the connected by un-accepted connection has a nullptr lock. This needs to be further investigated. I'm assuming that a connection that has never been accept()ed doesn't need any locking. Obviously, the application cannot be using it because it never accepted it. But could the kernel be using it internally at this point? It is really ok that we don't have a lock for it?
Maybe this discussion, of a double free in soabort(), is related? https://groups.google.com/d/msg/osv-dev/uOv1057u5Q0/xTBFvyucCQAJ
i try to porting shop center to osv, but it failed