cloudius-systems / osv

OSv, a new operating system for the cloud.
osv.io
Other
4.09k stars 602 forks source link

waitqueue disarm() sometimes misbehaves #1285

Closed wkozaczuk closed 9 months ago

wkozaczuk commented 9 months ago

When testing the java http server when running on OSv with Linux dynamic linker, I occasionally (1 out of 3 times) would see this crash in the sched::wait_object<waitqueue>::arm():

(gdb) bt
#0  0x000000004030d492 in processor::cli_hlt () at arch/x64/processor.hh:247
#1  arch::halt_no_interrupts () at arch/x64/arch.hh:61
#2  osv::halt () at arch/x64/power.cc:29
#3  0x00000000402427e0 in abort (fmt=fmt@entry=0x405d87d3 "Aborted\n") at runtime.cc:145
#4  0x00000000402029b1 in abort () at runtime.cc:106
#5  0x0000000040205288 in mmu::vm_sigsegv (ef=0x40000371e088, addr=0) at core/mmu.cc:1426
#6  mmu::vm_sigsegv (ef=0x40000371e088, addr=0) at core/mmu.cc:1420
#7  mmu::vm_fault (addr=0, addr@entry=8, ef=ef@entry=0x40000371e088) at core/mmu.cc:1456
#8  0x0000000040306b5c in page_fault (ef=0x40000371e088) at arch/x64/mmu.cc:42
#9  <signal handler called>
#10 0x00000000403a18c1 in sched::wait_object<waitqueue>::arm (this=this@entry=0x400003733a40) at core/waitqueue.cc:24
#11 0x00000000403679b7 in sched::arm<sched::wait_object<waitqueue>>(sched::wait_object<waitqueue>&) (first=...) at include/osv/sched.hh:1324
#12 sched::arm<sched::wait_object<sched::timer>, sched::wait_object<waitqueue> > (first=...) at include/osv/sched.hh:1325
#13 sched::thread::do_wait_for<lockfree::mutex, sched::wait_object<sched::timer>, sched::wait_object<waitqueue> > (mtx=...) at include/osv/sched.hh:1369
#14 sched::thread::wait_for<sched::timer&, waitqueue&> (mtx=...) at include/osv/sched.hh:1395
#15 futex (uaddr=uaddr@entry=0x2000840230b8, op=op@entry=137, val=val@entry=0, timeout=timeout@entry=0x20009fdd9b20, uaddr2=uaddr2@entry=0x0, val3=val3@entry=4294967295) at linux.cc:110
#16 0x00000000403692e1 in syscall (number=number@entry=202) at linux.cc:751
#17 0x000000004036c18b in syscall_wrapper (number=202, p1=35186586824888, p2=137, p3=0, p4=35187054189344, p5=0, p6=4294967295) at linux.cc:921
#18 <signal handler called>
#19 __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x20009fdd9b20, op=137, expected=0, futex_word=0x2000840230b8) at futex-internal.c:57
#20 __futex_abstimed_wait_common (futex_word=futex_word@entry=0x2000840230b8, expected=expected@entry=0, clockid=clockid@entry=1, abstime=abstime@entry=0x20009fdd9b20, 
    private=private@entry=0, cancel=cancel@entry=true) at futex-internal.c:87
#21 0x000020007ec87d7f in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x2000840230b8, expected=expected@entry=0, clockid=clockid@entry=1, 
    abstime=abstime@entry=0x20009fdd9b20, private=private@entry=0) at futex-internal.c:139
#22 0x000020007ec8a804 in __pthread_cond_wait_common (abstime=0x20009fdd9b20, clockid=1, mutex=0x200084023068, cond=0x200084023090) at pthread_cond_wait.c:503
#23 ___pthread_cond_timedwait64 (cond=0x200084023090, mutex=0x200084023068, abstime=0x20009fdd9b20) at pthread_cond_wait.c:643
#24 0x000020007fa88ca3 in os::PlatformMonitor::wait(long) ()
#25 0x000020007fa34172 in Monitor::wait(long) ()
#26 0x000020007f3fc52d in CompileQueue::get() ()
#27 0x000020007f3ff5ac in CompileBroker::compiler_thread_loop() ()
#28 0x000020007fcdb567 in JavaThread::thread_main_inner() ()
#29 0x000020007fcdeefe in Thread::call_run() ()
#30 0x000020007fa7d879 in thread_native_entry(Thread*) ()
#31 0x000020007ec8b12d in start_thread (arg=<optimized out>) at pthread_create.c:442
#32 0x000020007ed0cbc0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) frame 10
#10 0x00000000403a18c1 in sched::wait_object<waitqueue>::arm (this=this@entry=0x400003733a40) at core/waitqueue.cc:24
24          fifo.newest->next = &_wr;
(gdb) p _wr
$1 = {<waiter> = {t = std::atomic<sched::thread *> = { 0x400003719040 }}, next = 0x0}
(gdb) p fifo
$2 = (struct {...} &) @0x6000048b8b30: {oldest = 0x400003761a80, newest = 0x0}
(gdb) p fifo.newest
$3 = (wait_record *) 0x0
(gdb) frame 15
#15 futex (uaddr=uaddr@entry=0x2000840230b8, op=op@entry=137, val=val@entry=0, timeout=timeout@entry=0x20009fdd9b20, uaddr2=uaddr2@entry=0x0, val3=val3@entry=4294967295) at linux.cc:110
110                     sched::thread::wait_for(queues_mutex, tmr, q);
(gdb) p q
$4 = (waitqueue &) @0x6000048b8b30: {_waiters_fifo = {oldest = 0x400003761a80, newest = 0x0}}

Initially, I thought the arm() method is buggy:

void wait_object<waitqueue>::arm()
{
    auto& fifo = _wq._waiters_fifo;
    if (!fifo.oldest) {
        fifo.oldest = &_wr;
    } else {
        fifo.newest->next = &_wr;
    }
    fifo.newest = &_wr;
}

and changing it to:

void wait_object<waitqueue>::arm()
{
    auto& fifo = _wq._waiters_fifo;
    if (!fifo.oldest) {
        fifo.oldest = &_wr;
    } else if (fifo.newest) {
        fifo.newest->next = &_wr;
    }
    fifo.newest = &_wr;
}

would fix it but instead it led to another crash in disarm() this time:

OSv v0.57.0-109-gcf0f7526
eth0: 172.18.0.2
Booted up in 180.79 ms
Cmdline: /lib64/ld-linux-x86-64.so.2 /usr/lib/jvm/java/bin/java -cp /java-httpserver HttpServerApp
syscall(): unimplemented system call 334
syscall(): unimplemented system call 439
Listening on port 8000 ...

page fault outside application, addr: 0x0000000000000000
[registers]
RIP: 0x00000000403a1908 <sched::wait_object<waitqueue>::disarm()+40>
RFL: 0x0000000000010212  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x0000000000000142  RBX: 0x0000400003619040  RCX: 0x0000000000000142  RDX: 0x0000400003633a80
RSI: 0x00006000031a1db0  RDI: 0x0000400003633a40  RBP: 0x0000400003633af0  R8:  0x0000400000024150
R9:  0x0000400000024150  R10: 0x0000400000935190  R11: 0x000040000361e190  R12: 0x00006000031a1db0
R13: 0x0000400000024150  R14: 0x0000400000935190  R15: 0x000040000361e190  RSP: 0x00004000036339f8
Aborted

[backtrace]
0x0000000040205287 <???+1075860103>
0x0000000040306b5b <page_fault+139>
0x0000000040305892 <???+1076910226>
0x00000000403692e0 <syscall+5984>
0x000000004036c18a <syscall_wrapper+122>
0x000000004032c681 <???+1077069441>
0x000020007ee87d15 <???+2129165589>
qemu failed.

addr2line -e build/release/loader.elf 0x00000000403a1908
/home/wkozaczuk/projects/osv-true-master/core/waitqueue.cc:47

see disarm() below:

 29 void wait_object<waitqueue>::disarm()
 30 {
 31     auto& fifo = _wq._waiters_fifo;
 32     if (_wr.woken()) {
 33         return;
 34     }
 35     // wr is still in the linked list, so remove it:
 36     wait_record** pnext = &fifo.oldest;
 37     wait_record* newest = nullptr;
 38     while (*pnext) {
 39         if (&_wr == *pnext) {
 40             *pnext = _wr.next;
 41             if (!*pnext || !(*pnext)->next) {
 42                 fifo.newest = newest;
 43             }
 44             break;
 45         }
 46         newest = *pnext;
 47         pnext = &(*pnext)->next;
 48     }
 49 }

Then I realized that the original crash showed this state of fifo:

{_waiters_fifo = {oldest = 0x400003761a80, newest = 0x0}}

and this is wrong - if newest is 0 then oldest should also be 0 if I understand how this linked list should work. So something corrupts it. What?

Here is my theory: the 2 other places that modify fifo - void waitqueue::wake_one(mutex& mtx) and void waitqueue::wake_all(mutex& mtx) seem to work fine. But the disarm() is buggy I think.

Imagine this scenario: the waitqueue has two wait_records and oldest points to the older one and newest to the newer one and the older one next points to the newer one. Then the disarm() is called with _wr equal to the fifo.oldest so the loop ends up doing this:

The correct version should look like this I think:

void wait_object<waitqueue>::disarm()
{
    auto& fifo = _wq._waiters_fifo;
    if (_wr.woken()) {
        return;
    }
    // wr is still in the linked list, so remove it:
    wait_record** pnext = &fifo.oldest;
    wait_record* older = nullptr;
    while (*pnext) {
        if (&_wr == *pnext) {
            *pnext = _wr.next;
            if (!*pnext) {
                fifo.newest = older;
            }
            if (!older) {
                fifo.oldest = *pnext;
            }
            break;
        }
        older = *pnext;
        pnext = &(*pnext)->next;
    }
}

After applying this patch I could not see these crashes. But who knows I may be missing something else and/or my reasoning is wrong here.

wkozaczuk commented 9 months ago

Could it possibly be related to #1024 and/or #951?

wkozaczuk commented 9 months ago

Also, I think it should not be difficult to create unit tests to validate all this provided we understand how exactly this linked list in waitqueue should work. The unit tests would not even need to do any thread synchronization logic, just create an empty waitqueue and wait_objects and only exercise arm() and disarm() (the woken() should return false = just keep non-null t in wait_records) and assert on waitqueues newest and oldest and possibly walk next just to validate things work correctly.

nyh commented 9 months ago

Wow, good catch, I think you are right, the loop in disarm() indeed seems wrong when _wr is last (newest) link in the chain. I haven't looked at this code in many years (@avikivity wrote it 9 years ago), so maybe I'm missing something. When you have a final version of your proposed patch, I'll try to review it. Please note that in the beginning, pnext points to fifo.oldest, so "pnext = _wr.next" sets fifo.oldest, so your code `fifo.oldest = pnext;` should be redundant, I don't know if it matters or not (as I said I didn't fully review your patch).

Do you have a theory how this bug might have caused https://github.com/cloudius-systems/osv/issues/1024 or https://github.com/cloudius-systems/osv/issues/951? Maybe it can somehow allow the same object to be woken twice?

wkozaczuk commented 9 months ago

I think you are right - this part is redundant:

            if (!older) {
                fifo.oldest = *pnext;
            }

No, the loop is wrong when _wr is the oldest in the chain and there are 2 wait records in the waitqueue.

I could not figure out what the intention was behind the 2nd part of this statement if (!*pnext || !(*pnext)->next)? Maybe @avikivity still remembers after these 9 years :-). It looks like dropping the 2nd part and making it if (!*pnext) like in my patch fixes this bug.

Anyway, I will try to write those unit tests to validate and prove my theory.

avikivity commented 9 months ago

I think you are right - this part is redundant:

            if (!older) {
                fifo.oldest = *pnext;
            }

No, the loop is wrong when _wr is the oldest in the chain and there are 2 wait records in the waitqueue.

I could not figure out what the intention was behind the 2nd part of this statement if (!*pnext || !(*pnext)->next)? Maybe @avikivity still remembers after these 9 years :-). It looks like dropping the 2nd part and making it if (!*pnext) like in my patch fixes this bug.

I absolutely don't remember, and you shouldn't trust the memory of someone who can't code a singly linked list either.