cloudius-systems / osv

OSv, a new operating system for the cloud.
osv.io

Futex crashes under heavy load #951

Open nyh opened 6 years ago

nyh commented 6 years ago

This issue was discovered by @benoit-canet already in February 2017, and recently resurfaced by @wkozaczuk. Benoit reported that when running a Golang HTTP server under load, he saw crashes in the futex code.

It is worth noting why we discovered this problem only in Go code. As I explained in issue #853, the futex code was very lightly used by general Linux applications, but could be much more heavily used in Go - in Benoit's benchmark it was used as much as 10,000 times per second. And it's possible that with such heavy use, more bugs get exposed.

I believe the bug is as follows: The FUTEX_WAIT code (see futex() in linux.cc) does this:

waitqueue &q = queues[uaddr];
sched::thread::wait_for(queues_mutex, tmr, q);

So, while we wait on this queue, the queues_mutex is released, and a FUTEX_WAKE can run in parallel (in fact, that's the whole point...). When FUTEX_WAKE happens, it will find this queue, wake_one() on it, and if the queue becomes empty, it deletes the queue object.
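
For reference, the matching FUTEX_WAKE path in linux.cc (it appears as the removed lines in the diff further down) is roughly:

    i->second.wake_one(queues_mutex);
    waken++;
    ...
    if (i->second.empty()) {
        queues.erase(i);   // destroys the waitqueue object
    }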

The bug in this is that waking the waiting thread doesn't guarantee that it will run first (obviously). If it doesn't, the wait_for() of FUTEX_WAIT is woken up with the object q already destroyed. And that is a bug, because when wait_for() wakes up (see sched.hh) it calls the disarm() method on the waitqueue. And that will crash if done after the waitqueue object was already destroyed.
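
A sketch of the losing interleaving (illustration only, not code from the tree):

    // Waiter (FUTEX_WAIT)                       Waker (FUTEX_WAKE)
    //
    // waitqueue &q = queues[uaddr];
    // wait_for(queues_mutex, tmr, q);
    //   -> releases queues_mutex, sleeps
    //                                            takes queues_mutex
    //                                            q.wake_one(queues_mutex)
    //                                            q now empty -> queues.erase(uaddr)
    //                                            (the waitqueue object is destroyed)
    // wait_for() wakes up and calls q.disarm()
    //   -> use-after-free on the destroyed queue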

@benoit-canet already sent a patch for this bug, where he used a shared_ptr instead of the waitqueue object directly: FUTEX_WAKE will destroy the shared_ptr, but the actual waitqueue will not be destroyed until the FUTEX_WAIT code, which holds a copy of it, also drops its copy after returning from the wait.
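
Benoit's actual patch isn't quoted in this issue; a minimal sketch of the shared_ptr idea (the exact names and details here are my assumption, not his code) would look something like:

    // Sketch only: map futex addresses to reference-counted queues
    // (needs <memory> and <unordered_map>):
    std::unordered_map<int*, std::shared_ptr<waitqueue>> queues;

    // FUTEX_WAIT side: take a local copy of the shared_ptr before
    // sleeping, so the waitqueue outlives a concurrent erase():
    auto it = queues.find(uaddr);
    if (it == queues.end()) {
        it = queues.emplace(uaddr, std::make_shared<waitqueue>()).first;
    }
    auto q = it->second;                 // holds a reference across the wait
    sched::thread::wait_for(queues_mutex, tmr, *q);
    // Even if FUTEX_WAKE erased the map entry while we slept, *q is
    // still alive here, so the disarm() inside wait_for() is safe.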

I think there may be a more efficient solution to this bug (how about the waiter removing the empty queue, not the waker?), but anyway the performance of this futex implementation sucks (see #853), so adding a shared_ptr will not make a noticeable difference, and we can consider Benoit's patch.

nyh commented 6 years ago

Unfortunately, I am not able to reproduce this bug. I wonder if this trivial patch, moving the erasing of the unused queue to the waiter instead of the waker, solves this bug in a simpler way than Benoit's patch with the extra reference counting?

diff --git a/linux.cc b/linux.cc
index 25d28ee0..bf2d99ab 100644
--- a/linux.cc
+++ b/linux.cc
@@ -77,6 +77,9 @@ int futex(int *uaddr, int op, int val, const struct timespec *timeout,
                     tmr.set(std::chrono::seconds(timeout->tv_sec) +
                             std::chrono::nanoseconds(timeout->tv_nsec));
                     sched::thread::wait_for(queues_mutex, tmr, q);
+                    if (q.empty()) {
+                        queues.erase(uaddr);
+                    }
                     // FIXME: testing if tmr was expired isn't quite right -
                     // we could have had both a wakeup and timer expiration
                     // racing. It would be more correct to check if we were
@@ -87,6 +90,9 @@ int futex(int *uaddr, int op, int val, const struct timespec *timeout,
                     }
                 } else {
                     q.wait(queues_mutex);
+                    if (q.empty()) {
+                        queues.erase(uaddr);
+                    }
                 }
                 return 0;
             } else {
@@ -108,9 +114,6 @@ int futex(int *uaddr, int op, int val, const struct timespec *timeout,
                     i->second.wake_one(queues_mutex);
                     waken++;
                 }
-                if(i->second.empty()) {
-                    queues.erase(i);
-                }
                 return waken;
             }
         }
wkozaczuk commented 6 years ago

I could not reproduce it either, using my httpserver test with ab stressing it with as many as 1000 concurrent requests (with Benoit's patch reverted). My golang stress tests also continued to work with your patch.

I am guessing your change should not hurt, since the theoretical scenario you are describing may indeed happen.

nyh commented 6 years ago

The more I look into the code, the less I understand why what I thought was a bug (the waker, instead of the waiter, deleting the wait queue) is really a bug...

Unlike what I said above, it seems that wait_for() on a wait_queue alone does not actually need the queue to still be alive when waking up. The waiter supplies a "wait record" to the queue and the wait is on that record; the waker is the one who removes this wait record from the wait queue. So the waiter wakes up and only needs to check its own wait record - not the wait queue, which isn't touched at all. The waiter only needs to touch the queue after being woken up if some other reason (e.g., a timeout) caused the wakeup, in which case the waiter (in the disarm() code) needs to remove its own wait record from the queue.

So the next guess would be that the bug happens during a race between timeout and wakeup, but even then I don't understand how it can happen: in the existing code, wait_queue::wake() will only delete the queue after having woken the wait_record, and in wait_object<waitqueue>::disarm(), if the wait_record is already woken, it returns immediately and does not touch the queue at all. There is one suspicious part: in waitqueue.cc, disarm() does:

    auto& fifo = _wq._waiters_fifo;
    if (_wr.woken()) {
        return;
    }
    ...

This is in what seems to be the "wrong" order - it should have checked woken() first - but even this I don't think is a problem, because the code does not really read _wq here; it only takes a copy of its address, as far as I can tell?
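
For what it's worth, the defensive reordering would be trivial (a sketch against the snippet above, not a claim that it is actually needed):

    // Check the wait record before touching the queue at all:
    if (_wr.woken()) {
        return;
    }
    auto& fifo = _wq._waiters_fifo;
    ...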

Benoit said he reproduced the bug with "wrk" run against an HTTP server in Go. @benoit-canet, can you perhaps try to recall more information on how you reproduced it? Thanks.

wkozaczuk commented 3 years ago

@nyh your last comment suggests you no longer think your original theory is correct. Given that we cannot reproduce this issue (I still cannot) and that it does not describe an exact recipe for reproducing it, shall we close it?

nyh commented 3 years ago

Good question. The bug is probably real, and it will be useful for someone who encounters it in the future to be able to search for past guesses about it, but I agree that it's bad form that this issue mixes the issue description with a lot of wild and/or incorrect guesses :-( If this issue really bothers you, you can close it, but if not - I suggest we keep it. Maybe we should create a new tag for issues that we can't reproduce, to make it clearer that this is probably a real issue but we have no idea how to proceed with it.