cloudius-systems / osv


Hung httpserver while load is high on UP #502

Open · nyh opened this issue 10 years ago

nyh commented 10 years ago

Vlad reported the following problem, which is potentially very serious.

On a single vCPU, under very high load generated by cassandra-stress with 400 client threads, "top.py" almost always hangs after a few seconds.

If one does "curl http://...:8000/os/threads" (the HTTP request behind top.py), we see a large part of the response (the response is around 80 KB), but then curl hangs in the middle of the response.

So far this could easily have been an httpserver bug, but the really frightening part is that as soon as the load generation is killed, the response completes and curl returns! In other words, it "smells" as if at some point the scheduler stops running the httpserver thread whenever it has any contenders. If this is indeed the case, it is a serious bug in the scheduler. Could the httpserver thread somehow have reached an infinite runtime?

We need to do more investigation to figure out the circumstances and cause of this bug.

nyh commented 10 years ago

I am able to reproduce this problem by building Cassandra with make modules=cassandra and then running

sudo scripts/run.py -c1 -nvV -bbridge4 -m4G

And on the client machine, do

cassandra-stress -d 10.0.4.176 -o INSERT
cassandra-stress -t1000 -d 10.0.4.176 -o READ

and in parallel try an HTTP request:

curl http://10.0.4.176:8000/os/threads

I no longer believe this is a scheduler problem. When httpserver "hangs" on a request (it doesn't always hang, just sometimes), it is not in the "queued" state failing to get the CPU. Rather, it is in the "waiting" state, blocked in an epoll_wait() which never returns until the load goes down. (When the load generation is stopped, it takes curl several seconds to start making progress again, and then it quickly finishes.)
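
For context, here is a minimal sketch of the pattern the blocked thread is in (an illustration only, not OSv's actual httpserver code): epoll_wait() called with an infinite timeout sleeps until an event is delivered on one of the registered file descriptors, so a wakeup that never arrives leaves the thread in the "waiting" state indefinitely. The timeout_ms=-1 in the backtrace below corresponds to this "wait forever" case.

#include <sys/epoll.h>
#include <cstdio>

// Illustration only: register one fd and block until it becomes readable.
void wait_for_readable(int fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = {};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

    struct epoll_event events[128];
    // A timeout of -1 means "wait forever" (timeout_ms=-1 in the backtrace);
    // if the wakeup never arrives, the thread stays in the "waiting" state.
    int n = epoll_wait(epfd, events, 128, -1);
    std::printf("epoll_wait returned %d event(s)\n", n);
}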

The gdb stack trace on the waiting httpserver threads shows this:

#7  0x00000000003fb061 in epoll_wait (epfd=<optimized out>, events=0x2000002feed0, maxevents=128, timeout_ms=-1) at /data/nyh/osv/core/epoll.cc:337
#8  0x0000100000c720c8 in ?? ()
#9  0xffffa0001b932f40 in ?? ()
#10 0x000000000000330d in ?? ()
#11 0x00002000002fef40 in ?? ()
#12 0x0000000000238484 in deallocate (this=<optimized out>, __p=0x2000002ff540) at /usr/include/c++/4.8.3/ext/new_allocator.h:110
#13 _M_deallocate (this=<optimized out>, __n=<optimized out>, __p=0x2000002ff540) at /usr/include/c++/4.8.3/bits/stl_vector.h:174
#14 ~_Vector_base (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/include/c++/4.8.3/bits/stl_vector.h:160
#15 ~vector (this=<synthetic pointer>, __in_chrg=<optimized out>) at /usr/include/c++/4.8.3/bits/stl_vector.h:416
#16 kern_sendit (s=<optimized out>, mp=0x2000002ff5d0, flags=-28672, control=0x2000002ff800, bytes=0xa99c19) at /data/nyh/osv/bsd/sys/kern/uipc_syscalls.cc:527

Line uipc_syscalls.cc:527 is the last line of kern_sendit, returning from the function with an error. Apparently, returning from this function in the error case causes some C++ vector to be destroyed, which in turn leads to a call to epoll_wait() that hangs. I still don't know the details of why this is happening.
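
As a side note on the ~vector frames (a simplified sketch, not the actual kern_sendit code): a std::vector that is a local variable is destroyed when its enclosing function returns, including when it returns with an error, which is why destructor and deallocate frames can appear in a backtrace at the function's last line.

#include <vector>

// Hypothetical stand-in for a function, like kern_sendit, that keeps a
// local std::vector on its stack frame.
int send_like(bool fail)
{
    std::vector<char> local_buf(16);
    if (fail)
        return -1;   // ~vector() (and its deallocate) runs here, at the
                     // return, so the destructor frames show up at the
                     // function's closing line in a backtrace.
    return 0;
}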

tgrabiec commented 10 years ago

@nyh There are some unresolved addresses between the epoll_wait frame and deallocate, some of them really weird, so I'm not sure we can trust this backtrace. Can you try with osv syms?

nyh commented 10 years ago

This is with "osv syms". You're right that this backtrace looks suspicious. I tried again and hit the hang in epoll_wait() once more, but this time without kern_sendit() being implicated. I still don't know what to make of this.

nyh commented 10 years ago

Unfortunately, this is not a new regression. I went back to commit 48881fa3578a0483832b17d78d7c20145d44618b from two and a half months ago, and the same bug was already there. It is hard for me to go back much further in time, because the httpserver we're testing here was only added around then.

nyh commented 7 years ago

This may be related to issue #824, which I solved a few days ago and which was due to a bug in commit ad94d796143698de1538b1ac2922b23488586153, committed on February 2, 2014, seven months before this bug was filed. The reproducer in #824 involved a sleeping read, not a sleeping epoll, so I'm not entirely sure the same bug is responsible. It's time to run the cassandra reproducer I had (see above) again and see whether the bug is gone.
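
For reference, a minimal sketch of the two reproducer shapes being compared (illustration only, not code from either issue; the fds are hypothetical): the #824 reproducer slept inside a blocking read(), while this issue sleeps inside epoll_wait(). Both park the thread until a wakeup is delivered, so a missed-wakeup bug could plausibly hit either path.

#include <sys/epoll.h>
#include <unistd.h>

// Shape of the #824 reproducer: the thread sleeps inside a blocking read().
ssize_t sleeping_read(int sock_fd, char* buf, size_t len)
{
    return read(sock_fd, buf, len);           // sleeps until data arrives (or error/EOF)
}

// Shape of this issue: the thread sleeps inside epoll_wait() instead.
int sleeping_epoll(int epfd)
{
    struct epoll_event events[1];
    return epoll_wait(epfd, events, 1, -1);   // sleeps until an event is delivered
}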