Golang httpserver app crashes occasionally on startup

wkozaczuk commented 1 year ago

I have been recently running various tests using capstan and the scripts described in this Wiki and I would encounter occasional (once in 3-5 runs) crashes of the older golang httpserver app (version 1.12.6) looking like this:

OSv v0.57.0-22-g4105f979
eth0: 192.168.122.15
Booted up in 171.26 ms
Cmdline: runscript /run/default;
runscript expand fn='/run/default' line=0 '/httpserver.so'
Aborted

[backtrace]
0x0000000040217d05 <???+1075936517>
0x00000000403cdfae <osv::handle_mmap_fault(unsigned long, int, exception_frame*)+30>
0x00000000402ab46a <mmu::vm_fault(unsigned long, exception_frame*)+298>
0x00000000402fb04f <page_fault+143>
0x00000000402f9f06 <???+1076862726>
0x0000100000196cd1 <???+1666257>
0x0000100000180653 <???+1574483>
0x0000100000186681 <???+1599105>
0x0000100000186600 <???+1598976>
0x00001000003a820b <???+3834379>
0x0000200000360f37 <???+3542839>

These crashes would seem to result in the same stack trace every time and when connected to gdb it would look like these:

(gdb) bt
#0  0x00000000403018d2 in processor::cli_hlt () at arch/x64/processor.hh:247
#1  arch::halt_no_interrupts () at arch/x64/arch.hh:61
#2  osv::halt () at arch/x64/power.cc:29
#3  0x0000000040238960 in abort (fmt=fmt@entry=0x405ab9c3 "Aborted\n") at runtime.cc:145
#4  0x0000000040202e91 in abort () at runtime.cc:106
#5  0x0000000040217d06 in osv::generate_signal (siginfo=..., ef=<optimized out>) at libc/signal.cc:130
#6  0x00000000403cdfdf in osv::handle_mmap_fault (addr=addr@entry=0, sig=sig@entry=11, ef=ef@entry=0x40007ff8e068) at libc/signal.cc:145
#7  0x00000000402ab46b in mmu::vm_sigsegv (ef=0x40007ff8e068, addr=0) at core/mmu.cc:1428
#8  mmu::vm_fault (addr=0, addr@entry=6, ef=ef@entry=0x40007ff8e068) at core/mmu.cc:1456
#9  0x00000000402fb050 in page_fault (ef=0x40007ff8e068) at arch/x64/mmu.cc:42
#10 <signal handler called>
#11 0x0000000000000006 in ?? ()
#12 0x0000100000196d4b in reflect.(*rtype).Field (t=0x200000360eb8, i=824633723136, ~r1=...) at /usr/lib/golang/src/reflect/type.go:918
#13 0x0000000000000000 in ?? ()
(gdb) frame 9
#9  0x00000000402fb050 in page_fault (ef=0x40007ff8e068) at arch/x64/mmu.cc:42
42              mmu::vm_fault(addr, ef);
(gdb) p *ef
$1 = {r15 = 70370891174072, r14 = 70370891173952, r13 = 70368744308800, r12 = 105553128632976, r11 = 518, r10 = 0, r9 = 1082399160, r8 = 70368764401728, rbp = 35184375631568, rdi = 0, 
  rsi = 35184375631544, rdx = 35184374584120, rcx = 6, rbx = 1048576, rax = 0, error_code = 16, rip = 6, cs = 8, rflags = 66054, rsp = 35184375631512, ss = 16}
(gdb) p addr
$2 = 6

When looking at other golang threads I would typically find one in the middle of creating the large syscall stack (see sched::thread::setup_large_syscall_stack() below):

(gdb) bt
#0  sched::thread::switch_to (this=0x4000008ed040, this@entry=0x40007fe09040) at arch/x64/arch-switch.hh:107
#1  0x0000000040365787 in sched::cpu::reschedule_from_interrupt (this=0x400000056040, called_from_yield=<optimized out>, preempt_after=...) at core/sched.cc:377
#2  0x0000000040365d35 in sched::cpu::schedule () at include/osv/sched.hh:1400
#3  sched::thread::wait (this=this@entry=0x400002071040) at core/sched.cc:1353
#4  0x0000000040343ba6 in sched::thread::do_wait_until<sched::noninterruptible, sched::thread::dummy_lock, waiter::wait() const::{lambda()#1}>(sched::thread::dummy_lock&, waiter::wait() con
st::{lambda()#1}) (mtx=<synthetic pointer>..., pred=...) at include/osv/sched.hh:1134
#5  sched::thread::wait_until<waiter::wait() const::{lambda()#1}>(waiter::wait() const::{lambda()#1}) (pred=...) at include/osv/sched.hh:1145
#6  waiter::wait (this=0x600001d63980) at include/osv/wait_record.hh:42
#7  lockfree::mutex::lock (this=this@entry=0x40841900 <memory::free_page_ranges_lock>) at core/lfmutex.cc:89
#8  0x00000000403521d7 in std::lock_guard<lockfree::mutex>::lock_guard (__m=..., this=<synthetic pointer>) at /usr/include/c++/11/bits/std_mutex.h:228
#9  lock_guard_for_with_lock<lockfree::mutex>::lock_guard_for_with_lock (lock=..., this=<synthetic pointer>) at include/osv/mutex.h:89
#10 memory::malloc_large (size=69632, alignment=16, block=true, contiguous=false) at core/mempool.cc:910
#11 0x00000000403531a2 in std_malloc (size=65536, alignment=16) at core/mempool.cc:1854
#12 0x00000000403535b7 in malloc (size=65536) at core/mempool.cc:2060
#13 0x00000000403627cc in sched::thread::setup_large_syscall_stack (this=0x400002071040) at arch/x64/arch-switch.hh:269

What is interesting these crashes would not happen if the golang-httpserver or golang-pie-httpserver app was built with a fairly recent go like 1.16.15 (or at least I could not reproduce it). Given one of the thread stack traces involved sched::thread::setup_large_syscall_stack() and the fact those crashes would never happen after the app started successfully, made me think something wrong sometimes happens when OSv handles the 1st syscall call on given thread and needs to allocate a large syscall stack to swap to from the tiny one which is handled by sched::thread::setup_large_syscall_stack(). I remembered that 5 years ago when I implemented this code I measured it would need around 600 bytes of tiny stack to execute this code: https://github.com/cloudius-systems/osv/blob/f4cb709d4380f4f4ead952e14e12cd09280c181d/arch/x64/arch-switch.hh#L21-L30 which meant that 1024 bytes should be enough. But maybe things have changed since and now in the worst-case scenario we need more than 1024. So initially I thought that bumping the tiny stack size to 1 full page (less than that but greater than 1024 would still result in 4K anyway) would solve this issue. And it seemed like it did (the crashes would disappear) but after digging more - I put a canary logic to see if the bottom of the tiny stack ever gets overwritten - I would never see that happen. In worst cases the sched::thread::setup_large_syscall_stack() would seem to use around 750 bytes of stack and assert for the canary would never trigger. So that would seem to contradict the theory about the "too small tiny stack".

Now I realized that the golang code on the thread that triggers a page fault (trying to execute an instruction at the address 6 (always the same address), would seem to involve some floating point arithmetic:

(gdb) disas 0x0000100000186681-32,0x0000100000186681+32
Dump of assembler code from 0x100000186661 to 0x1000001866a1:
   0x0000100000186661 <strconv.fmtF+833>:   mov    0x20(%rsp),%eax
   0x0000100000186665 <strconv.fmtF+837>:   mov    %rax,(%rsp)
   0x0000100000186669 <strconv.fmtF+841>:   lea    0x28(%rsp),%rax
   0x000010000018666e <strconv.fmtF+846>:   mov    %rax,0x8(%rsp)
   0x0000100000186673 <strconv.fmtF+851>:   call   0x10000018a400 <type..eq.strconv.NumError+224>
   0x0000100000186678 <strconv.fmtF+856>:   call   0x1000001abd90 <crypto/aes.gcmAesEnc+1456>
   0x000010000018667d <strconv.fmtF+861>:   call   0x100000180630 <math.frexp+48>
   0x0000100000186682 <strconv.fmtF+866>:   lea    0x555337(%rip),%rax        # 0x1000006db9c0
   0x0000100000186689 <strconv.fmtF+873>:   mov    0x10(%rsp),%rcx
   0x000010000018668e <strconv.fmtF+878>:   cmp    %rax,0x30(%rcx)
   0x0000100000186692 <strconv.fmtF+882>:   je     0x1000001866f6 <strconv.fmtF+982>
   0x0000100000186694 <strconv.fmtF+884>:   mov    0x30(%rcx),%rbx
   0x0000100000186698 <strconv.fmtF+888>:   mov    0xb8(%rbx),%rdx
   0x000010000018669f <strconv.fmtF+895>:   test   %rdx,%rdx
End of assembler dump.

Now, the sched::thread::setup_large_syscall_stack() calls memcpy() which may use SSE instructions: https://github.com/cloudius-systems/osv/blob/f4cb709d4380f4f4ead952e14e12cd09280c181d/arch/x64/arch-switch.hh#L263-L292

and we know that malloc() may "sleep" and thus go through the scheduler code which as I understand uses floating point arithmetic. This made me think we may be experiencing occasional scenarios where some FPU-related registers get clobbered by sched::thread::setup_large_syscall_stack() and cause the crash. This would make sense given we do not save/restore FPU state as we do in the syscall() function: https://github.com/cloudius-systems/osv/blob/f4cb709d4380f4f4ead952e14e12cd09280c181d/linux.cc#L430-L436. And we ought to do the same when calling sched::thread::setup_large_syscall_stack() and thread::free_tiny_syscall_stack() which both call OSv malloc/free and other code possibly clobbering FPU registers.

I cannot prove with 100% certainty that not saving FPU was causing the crashes given that bumping the size of the tiny stack from 1K to 4K seems to have made the issue go away. Possibly, we got lucky because the memcpy() would use different instructions maybe? But given syscall should preserve FPU, I think it is the correct thing to do to save/restore it in both sched::thread::setup_large_syscall_stack() and thread::free_tiny_syscall_stack(). The penalty would only be paid once per thread that uses syscalls. The bigger penalty is that we need a bigger "tiny" stack - probably 4K should be enough (I measured that FPU state needs around 850 extra bytes of stack).

nyh commented 1 year ago

Very nice research work, I think you got both the problem and the solution. Indeed in the past in many different contexts (both while I was working on nest KVM and while I was working on the OSv kernel), seeing random segfaults in an already-debugged application was a telltale sign of forgotten FPU saving. In the nested KVM example, I used to run long compilations in the VM knowing that compilations used a lot of floating point operations (not for actual floating point - more like the SSE vectorized string operations you mentioned) and would often crash because of FPU bugs.

In syscall() itself we already use sched::fpu_lock to save the FPU state. You are right that it was a mistake that we call setup_large_syscall_stack() before syscall() but don't save the FPU state. Like you said, given that setup_large_syscall_stack() (and free_tiny_syscall_stack()) are only called once per thread (which actually uses syscall()), I agree that adding fpu::lock to these functions would be a good and easy solution with negligible overhead.

It's good that you noticed that if we use fpu::lock this will require a bigger "tiny stack" (maybe it's worth adding the "canary" code you added to help catch bugs in the future?). There can be a solution for that, but I'm not sure if it's worth the hassle of writing more code if just increasing the 1K "tiny stack" to 2K is enough. The solution I think we can do to keep the "tiny stack" just 1K (or even substantially less) is to use a per-CPU "temporary stack" just for running setup_large_syscall_stack() . We already have exactly such a stack allocated, the arch_cpu::percpu_exception_stack but I'm not sure we can or if it's wise to reuse it (can we have an exception while setup_large_syscall_stack() is running?) but if not we can have another one like it (you don't have to put it in arch-cpu.hh, you can use generic per-cpu support from osv/percpu.hh. Then we can switch to that temporary stack, save the FPU state on it and allocate the new stack.

wkozaczuk commented 1 year ago

Yes, I thought about this per-cpu stack solution at some point. But I am struggling to understand how we would use it when handling setup_large_syscall_stack(). Both this and free_tiny_syscall_stack() may sleep (because the 1st one calls malloc() and the 2nd one calls free()) so they may be preempted to another thread also in the middle of the setup of a large syscall stack on the same CPU. Two threads cannot use the same stack, can they? Disable preemption during setup_large_syscall_stack()? But is it allowed during malloc and free?

cloudius-systems / osv

Golang httpserver app crashes occasionally on startup #1233