Closed wkozaczuk closed 1 year ago
Very nice research work, I think you got both the problem and the solution. Indeed in the past in many different contexts (both while I was working on nest KVM and while I was working on the OSv kernel), seeing random segfaults in an already-debugged application was a telltale sign of forgotten FPU saving. In the nested KVM example, I used to run long compilations in the VM knowing that compilations used a lot of floating point operations (not for actual floating point - more like the SSE vectorized string operations you mentioned) and would often crash because of FPU bugs.
In syscall() itself we already use sched::fpu_lock to save the FPU state. You are right that it was a mistake that we call setup_large_syscall_stack() before syscall() but don't save the FPU state. Like you said, given that setup_large_syscall_stack() (and free_tiny_syscall_stack()) are only called once per thread (which actually uses syscall()), I agree that adding fpu::lock to these functions would be a good and easy solution with negligible overhead.
It's good that you noticed that if we use fpu::lock this will require a bigger "tiny stack" (maybe it's worth adding the "canary" code you added to help catch bugs in the future?). There can be a solution for that, but I'm not sure if it's worth the hassle of writing more code if just increasing the 1K "tiny stack" to 2K is enough. The solution I think we can do to keep the "tiny stack" just 1K (or even substantially less) is to use a per-CPU "temporary stack" just for running setup_large_syscall_stack() . We already have exactly such a stack allocated, the arch_cpu::percpu_exception_stack but I'm not sure we can or if it's wise to reuse it (can we have an exception while setup_large_syscall_stack() is running?) but if not we can have another one like it (you don't have to put it in arch-cpu.hh, you can use generic per-cpu support from osv/percpu.hh. Then we can switch to that temporary stack, save the FPU state on it and allocate the new stack.
Yes, I thought about this per-cpu stack solution at some point. But I am struggling to understand how we would use it when handling setup_large_syscall_stack()
. Both this and free_tiny_syscall_stack()
may sleep (because the 1st one calls malloc() and the 2nd one calls free()) so they may be preempted to another thread also in the middle of the setup of a large syscall stack on the same CPU. Two threads cannot use the same stack, can they? Disable preemption during setup_large_syscall_stack()
? But is it allowed during malloc and free?
I have been recently running various tests using capstan and the scripts described in this Wiki and I would encounter occasional (once in 3-5 runs) crashes of the older golang httpserver app (version 1.12.6) looking like this:
These crashes would seem to result in the same stack trace every time and when connected to gdb it would look like these:
When looking at other golang threads I would typically find one in the middle of creating the large syscall stack (see
sched::thread::setup_large_syscall_stack()
below):What is interesting these crashes would not happen if the
golang-httpserver
orgolang-pie-httpserver
app was built with a fairly recent go like 1.16.15 (or at least I could not reproduce it). Given one of the thread stack traces involvedsched::thread::setup_large_syscall_stack()
and the fact those crashes would never happen after the app started successfully, made me think something wrong sometimes happens when OSv handles the 1st syscall call on given thread and needs to allocate a large syscall stack to swap to from the tiny one which is handled bysched::thread::setup_large_syscall_stack()
. I remembered that 5 years ago when I implemented this code I measured it would need around 600 bytes of tiny stack to execute this code: https://github.com/cloudius-systems/osv/blob/f4cb709d4380f4f4ead952e14e12cd09280c181d/arch/x64/arch-switch.hh#L21-L30 which meant that 1024 bytes should be enough. But maybe things have changed since and now in the worst-case scenario we need more than 1024. So initially I thought that bumping the tiny stack size to 1 full page (less than that but greater than 1024 would still result in 4K anyway) would solve this issue. And it seemed like it did (the crashes would disappear) but after digging more - I put a canary logic to see if the bottom of the tiny stack ever gets overwritten - I would never see that happen. In worst cases thesched::thread::setup_large_syscall_stack()
would seem to use around 750 bytes of stack and assert for the canary would never trigger. So that would seem to contradict the theory about the "too small tiny stack".Now I realized that the golang code on the thread that triggers a page fault (trying to execute an instruction at the address 6 (always the same address), would seem to involve some floating point arithmetic:
Now, the
sched::thread::setup_large_syscall_stack()
callsmemcpy()
which may use SSE instructions: https://github.com/cloudius-systems/osv/blob/f4cb709d4380f4f4ead952e14e12cd09280c181d/arch/x64/arch-switch.hh#L263-L292and we know that
malloc()
may "sleep" and thus go through the scheduler code which as I understand uses floating point arithmetic. This made me think we may be experiencing occasional scenarios where some FPU-related registers get clobbered bysched::thread::setup_large_syscall_stack()
and cause the crash. This would make sense given we do not save/restore FPU state as we do in thesyscall()
function: https://github.com/cloudius-systems/osv/blob/f4cb709d4380f4f4ead952e14e12cd09280c181d/linux.cc#L430-L436. And we ought to do the same when callingsched::thread::setup_large_syscall_stack()
andthread::free_tiny_syscall_stack()
which both call OSvmalloc/free
and other code possibly clobbering FPU registers.I cannot prove with 100% certainty that not saving FPU was causing the crashes given that bumping the size of the tiny stack from 1K to 4K seems to have made the issue go away. Possibly, we got lucky because the
memcpy()
would use different instructions maybe? But given syscall should preserve FPU, I think it is the correct thing to do to save/restore it in bothsched::thread::setup_large_syscall_stack()
andthread::free_tiny_syscall_stack()
. The penalty would only be paid once per thread that uses syscalls. The bigger penalty is that we need a bigger "tiny" stack - probably 4K should be enough (I measured that FPU state needs around 850 extra bytes of stack).