stack overflow during syscall

yuval-k commented 8 years ago

I am experiencing a stack overflow with gorump. I believe the core cause is that the stack is not switched when performing a system call, but I am by no means a Go expert.

To describe: when running our unik example_go_static_fileserver on AWS (or Xen, for that matter), we get the following error:

...
Page fault at linear address 0x404216150, rip 0x162f48, regs 0x4221c08, sp 0x4221cb0, our_sp 0x4221bf0, code 0
Thread: lwp
RIP: e030:[<0000000000162f48>]
RSP: e02b:0000000004221cb0  EFLAGS: 00010206
RAX: 0000000404216120 RBX: 0000000000000000 RCX: 0000000000000017
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000001880000 R09: 0000000057aa24a3
R10: 000000002c523760 R11: 0000000000000000 R12: 00000000013ab690
R13: 0000000000000000 R14: 0000000000000003 R15: 0000000001940e00

I'll save you the long days of single-stepping due to the lack of watchpoint support in the Xen gdb stub and get to the root cause.

We have two goroutines that do read/write. They are created one right after the other, and their 2 KB stack allocations are adjacent to each other.

While the first goroutine is waiting for I/O, the second one runs and calls the write syscall. [At this point, from my understanding, the stack should switch to the system stack, as syscalls are not aware of any Go stack business. This does not happen.]

The functions related to the syscall (specifically, the write to the Xen console) run on the same 2 KB stack. The second goroutine, now running C code that is unaware of Go's stack structure, runs out of stack space and overwrites the first goroutine's stack. The bug is only detected when the first goroutine resumes and crashes.
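The overflow described above can be illustrated with a toy model. This is only a sketch, not the Go runtime or rump code: the arena layout, the helper names pushBytes and corrupted, the fill bytes, and the 3000-byte call depth are all hypothetical, and it assumes the overflowing stack sits directly above its neighbor and grows downward.

```go
package main

import "fmt"

// stackSize is the 2 KB per-goroutine stack mentioned in the report.
const stackSize = 2048

// pushBytes simulates C code consuming n bytes of stack. It writes fill
// downward from sp with no check against the stack's lower bound, which is
// the behavior of code that knows nothing about Go's stack limits.
// It returns the new (lower) stack pointer. Hypothetical helper.
func pushBytes(arena []byte, sp, n int, fill byte) int {
	for i := 0; i < n; i++ {
		sp--
		arena[sp] = fill
	}
	return sp
}

// corrupted counts the bytes of the first stack (arena[0:stackSize]) that
// no longer hold the value want. Hypothetical helper.
func corrupted(arena []byte, want byte) int {
	n := 0
	for _, b := range arena[:stackSize] {
		if b != want {
			n++
		}
	}
	return n
}

func main() {
	// Two adjacent stacks: the first goroutine owns [0, 2048),
	// the second owns [2048, 4096). This layout is an assumption.
	arena := make([]byte, 2*stackSize)
	for i := 0; i < stackSize; i++ {
		arena[i] = 0xAA // live data of the first (blocked) goroutine
	}

	// The second goroutine's syscall path needs more than 2 KB.
	// The 3000-byte figure is made up for illustration.
	sp := pushBytes(arena, 2*stackSize, 3000, 0xBB)

	fmt.Println("stack pointer left its own stack:", sp < stackSize)
	fmt.Println("bytes of the first stack overwritten:", corrupted(arena, 0xAA))
}
```

Nothing fails at the moment of the overwrite. The damage only shows up later, when the owner of the first stack reads its data back, which matches the delayed crash in the report.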

Note the RAX value in the dump: 0000000404216120. The lower part, 0x4216120, is the original value (it is a pointer, and the variable is stored on the stack). The 0x00000004 in the upper 32 bits was written by the second goroutine during the stack overflow.
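The split of RAX can be checked with a few lines of Go. This is a small sketch for verifying the arithmetic only; the helper name splitRAX is hypothetical, and the two constants are copied from the dump above.

```go
package main

import "fmt"

// splitRAX splits a 64-bit register value into its upper and lower
// 32-bit halves. Hypothetical helper, for illustration only.
func splitRAX(rax uint64) (hi, lo uint32) {
	return uint32(rax >> 32), uint32(rax)
}

func main() {
	const rax = 0x0000000404216120 // RAX from the dump
	const faultAddr = 0x404216150  // linear address from the page fault line

	hi, lo := splitRAX(rax)
	fmt.Printf("upper 32 bits: %#x\n", hi) // the value written by the overflow
	fmt.Printf("lower 32 bits: %#x\n", lo) // the original pointer value
	fmt.Printf("fault address - RAX: %#x\n", uint64(faultAddr)-uint64(rax))
}
```

The faulting address is 0x30 bytes past the corrupted RAX, which is consistent with the crash being a dereference through the damaged pointer.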

To test, I doubled Go's stack size, and everything seems to work. This is the fastest solution I can think of, but it is also the hackiest.

eyberg commented 8 years ago

Thanks for this bug report. I assume this is for the 1.5 version?

It might take me a bit to look at and integrate the fix. I'll try to block some time, but if you or anyone else watching the thread wants to get it in sooner, we'd want the following:

simple test code that crashes
test we can run against CI - example so far https://github.com/deferpanic/gorump/blob/master/test/gc_test/verify_gc.sh
PR reflecting https://github.com/emc-advanced-dev/unik/commit/43aa24497cdb8c05400c3eacf9fe63c700c2733e#diff-c6773b19d6b4be6de1809ef8fec87178R1050 change isolated to a rumprun build tag file - I believe the arches can be specified in the original

yuval-k commented 8 years ago

@eyberg I don't think you should merge my commit. I believe the real fix is to use the rump system stack for rump syscalls. Unfortunately, I don't have enough Go/rump knowledge to do that.

deferpanic / gorump

stack overflow during syscall #42