Closed rennergade closed 8 months ago
Max has implemented kill() and sigaction() and has passed that off to @jesings who's going to work on the signal handler. Max now is going to work on implementing sigprocmask().
Currently I am able to receive a signal while the receiving thread is in untrusted code and then dispatch to a signal handler in untrusted code. In order to be able to return from that signal handler we need to do some return address hijacking on the stack to some trampoline stub that will restore the registers from their previous state. Additionally we need to figure out what we do when we receive a signal when the thread is in trusted code--we can't immediately process it because what if we're i.e. in the middle of modifying the vmmap. We should discuss how to implement this.
I am looking to implement this by adding an untrusted function on the springboard whose return address is placed onto the stack right below the stack frame for the handler. The NaCl signal handler currently does the following relevant things: It starts a new stack frame below the red zone of the stack frame that received the signal By creating a new stack pointer address for untrusted And then subtracts more space from that stack pointer address In order to have enough room to store this struct which contains, among other things, all of the registers It then populates the registers and other stuff into that stack The stack frame for the signal handling function starts below this one So when we return from the signal handler, these contents will still be on the stack And so we should return from the signal handler to some function that can actually populate these into our registers
Great progress, thank you! I think this makes sense as a way to handle returns.
@moyix doe this make sense to you? any concerns or advice?
This register restoration technique actually won't work safely either in cagespace or in trusted without clobbering a register, which would be bad.
The issue is that when we return from the cagespace signal handler, we would execute some code that would need to restore registers, and after that code was finished we would need to return execution back to the point in cagespace that we were right before the signal was received. This code could either be in cagespace or trusted, and this problem would hold either way.
We need to restore every single register before finally restoring the instruction pointer, as once we restore the instruction pointer we start executing the pre-signal cagespace code again.
Once we have restored every register except the instruction pointer, we are left in a situation where we cannot clobber any of these registers as we need them to be exactly as they were before cagespace received the signal.
x86 does not allow you to change the instruction pointer without either using a fixed address jmp
(which obviously doesn't work for our case), a different register for an indirect jmp
, or using a ret
to pop a return address from the stack and put it into %rip.
Using a register for an indirect jump clearly doesn't work as that would require clobbering it.
Returns from the stack are fraught because at this point the stack pointer points to cagespace memory which in order to safely use as a return address needs to have its top 32 bits masked off by the value in %r15, which is handled in nacl (in the naclret
pseudo-instruction) by using the %r11 register to hold the address, masking the top 32 bits of that register off, and then using it as the target for an indrect jump--this clobbers %r11 so is no good, and in fact any such masking scheme requires the clobbering of a register.
Therefore, accomplishing this without for example adding a new reserved register in nacl-gcc seems like it may not be possible in a satisfactory way.
Couldn't you do something like:
Or is the concern that since the stack is in untrusted memory someone could modify the saved RIP in between masking it and the ret?
I think the other part I don't understand is – if the normal naclret approach clobbers R11, why isn't it a problem for them?
@moyix Yes, the modification of the stack memory in between saving RIP and masking it is indeed the concern. I have that implemented already but I think the TOCTTOU attack really breaks NaCl's security model. The normal naclret is fine because r11 is a caller saved register so there is no problem if it gets clobbered in the return statement. However for signal handlers, we need to restore both caller and callee saved registers so that won't work.
@moyix we're going to investigate if we can reserve another register and its effects. Do you think this is a good idea? We're struggling to come up with any other options.
The only other option I can think of is waiting to deliver the signal until you can guarantee that no threads in the cage are running (i.e. by blocking them from returning to untrusted code), which removes the TOCTOU since no one else is able to write in between.
I don't have a strong opinion on which of these options is easier to implement – also note that burning a register is fairly drastic on an arch like x86 where registers are scarce so I'm not sure how bad the perf hit would be.
Technically there are more exotic options like TSX, but I think those are not widely supported and possibly Intel is going to kill them.
So apparently we can reserve a register without changing the gcc code with the -ffixed-reg flag. I experimented with it and it seems to work.
We can possibly avoid adding overhead in situations where we don't need signal handlers by only compiling with that for programs that use handlers (and disabling sigaction when we don't). May be overkill but its an idea.
As far as testing the performance hit, any ideas for benchmarks?
I would think that the micro and macro benchmarks for the paper would be a good starting place.
Returns from the stack are fraught because at this point the stack pointer points to cagespace memory which in order to safely use as a return address needs to have its top 32 bits masked off by the value in %r15, which is handled in nacl (in the naclret pseudo-instruction) by using the %r11 register to hold the address, masking the top 32 bits of that register off, and then using it as the target for an indrect jump--this clobbers %r11 so is no good, and in fact any such masking scheme requires the clobbering of a register.
I think I may have a solution, but please check it carefully.
I believe that the modifications that happen to r11 and r15 by NaCl code are for memory safety and the concern of returning from the stack is that you may be in the middle of a naclret pseudo instruction when you return from a signal.
My understanding is that r11 is set at the beginning of this pseudo instruction and is modified by r15.
Couldn't we check to see if we are in the middle of this sequence and just rewind the instruction pointer to be at the start of it?
Alternatively, consider the following. I believe that NaCl only uses %r15 for masking and only emits code that has a pattern set %r15 to the mask, mask %r11, jmp, etc. Could we detect if the next instruction would use %r15 for masking and if so rewind the IP by one instruction to where %r15 is set? We should then be able to safely use %r15, I think.
(I haven't thought through this carefully or re-examined the NaCl spec, so please view my suggestion with critical eye...)
While r15 is only used in NaCl code in pseudo instructions for memory safety to mask things, r11 is used as a general purpose register. r11 is also used in the naclret pseudo-instruction which clobbers it, but in the body of a function before the naclret it can be used arbitrarily. The naclcall and nacljmp pseudo instructions don't need r11 because they are given as a parameter a register value to jmp to/call. Thus we can't use either technique, as neither would be a complete solution to the issue. A lot of the micro benchmarks are kinds of workloads that would not benefit at all from 1 more register being available in untrusted, either because most of the program runtime would be within trusted (e.g. pipes), or because they are just doing memcpy/other simple operations which don't even use that many registers. I expect that in integer computation tasks things registers might be helpful but in those I would expect that one could use SIMD. What would be a relatively simple micro-benchmark that would use many registers?
Does r15 get set before it is used each time or is it always assumed to have the correct value?
r15 is assumed to be a constant mask while in untrusted code and is never allowed to be modified except in trusted code, so we couldn't use it either.
As things stand receiving signals and returning from them works in untrusted 100% of the time and works in trusted 80% of the time. There remain two bugs. One of which which makes the program stall after the signal is received appears to be that the signal is received by the wrong thread. This happens maybe 10 percent of the time and I'm not sure exactly what's happening. @rennergade may have some insight. The other guy is when we receive signals in trusted while the trusted thread is trying to switch back to untrusted, we end up restoring the trusted state but in reality need to back up the program to set the untrusted state post syscall return to what we want. Thus we need to detect when we're switching syscalls. As far as I can tell the best way to do this is to unwind the stack and see if NaClSwitchToApp is on the call stack but this seems imperfect in case the gcc used to compile NaCl does something like frame pointer elision. I think it should be ok but if there's another way it's worth trying. I also considered checking the instruction pointer we are trying to return to and seeing if it is within one of a number of functions by taking the start and end address of these functions. This is more finicky and can go wrong. Finally there is a performance concern with storing the signal pending flag in a dash map in rust rather than in a thread local variable which would perform much better.
Thanks @jesings this is awsome! Some responses:
I ran my a test ~25 times and was not able to reproduce either. This first issue with the wrong threads is a bit puzzling. If you can send me a way to reproduce that that would be great. Maybe its a signal from RR?
This one makes sense. Can we implement some sort of flag system so we know which one to change to at the beginning of NaClSwitch? is that possible?
Yeah definitely agree with using a TL variable for this in nacl, think it will be much cleaner and definitely more performant.
So it seems both errors happen when the signal is caught in specific places in trusted--the loop error happens if the signal is received while in the code in trusted before the natp is populated with the user register values, and the seg fault error happens if the signal is caught in parts of trusted code that happen while restoring user register state (i.e. some registers are already restored). We can set a flag between the points where the natp is fully initialized and where we start to copy out so we know that if this flag is set when a signal is received then it's safe to proceed normally. However, if this flag is not set we need to do something different. In the case that the syuscall is caught in the beginning of natp copying, we probably need to restore the natp state from the registers dumped onto the user stack within the signal handler. In the case that the signal is caught on exit of the syscall then we just need to rerun the copying-out-to-user code. However, detecting whether we're in the entry or exit of a syscall is annoying. A simple flag wouldn't work because it takes instructions in trusted to set/unset that flag, and while that flag would be being set/unset there would be points in the code where the flag says we're not in entry when we are. We can try doing stack unwinding manually to see whether we have an enclosing function that signifies the end or start of the syscall. Or we can try to write the entry entirely in assembly and statically check the addresses of the function and check if our program counter is within the addresses of the syscall entry functions. There might be another approach but this is where things currently stand.
Or we can try to write the entry entirely in assembly and statically check the addresses of the function and check if our program counter is within the addresses of the syscall entry functions.
This is what my mind immediately jumped to. In theory we could make sure these are placed in some known, reserved ranges to make this easier (i.e. checking a bitmask on the address).
(This seems to be a reasonable way to do so, but I'm not sure how this works with NaCl's toolchain: https://stackoverflow.com/questions/18778323/how-to-specify-a-memory-location-at-which-function-will-get-stored. You may also need to pad this or do some other trickery so that other things don't get stored there as well. )
We should have control of the function placement so I don't think that part should be an issue, just wondering how difficult the assembly rewrite is going to be.
Why does the handler have to be entirely in assembly? Why not just the address checking portion?
On Mon, Jul 10, 2023 at 9:49 PM Nicholas Renner @.***> wrote:
We should have control of the function placement so I don't think that part should be an issue, just wondering how difficult the assembly rewrite is going to be.
— Reply to this email directly, view it on GitHub https://github.com/Lind-Project/lind_project/issues/296#issuecomment-1629978940, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGROD7Z2PSOKNW5LYKBYRTXPSWJVANCNFSM6AAAAAAUXAIJOY . You are receiving this because you commented.Message ID: @.***>
The handler doesn't have to be all in assembly--there are two functions called on syscall entry before the trusted state is suitable to receive a signal normally. We can actually check the address bounds by just putting a label in a c function in inline assembly so it turns out we don't really need to rewrite things in assembly.
So two issues as far as I can tell remain, although we are up to about a 98 percent success rate. The first issue is with the GetTLSFastPath1 and 2 functions which have a completely separate path in and out of trusted code to normal syscalls. I need to do more inquiry into this but it should be solvable. The second issue is a race condition which I want some input on if anyone has. The syscall entry code sets the instruction pointer value to return to when the syscall returns. The signal code modifies this same value. We can check in a syscall if a signal was received already using a flag and decide not to modify this value (and instead modify a separate value), however whatever ways I can think of to do this result in a TOCTTOU error as signals can be received between the check and the set. I have a vague sense that maybe we could use the xchg instruction to allow us to set a value and receive an old value in a single instruction in order to test whether we need to run cleanup code, but this idea isn't fully solidified in my head yet. Does it seem reasonable? Does anyone have any other suggestions for how to avoid this TOCTTOU?
I've unified @jesings changs with the syscall implementations that @AlpacaMax made in the Spring, as well as updating them with the latest develop and adding some glue for everything. The good news is that postgres now runs further, and the Reaper() server loop that we had been faking now runs correctly via signals instead of a hack job.
Unfortunately postgres still seems to hang. @RusherRG and I have tracked the bug down to SIGUSR2 signals not being properly caught from the postgres autovacuum. Here it becomes a headscratcher.
We've set up Lind's signal handling to always send signals, and for the NaCl signal handler SignalCatch to always receive and handle them. There we decide if we want to send them to a user program handler, or mask and resend later, return, or terminate.
However, we never catch SIGUSR2 via SignalCatch, which is deeply confusing. @Yaxuan-w helped me make user tests for SIGUSR2 which seem to work fine. Yet we don't receive them running the postgres/pgbench suite.
The other thing thats striking is it seems that all of the threads are stuck in rustposix's poll()/select() implementation when they are not receiving signals. I'm not sure if this is just a coincidence because postgres seems to heavily use poll/select. This did lead me to fix a bug where we weren't properly doing an EINTR when poll/select received signals, but still no dice on the larger bug.
IMHO, the only thing that would stop SignalCatch from receiving the SIGUSR2 signal is if it was actually masked by the host. The only places that could happen are sigaction (which is only called at setup, definitely not the source of the problem), or returning via the kernels sigreturn or Jonathan's added NaClSysSigreturn for returning from user handlers.
However, combing through these it seems like none of those are happening on the thread before the sigusr2 is sent, and its still not being received.
@RusherRG and I are going to continue to try to solve this one. If anyone has any ideas let us know.
@RusherRG @Yaxuan-w and I fixed the above issues. The no signal problem was solved by properly resetting the signal mask on forks, and we fixed a concurrency issue by implementing semaphores as well as fixing the return values on error from glibc.
There is either one or two errors left thats preventing us from running the full stack. The first one is a known signal bug and was documented by @jesings before he left:
* One rare bug remains, when we are signaled in untrusted (whether after single stepping or not)
* and the last untrusted instruction executed is an indirect call (or probably jmp), we end up
* seg faulting somewhere down the line. I bet the stack is somehow somewhat wonky in this case
* but sadly I have to leave this bug unsolved before I leave.
*/
This may or may not be causing the second bug which is a mysterious RustPOSIX panic! inside of the DashMap iter() routine, which doesn't really make sense. Since that one only happens when signals are enabled I'm going to try to fix the known bug first.
This bug seems to panic w/ a stack address as the program counter, which is definitely wrong. I managed to stop when it catches the sigsegv and get a (very bizzarre) backtrace as well as the register values.
registers:
$5 = {rax = 0x18, rbx = 0x1f036a8, rcx = 0x1024c3f8, rdx = 0x10140450, rsi = 0x1024c3f8, rdi = 0x10140450, rbp = 0x619ffffee4a0, stack_ptr = 0x619ffffee478, r8 = 0xfffee466, r9 = 0x530f, r10 = 0x0, r11 = 0x619ffffee4a0, r12 = 0x0, r13 = 0x193, r14 = 0x619f01a95200, r15 = 0x619f00000000, prog_ctr = 0x619ffffee4a0, flags = 0x10306, cs = 0x33, ss = 0x0, ds = 0x0, es = 0x0, fs = 0x0, gs = 0x0, padding = 0x0}
backtrace:
Thread 17 (Thread 0x7ffff735c640 (LWP 296795) "sel_ldr"):
#0 SignalCatch (sig=11, info=0x7ffff7360af0, uc=0x7ffff73609c0) at src/trusted/service_runtime/linux/nacl_signal.c:686
#1 <signal handler called>
#2 0x0000619ffffee4a0 in ?? ()
#3 0x0000619f01aa1060 in ?? ()
#4 0x0000000100000010 in ?? ()
#5 0x1024c3f8f63ec0f4 in ?? ()
#6 0x2000000000000001 in ?? ()
#7 0x101404501012a8d0 in ?? ()
#8 0x0000619ffffee510 in ?? ()
#9 0x0000619f01ab3de0 in ?? ()
#10 0x20000000000001f2 in ?? ()
#11 0xfffee83000000001 in ?? ()
#12 0x1013e168fffee850 in ?? ()
#13 0x1012a8d0f63ead20 in ?? ()
#14 0x0100619f1024c3f8 in ?? ()
#15 0x000000001012ed48 in ?? ()
#16 0x0000619ffffee500 in ?? ()
#17 0x100737380142d9e0 in ?? ()
#18 0x1024c3f801774aa0 in ?? ()
#19 0x1012ed4800000001 in ?? ()
#20 0x0000619ffffee520 in ?? ()
#21 0x000000001012a8d0 in ?? ()
#22 0x0000619ffffee540 in ?? ()
#23 0x0000619f010e2240 in ?? ()
#24 0xfffee850fffee830 in ?? ()
#25 0x1012f6a01013e168 in ?? ()
#26 0xfffee57ffffee560 in ?? ()
#27 0x000000001012f6a0 in ?? ()
#28 0x0000619ffffee590 in ?? ()
#29 0x0000619f010e24e0 in ?? ()
#30 0x2000000000000001 in ?? ()
#31 0xfffee801fffeea10 in ?? ()
#32 0xfffee850fffee830 in ?? ()
#33 0xf63ead201013e168 in ?? ()
#34 0x000001f2f6df6e00 in ?? ()
#35 0xfffeea101012ed48 in ?? ()
#36 0x0000000100000000 in ?? ()
#37 0x00000000f63ead20 in ?? ()
#38 0x0000619ffffee950 in ?? ()
#39 0x0000619f010a5de0 in ?? ()
#40 0xfffee8a000000000 in ?? ()
#41 0xfffee860111d2001 in ?? ()
#42 0x0000619ffffee650 in ?? ()
#43 0x0000000001729b01 in ?? ()
#44 0x1013e028fffee501 in ?? ()
#45 0xf63e7ca0f63ead20 in ?? ()
#46 0x0000000000000034 in ?? ()
#47 0x100d544001774a70 in ?? ()
#48 0x0004000300020001 in ?? ()
#49 0x0008000700060005 in ?? ()
#50 0x000c000b000a0009 in ?? ()
#51 0x0010000f000e000d in ?? ()
#52 0x0014001300120011 in ?? ()
#53 0x0018001700160015 in ?? ()
#54 0x001c001b001a0019 in ?? ()
#55 0x0020001f001e001d in ?? ()
#56 0x0024002300220021 in ?? ()
#57 0x0028002700260025 in ?? ()
#58 0x002c002b002a0029 in ?? ()
#59 0x0030002f002e002d in ?? ()
#60 0x0034003300320031 in ?? ()
#61 0x0038003700360035 in ?? ()
#62 0x003c003b003a0039 in ?? ()
I was able to create a minimal example for this bug. The example forks a child, registers a signal hanlder, and then loops continuously over an indirect call. The parent then loops signaling the child. Here is the code:
#include <errno.h>
#include <string.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
int somefunc()
{
return 0;
}
void handler(int signum)
{
printf("Signal Handler Test!\n");
fflush(stdout);
return;
}
int main(void)
{
struct sigaction new_action;
struct sigaction old_action;
pid_t pid = fork();
if (pid == 0) {
printf("Cage 2 pid: %ld\n", (long)getpid());
printf("Handler function ptr %p\n", handler);
fflush(stdout);
new_action.sa_handler = handler;
sigaction(SIGUSR2, &new_action, NULL);
sigaction(SIGUSR2, NULL, &old_action);
if (new_action.sa_handler == old_action.sa_handler) {
printf("Successfully changed the signal handler for Signal %d\n", SIGUSR1);
fflush(NULL);
} else {
printf("Fail to correctly change the signal handler for Signal %d\n", SIGUSR1);
fflush(NULL);
}
int (*f)() = &somefunc;
while (1) {
f();
}
} else {
// Cage 1
sleep(2);
printf("Killing Cage 2 using PID: %ld\n", (long)pid);
fflush(stdout);
while (1) {
kill(pid, SIGUSR2);
sleep(1);
}
}
}
for reference this example disassembles to:
0000000001000620 <somefunc>:
1000620: 55 push %rbp
1000621: 48 89 e5 mov %rsp,%rbp
1000624: b8 00 00 00 00 mov $0x0,%eax
1000629: 41 5b pop %r11
100062b: 44 89 dd mov %r11d,%ebp
100062e: 4c 01 fd add %r15,%rbp
1000631: 41 5b pop %r11
1000633: 41 83 e3 e0 and $0xffffffe0,%r11d
1000637: 4d 01 fb add %r15,%r11
100063a: 41 ff e3 jmp *%r11
100063d: 0f 1f 00 nopl (%rax)
0000000001000640 <handler>:
1000640: 55 push %rbp
1000641: 48 89 e5 mov %rsp,%rbp
1000644: 83 ec 10 sub $0x10,%esp
1000647: 4c 01 fc add %r15,%rsp
100064a: 89 7d fc mov %edi,-0x4(%rbp)
100064d: bf 38 05 00 11 mov $0x11000538,%edi
1000652: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
1000659: 00 00
100065b: e8 e0 fa ff ff call 1000140 <.plt+0xc0>
1000660: 8b 05 9a f9 01 10 mov 0x1001f99a(%rip),%eax # 11020000 <stdout@GLIBC_2.2.5>
1000666: 89 c7 mov %eax,%edi
1000668: 0f 1f 40 00 nopl 0x0(%rax)
100066c: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1000673: 0f 1f 84 00 00 00 00
100067a: 00
100067b: e8 40 fd ff ff call 10003c0 <.plt+0x340>
1000680: 48 89 ec mov %rbp,%rsp
1000683: 41 5b pop %r11
1000685: 44 89 dd mov %r11d,%ebp
1000688: 4c 01 fd add %r15,%rbp
100068b: 41 5b pop %r11
100068d: 41 83 e3 e0 and $0xffffffe0,%r11d
1000691: 4d 01 fb add %r15,%r11
1000694: 41 ff e3 jmp *%r11
1000697: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
100069e: 00 00
00000000010006a0 <main>:
10006a0: 55 push %rbp
10006a1: 48 89 e5 mov %rsp,%rbp
10006a4: 81 ec 30 01 00 00 sub $0x130,%esp
10006aa: 4c 01 fc add %r15,%rsp
10006ad: 66 66 66 66 66 2e 0f data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
10006b4: 1f 84 00 00 00 00 00
10006bb: e8 c0 fc ff ff call 1000380 <.plt+0x300>
10006c0: 89 45 f8 mov %eax,-0x8(%rbp)
10006c3: 83 7d f8 00 cmpl $0x0,-0x8(%rbp)
10006c7: 0f 85 f3 01 00 00 jne 10008c0 <main+0x220>
10006cd: 66 66 66 66 66 2e 0f data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
10006d4: 1f 84 00 00 00 00 00
10006db: e8 60 fb ff ff call 1000240 <.plt+0x1c0>
10006e0: ba 4d 05 00 11 mov $0x1100054d,%edx
10006e5: 89 d2 mov %edx,%edx
10006e7: 89 c6 mov %eax,%esi
10006e9: 89 d7 mov %edx,%edi
10006eb: b8 00 00 00 00 mov $0x0,%eax
10006f0: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
10006f7: 00 00 00 00
10006fb: e8 00 fa ff ff call 1000100 <.plt+0x80>
1000700: b8 5e 05 00 11 mov $0x1100055e,%eax
1000705: 89 c0 mov %eax,%eax
1000707: be 40 06 00 01 mov $0x1000640,%esi
100070c: 89 c7 mov %eax,%edi
100070e: b8 00 00 00 00 mov $0x0,%eax
1000713: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
100071a: 00
100071b: e8 e0 f9 ff ff call 1000100 <.plt+0x80>
1000720: 8b 05 da f8 01 10 mov 0x1001f8da(%rip),%eax # 11020000 <stdout@GLIBC_2.2.5>
1000726: 89 c7 mov %eax,%edi
1000728: 0f 1f 40 00 nopl 0x0(%rax)
100072c: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1000733: 0f 1f 84 00 00 00 00
100073a: 00
100073b: e8 80 fc ff ff call 10003c0 <.plt+0x340>
1000740: c7 85 60 ff ff ff 40 movl $0x1000640,-0xa0(%rbp)
1000747: 06 00 01
100074a: 8d 85 60 ff ff ff lea -0xa0(%rbp),%eax
1000750: 89 c0 mov %eax,%eax
1000752: ba 00 00 00 00 mov $0x0,%edx
1000757: 89 c6 mov %eax,%esi
1000759: bf 0c 00 00 00 mov $0xc,%edi
100075e: 66 90 xchg %ax,%ax
1000760: 66 66 66 2e 0f 1f 84 data16 data16 cs nopw 0x0(%rax,%rax,1)
1000767: 00 00 00 00 00
100076c: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1000773: 0f 1f 84 00 00 00 00
100077a: 00
100077b: e8 c0 fb ff ff call 1000340 <.plt+0x2c0>
1000780: 8d 85 d0 fe ff ff lea -0x130(%rbp),%eax
1000786: 89 c2 mov %eax,%edx
1000788: be 00 00 00 00 mov $0x0,%esi
100078d: bf 0c 00 00 00 mov $0xc,%edi
1000792: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
1000799: 00 00
100079b: e8 a0 fb ff ff call 1000340 <.plt+0x2c0>
10007a0: 8b 95 60 ff ff ff mov -0xa0(%rbp),%edx
10007a6: 8b 85 d0 fe ff ff mov -0x130(%rbp),%eax
10007ac: 39 c2 cmp %eax,%edx
10007ae: 75 70 jne 1000820 <main+0x180>
10007b0: b8 78 05 00 11 mov $0x11000578,%eax
10007b5: 89 c0 mov %eax,%eax
10007b7: be 0a 00 00 00 mov $0xa,%esi
10007bc: 89 c7 mov %eax,%edi
10007be: 66 90 xchg %ax,%ax
10007c0: b8 00 00 00 00 mov $0x0,%eax
10007c5: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
10007cc: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
10007d3: 0f 1f 84 00 00 00 00
10007da: 00
10007db: e8 20 f9 ff ff call 1000100 <.plt+0x80>
10007e0: bf 00 00 00 00 mov $0x0,%edi
10007e5: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
10007ec: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
10007f3: 0f 1f 84 00 00 00 00
10007fa: 00
10007fb: e8 c0 fb ff ff call 10003c0 <.plt+0x340>
1000800: eb 5e jmp 1000860 <main+0x1c0>
1000802: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1000809: 0f 1f 84 00 00 00 00
1000810: 00
1000811: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1000818: 0f 1f 84 00 00 00 00
100081f: 00
1000820: b8 b0 05 00 11 mov $0x110005b0,%eax
1000825: 89 c0 mov %eax,%eax
1000827: be 0a 00 00 00 mov $0xa,%esi
100082c: 89 c7 mov %eax,%edi
100082e: b8 00 00 00 00 mov $0x0,%eax
1000833: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
100083a: 00
100083b: e8 c0 f8 ff ff call 1000100 <.plt+0x80>
1000840: bf 00 00 00 00 mov $0x0,%edi
1000845: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
100084c: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1000853: 0f 1f 84 00 00 00 00
100085a: 00
100085b: e8 60 fb ff ff call 10003c0 <.plt+0x340>
1000860: c7 45 fc 20 06 00 01 movl $0x1000620,-0x4(%rbp)
1000867: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
100086e: 00 00 00
1000871: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1000878: 0f 1f 84 00 00 00 00
100087f: 00
1000880: 8b 55 fc mov -0x4(%rbp),%edx
1000883: b8 00 00 00 00 mov $0x0,%eax
1000888: 90 nop
1000889: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1000890: 0f 1f 84 00 00 00 00
1000897: 00
1000898: 83 e2 e0 and $0xffffffe0,%edx
100089b: 4c 01 fa add %r15,%rdx
100089e: ff d2 call *%rdx
10008a0: eb de jmp 1000880 <main+0x1e0>
10008a2: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
10008a9: 0f 1f 84 00 00 00 00
10008b0: 00
10008b1: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
10008b8: 0f 1f 84 00 00 00 00
10008bf: 00
10008c0: bf 02 00 00 00 mov $0x2,%edi
10008c5: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
10008cc: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
10008d3: 0f 1f 84 00 00 00 00
10008da: 00
10008db: e8 e0 f9 ff ff call 10002c0 <.plt+0x240>
10008e0: b8 f0 05 00 11 mov $0x110005f0,%eax
10008e5: 89 c2 mov %eax,%edx
10008e7: 8b 45 f8 mov -0x8(%rbp),%eax
10008ea: 89 c6 mov %eax,%esi
10008ec: 89 d7 mov %edx,%edi
10008ee: b8 00 00 00 00 mov $0x0,%eax
10008f3: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
10008fa: 00
10008fb: e8 00 f8 ff ff call 1000100 <.plt+0x80>
1000900: 8b 05 fa f6 01 10 mov 0x1001f6fa(%rip),%eax # 11020000 <stdout@GLIBC_2.2.5>
1000906: 89 c7 mov %eax,%edi
1000908: 0f 1f 40 00 nopl 0x0(%rax)
100090c: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1000913: 0f 1f 84 00 00 00 00
100091a: 00
100091b: e8 a0 fa ff ff call 10003c0 <.plt+0x340>
1000920: 8b 45 f8 mov -0x8(%rbp),%eax
1000923: be 0c 00 00 00 mov $0xc,%esi
1000928: 89 c7 mov %eax,%edi
100092a: 66 90 xchg %ax,%ax
100092c: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1000933: 0f 1f 84 00 00 00 00
100093a: 00
100093b: e8 c0 f9 ff ff call 1000300 <.plt+0x280>
1000940: bf 01 00 00 00 mov $0x1,%edi
1000945: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
100094c: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1000953: 0f 1f 84 00 00 00 00
100095a: 00
100095b: e8 60 f9 ff ff call 10002c0 <.plt+0x240>
1000960: eb be jmp 1000920 <main+0x280>
1000962: 90 nop
The backtrace of the segfault is:
#0 SignalCatch (sig=11, info=0x7ff43bb34af0, uc=0x7ff43bb349c0) at src/trusted/service_runtime/linux/nacl_signal.c:558
#1 <signal handler called>
#2 0x00005ccffffefe80 in ?? ()
#3 0x00005ccf010008a0 in ?? ()
#4 0x0000000001000640 in ?? ()
#5 0x0000000000000000 in ?? ()
@ [#5:SignalCatch()] p natp
which corresponds to:
1000880: 8b 55 fc mov -0x4(%rbp),%edx
1000883: b8 00 00 00 00 mov $0x0,%eax
1000888: 90 nop
1000889: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
1000890: 0f 1f 84 00 00 00 00
1000897: 00
1000898: 83 e2 e0 and $0xffffffe0,%edx
100089b: 4c 01 fa add %r15,%rdx
100089e: ff d2 call *%rdx
10008a0: eb de jmp 1000880 <main+0x1e0>
10008a2: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
10008a9: 0f 1f 84 00 00 00 00
I printed out the register values while going through the signal handle previous to the segfault. Interestingly, rbp is the segfault address (on the stack) in both the registers on entry and exit, and the signal frame. This seems like it has to do with the bug but I need to understand calling conventions here a bit more to understand whats going on.
Registers upon previous signal catch entry - this is an untrusted signal
$3 = {rax = 0x0, rbx = 0x0, rcx = 0x1, rdx = 0x5ccf01000620, rsi = 0x0, rdi = 0x110e077c, rbp = 0x5ccffffefe80, stack_ptr = 0x5ccffffefd48, r8 = 0xffffffff, r9 = 0x0, r10 = 0x0, r11 = 0x5ccf010008a0, r12 = 0x0, r13 = 0x0, r14 = 0x5ccf01000880, r15 = 0x5ccf00000000, prog_ctr = 0x5ccf01000620, flags = 0x202, cs = 0x33, ss = 0x0, ds = 0x0, es = 0x0, fs = 0x0, gs = 0x0, padding = 0x0}
we execute this line:
new_stack_ptr = regs->stack_ptr - NACL_STACK_RED_ZONE - 8; //the -8 to standardize things between trusted and untrusted
frame registers:
$5 = {rax = 0x0, rcx = 0x1, rdx = 0x5ccf01000620, rbx = 0x0, stack_ptr = 0x5ccffffefd48, rbp = 0x5ccffffefe80, rsi = 0x0, rdi = 0x110e077c, r8 = 0xffffffff, r9 = 0x0, r10 = 0x0, r11 = 0x5ccf010008a0, r12 = 0x0, r13 = 0x0, r14 = 0x5ccf01000880, r15 = 0x5ccf00000000, prog_ctr = 0x5ccf01000620, flags = 0x202}
regs on exit (same as on entry, unmodified)
$8 = {rax = 0x0, rbx = 0x0, rcx = 0x1, rdx = 0x5ccf01000620, rsi = 0x0, rdi = 0x110e077c, rbp = 0x5ccffffefe80, stack_ptr = 0x5ccffffefd48, r8 = 0xffffffff, r9 = 0x0, r10 = 0x0, r11 = 0x5ccf010008a0, r12 = 0x0, r13 = 0x0, r14 = 0x5ccf01000880, r15 = 0x5ccf00000000, prog_ctr = 0x5ccf01000620, flags = 0x202, cs = 0x33, ss = 0x0, ds = 0x0, es = 0x0, fs = 0x0, gs = 0x0, padding = 0x0}
natp user on exit
$9 = {rax = 0x0, rbx = 0x0, rcx = 0x0, rdx = 0x0, rbp = 0x5ccffffefe80, rsi = 0x0, rdi = 0xfffefbd0, rsp = 0x5ccffffefbc8, r8 = 0x0, r9 = 0x0, r10 = 0x0, r11 = 0x0, r12 = 0x0, r13 = 0x0, r14 = 0x0, r15 = 0x5ccf00000000, prog_ctr = 0x6eff00020bc0, new_prog_ctr = 0x5ccf01000640, sysret = 0x0, mxcsr = 0x1f80, sys_mxcsr = 0x1fa0, tls_idx = 0x2, fcw = 0x37f, sys_fcw = 0x37f, trusted_stack_ptr = 0x7ff43afaacc0, tls_value1 = 0x100425c0, tls_value2 = 0x0}
I've found out a few more things.
0000000001000620
Here is other print out:
Cage 2 pid: 2
Handler function ptr 0x1000640
Successfully changed the signal handler for Signal 10
Killing Cage 2 using PID: 2
****Before adjustment**: regs->stack_ptr = 0x60d0fffefd50
**After adjustment**: new_stack_ptr = 0xfffefcc8**
Return Addr before: frame->return_addr = (nil)
Return Addr after: frame->return_addr = 0x60d000011f40
Signal Handler Test!
****Before adjustment**: regs->stack_ptr = 0x60d0fffefd50
**After adjustment:** new_stack_ptr = 0xfffefcc8**
Return Addr before: frame->return_addr = 0x60d000011f60
Return Addr after: frame->return_addr = 0x60d000011f40
Signal Handler Test!
****Before adjustment**: regs->stack_ptr = 0x60d0fffefd50
**After adjustment**: new_stack_ptr = 0xfffefcc8**
Return Addr before: frame->return_addr = 0x60d000011f60
Return Addr after: frame->return_addr = 0x60d000011f40
Signal Handler Test!
****Before adjustment**: regs->stack_ptr = 0x60d0fffefd48
**After adjustment**: new_stack_ptr = 0xfffefcc0**
Return Addr before: frame->return_addr = 0xffffffff
Return Addr after: frame->return_addr = 0x60d000011f40
Signal Handler Test!
Signal 11 from untrusted code: pc=60d0fffefe80
After the stack_ptr drops, the segfault occurs.
It seems like the call pushing EIP at the beginning is somehow causing the stack to be off by two bytes on return?
Here is the new test code attached:
void print_registers() { unsigned long rax, rbx, rcx, rdx, rsp, rip, rbp, rsi, rdi;
asm volatile (
"mov %%rax, %0\n"
"mov %%rbx, %1\n"
"mov %%rcx, %2\n"
"mov %%rdx, %3\n"
"mov %%rsp, %4\n"
"mov %%rsi, %5\n"
"mov %%rdi, %6\n"
"call 1f\n" // call the next instruction
"1: pop %%rax\n" // pop the return address (RIP) from the stack
"mov %%rax, %7"
: "=m" (rax), "=m" (rbx), "=m" (rcx), "=m" (rdx),
"=m" (rsp), "=m" (rsi), "=m" (rdi), "=m" (rip)
:
: "rax", "rbx", "rcx", "rdx", "rsp", "rsi", "rdi"
);
printf("RAX: %lx\n", rax);
printf("RBX: %lx\n", rbx);
printf("RCX: %lx\n", rcx);
printf("RDX: %lx\n", rdx);
printf("RSP: %lx\n", rsp);
printf("RIP: %lx\n", rip);
printf("RSI: %lx\n", rsi);
printf("RDI: %lx\n", rdi);
}
int cnt = 0; int flag = 0; int somefunc() { if (flag) { printf("I can enter somefunc: %d\n", cnt); cnt += 1; print_registers(); // print values of registers fflush(stdout); } flag = 0; return 0;
}
void handler(int signum) {
printf("Signal Handler Test!\n");
fflush(stdout);
flag = 1;
return;
}
int main(void) { struct sigaction new_action; struct sigaction old_action;
pid_t pid = fork();
if (pid == 0) {
printf("Cage 2 pid: %ld\n", (long)getpid());
printf("Handler function ptr %p\n", handler);
fflush(stdout);
new_action.sa_handler = handler;
sigaction(SIGUSR2, &new_action, NULL);
sigaction(SIGUSR2, NULL, &old_action);
if (new_action.sa_handler == old_action.sa_handler) {
printf("Successfully changed the signal handler for Signal %d\n", SIGUSR1);
fflush(NULL);
} else {
printf("Fail to correctly change the signal handler for Signal %d\n", SIGUSR1);
fflush(NULL);
}
int (*f)() = &somefunc;
while (1) {
sleep(1);
f();
}
} else {
// Cage 1
sleep(2);
printf("Killing Cage 2 using PID: %ld\n", (long)pid);
fflush(stdout);
while (1) {
kill(pid, SIGUSR2);
sleep(1);
}
}
}
One thing that I want to point out is, if we add sleep(1) before calling f(), the program runs perfectly fine.
Here is the source code to print out current stack:
#include <errno.h>
#include <string.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdint.h>
int cnt = 0;
int flag = 0;
int somefunc()
{
uintptr_t stack_var; // Local variable to get an address on the stack
uintptr_t* stack_ptr = &stack_var;
size_t depth = 100; // Number of stack entries to print
size_t i;
if (flag) {
printf("I can enter somefunc: %d\n", cnt);
cnt += 1;
// print_registers(); // print values of registers
printf("Stack contents:\n");
for (i = 0; i < depth; ++i) {
printf("[%lu] Address: %p, Value: 0x%lx\n", i, stack_ptr, *stack_ptr);
stack_ptr--; // Decrement to move to the next stack entry (for architectures where the stack grows downwards)
}
fflush(stdout);
}
flag = 0;
return 0;
}
void handler(int signum)
{
printf("Signal Handler Test!\n");
fflush(stdout);
flag = 1;
return;
}
int main(void)
{
// struct sigaction new_action = {0};
// struct sigaction old_action = {0};
struct sigaction new_action;
struct sigaction old_action;
pid_t pid = fork();
if (pid == 0) {
printf("Cage 2 pid: %ld\n", (long)getpid());
printf("Handler function ptr %p\n", handler);
fflush(stdout);
new_action.sa_handler = handler;
sigaction(SIGUSR2, &new_action, NULL);
sigaction(SIGUSR2, NULL, &old_action);
if (new_action.sa_handler == old_action.sa_handler) {
printf("Successfully changed the signal handler for Signal %d\n", SIGUSR1);
fflush(NULL);
} else {
printf("Fail to correctly change the signal handler for Signal %d\n", SIGUSR1);
fflush(NULL);
}
int (*f)() = &somefunc;
while (1) {
// sleep(1);
f();
}
} else {
// Cage 1
sleep(2);
printf("Killing Cage 2 using PID: %ld\n", (long)pid);
fflush(stdout);
while (1) {
kill(pid, SIGUSR2);
sleep(1);
}
}
}
Wanted to write up a bit summarizing the current state of this issue and possibly get some feedback.
Bug
Segfault when catching signal and installing handler directly after call
instruction (both direct and indirect). This segfault is triggered upon the return from the called function.
Order
Background In case anyone needs a refresher on how this is all handled in x86 assembly, I think this writeup is pretty straightforward.
Essentially, the call instruction is pushing the next address onto the stack as a return address, then setting the program counter to the called function. Then the called function executes its "prologue" which pushes the base pointer onto the stack (to be restored later) and then sets the basepointer to the stack pointer to begin the function frame.
Problem 1: Comparing to Native I think the next thing we need to achieve is get a view of the register/stack state for a native example, but since this edge case is pretty specific I'm trying to figure out how to replicate it.
I think we may be able to do this:
Hopefully that works? If anyone else has any ideas that would be great
Potential Problem 2: Identifying Edge Case in NaCl I think regardless of the solution we come up with here, we'll need the NaCl signal mechanism to identify that we're executing this edge case. From there we can either 1) use the same sigtrap program counter advance we used to go to the next 32 byte boundary 2) Actually fix the edge case here. (2 is probably the better option as I'm not sure 1 will leave us enough 32 byte alignment options for signal returns)
All functions will start with a 32-byte aligned address and with the first two instructions being the prologue:
push %rbp
mov %rsp,%rbp
Can we reliably check if the pc is 32-byte aligned and those are the next two instructions to identify this case? I feel like it may be possible that this could be true at other places in the program? If so, is that a problem?
We're going to keep experimenting here, but @moyix @jesings @JustinCappos would love any input you guys might have
So I've identified the problem and believe I have a potential solution though some of the logistics still need to be figured out.
When setting up the signal handler. We allocate an exception frame on the stack, so do so we need to calculate an appropriate new stack pointer address. In practice it looks like this:
new_stack_ptr = regs->stack_ptr - NACL_STACK_RED_ZONE - 8; //the -8 to standardize things between trusted and untrusted
new_stack_ptr -= sizeof(struct NaClExceptionFrame) - NACL_STACK_PAD_BELOW_ALIGN;
**new_stack_ptr = new_stack_ptr & ~NACL_STACK_ALIGN_MASK;**
new_stack_ptr -= NACL_STACK_ARGS_SIZE;
new_stack_ptr -= NACL_STACK_PAD_BELOW_ALIGN;
The NACL_STACK_ALIGN_MASK is used to ensure 16 byte alignment for the stack, which is necessary for SSE instructions to work correctly.
Until we encountered this bug/segfault we didn't come across any situations where the stack was not 16byte aligned while handling (or if we did they didn't cause an error) however here we push the return address onto the stack right before the signal is caught. This means 1. the stack is not 16-byte aligned (we just pushed 8 bytes), and 2. that address is necessary because it holds the address the PC needs to go to after the function call.
This mask operation is causing the program later to attempt to jmp to an address on the stack that where rsp is off by 8 bytes in the function @jesings created "NaClTrampolineRegRestore" (code below). The lea 136...
insn is what restores the stack ptr which is off by the amount RED_ZONE + PAD_ALIGN (which is 136).
It's trivial to store the offset of the align mask to use to later restore correctly but actually adding that to the stack ptr at the end is difficult since all the registers are in use. To fix this we'll need to find a way to manipulate the amount of bytes that lea takes to take in account for the 8byte offset.
It's also possible (and probably likely, though we haven't seen it in practice ) that this bug could be extended to situations with a 4-byte offset. I'm still trying to understand the x86 conventions but seemingly it also could extend to all 1-15 byte offsets...
NaClTrampolineRegRestore:
# restore sigprocmask using custom syscall
# we must return to a 32 byte address,so pad with nops
.nops 23
lea 130(%rip), %rax
call *%rax # 128 + 2 where instruction is 2 bytes for this indirect call, this points to the address of the NaClSysSigmaskSigreturn trampoline entry
# we do this as an indirect call so that the call being at the end of a 32 byte section is easier
.align 32
addl $64, %esp /* skip unused fields */
addq %r15, %rsp
popq %rax
popq %rcx
popq %rdx
popq %rbx
popq %rsi #we avoid restoring rsp in this manner, so pop to a scratch register to be clobbered
popq %rsi #use a scratch register for rbp too
movl %esi, %ebp
addq %r15, %rbp
popq %rsi
popq %rdi
popq %r8
popq %r9
popq %r10
popq %r11
popq %r12
popq %r13
.align 32
addl $16, %esp # skip restoring r14, r15
addq %r15, %rsp
popq %r14
addl $24, %esp #skip flags for now, other unused fields
addq %r15, %rsp
.align 32
# andl $0xffffffe0, %r14d
movl %r14d, %r14d
addq %r15, %r14
andl $0x7ff, (%rsp) #this is cleared by the system anyway, but can't hurt
popfq #now restore flags from the duplicate location
leaq 136(%rsp), %rsp
jmp *%r14
So the above problem has been fixed. Sadly I've re-encounted the bug in hashbrown::iter() that is seemingly hard to identify. It also only happens sometimes depending on compilation it seems.
I tried to figure out how to actually determine state when the bug hits for a while since its a huge application and I can't recreate the bug in a debugger. I finally figured out how to enable coredumps in docker and was able to get a coredump and print info in gdb, I've put the threaded back trace and register state into this document: https://docs.google.com/document/d/1Z7GrSlxQlp6Vr6iYTCKtZR4mqFUFsaKtsC1Thj1I6Rk/edit
This was finally implemented with the mega-merge. Thank you to @jesings @AlpacaMax and all that helped here!
We've talked about eventually adding signal handling functionality in the past, and it seems like its necessary for pgbench to work, and possibly the rest of postgres, which seems heavily dependent on signals for communication.
I think its time to take a whack at this. Max is going to attempt to implement the syscalls for sigaction(), sigprocmask, and kill(), while Jonathan is going to attempt to have nacl call the registered function.
Will set up a meeting next week for us to go over this and put together a small design document.