Lind-Project / lind_project

Lind: Secure Lightweight Adaptive Isolation
https://hub.docker.com/r/securesystemslab/lind
Apache License 2.0
23 stars 8 forks source link

Signals #296

Closed rennergade closed 8 months ago

rennergade commented 1 year ago

We've talked about eventually adding signal handling functionality in the past, and it seems like its necessary for pgbench to work, and possibly the rest of postgres, which seems heavily dependent on signals for communication.

I think its time to take a whack at this. Max is going to attempt to implement the syscalls for sigaction(), sigprocmask, and kill(), while Jonathan is going to attempt to have nacl call the registered function.

Will set up a meeting next week for us to go over this and put together a small design document.

rennergade commented 1 year ago

Max has implemented kill() and sigaction() and has passed that off to @jesings who's going to work on the signal handler. Max now is going to work on implementing sigprocmask().

jesings commented 1 year ago

Currently I am able to receive a signal while the receiving thread is in untrusted code and then dispatch to a signal handler in untrusted code. In order to be able to return from that signal handler we need to do some return address hijacking on the stack to some trampoline stub that will restore the registers from their previous state. Additionally we need to figure out what we do when we receive a signal when the thread is in trusted code--we can't immediately process it because what if we're i.e. in the middle of modifying the vmmap. We should discuss how to implement this.

jesings commented 1 year ago

I am looking to implement this by adding an untrusted function on the springboard whose return address is placed onto the stack right below the stack frame for the handler. The NaCl signal handler currently does the following relevant things: It starts a new stack frame below the red zone of the stack frame that received the signal By creating a new stack pointer address for untrusted And then subtracts more space from that stack pointer address In order to have enough room to store this struct which contains, among other things, all of the registers It then populates the registers and other stuff into that stack The stack frame for the signal handling function starts below this one So when we return from the signal handler, these contents will still be on the stack And so we should return from the signal handler to some function that can actually populate these into our registers

rennergade commented 1 year ago

Great progress, thank you! I think this makes sense as a way to handle returns.

@moyix doe this make sense to you? any concerns or advice?

jesings commented 1 year ago

This register restoration technique actually won't work safely either in cagespace or in trusted without clobbering a register, which would be bad.

The issue is that when we return from the cagespace signal handler, we would execute some code that would need to restore registers, and after that code was finished we would need to return execution back to the point in cagespace that we were right before the signal was received. This code could either be in cagespace or trusted, and this problem would hold either way.

We need to restore every single register before finally restoring the instruction pointer, as once we restore the instruction pointer we start executing the pre-signal cagespace code again. Once we have restored every register except the instruction pointer, we are left in a situation where we cannot clobber any of these registers as we need them to be exactly as they were before cagespace received the signal. x86 does not allow you to change the instruction pointer without either using a fixed address jmp (which obviously doesn't work for our case), a different register for an indirect jmp, or using a ret to pop a return address from the stack and put it into %rip. Using a register for an indirect jump clearly doesn't work as that would require clobbering it. Returns from the stack are fraught because at this point the stack pointer points to cagespace memory which in order to safely use as a return address needs to have its top 32 bits masked off by the value in %r15, which is handled in nacl (in the naclret pseudo-instruction) by using the %r11 register to hold the address, masking the top 32 bits of that register off, and then using it as the target for an indrect jump--this clobbers %r11 so is no good, and in fact any such masking scheme requires the clobbering of a register.

Therefore, accomplishing this without for example adding a new reserved register in nacl-gcc seems like it may not be possible in a satisfactory way.

moyix commented 1 year ago

Couldn't you do something like:

  1. restore all registers except RIP
  2. push r11
  3. read saved RIP into r11
  4. mask r11
  5. write masked r11 back to stack
  6. pop r11
  7. ret

Or is the concern that since the stack is in untrusted memory someone could modify the saved RIP in between masking it and the ret?

moyix commented 1 year ago

I think the other part I don't understand is – if the normal naclret approach clobbers R11, why isn't it a problem for them?

jesings commented 1 year ago

@moyix Yes, the modification of the stack memory in between saving RIP and masking it is indeed the concern. I have that implemented already but I think the TOCTTOU attack really breaks NaCl's security model. The normal naclret is fine because r11 is a caller saved register so there is no problem if it gets clobbered in the return statement. However for signal handlers, we need to restore both caller and callee saved registers so that won't work.

rennergade commented 1 year ago

@moyix we're going to investigate if we can reserve another register and its effects. Do you think this is a good idea? We're struggling to come up with any other options.

moyix commented 1 year ago

The only other option I can think of is waiting to deliver the signal until you can guarantee that no threads in the cage are running (i.e. by blocking them from returning to untrusted code), which removes the TOCTOU since no one else is able to write in between.

I don't have a strong opinion on which of these options is easier to implement – also note that burning a register is fairly drastic on an arch like x86 where registers are scarce so I'm not sure how bad the perf hit would be.

Technically there are more exotic options like TSX, but I think those are not widely supported and possibly Intel is going to kill them.

rennergade commented 1 year ago

So apparently we can reserve a register without changing the gcc code with the -ffixed-reg flag. I experimented with it and it seems to work.

We can possibly avoid adding overhead in situations where we don't need signal handlers by only compiling with that for programs that use handlers (and disabling sigaction when we don't). May be overkill but its an idea.

As far as testing the performance hit, any ideas for benchmarks?

JustinCappos commented 1 year ago

I would think that the micro and macro benchmarks for the paper would be a good starting place.

JustinCappos commented 1 year ago

Returns from the stack are fraught because at this point the stack pointer points to cagespace memory which in order to safely use as a return address needs to have its top 32 bits masked off by the value in %r15, which is handled in nacl (in the naclret pseudo-instruction) by using the %r11 register to hold the address, masking the top 32 bits of that register off, and then using it as the target for an indrect jump--this clobbers %r11 so is no good, and in fact any such masking scheme requires the clobbering of a register.

I think I may have a solution, but please check it carefully.

I believe that the modifications that happen to r11 and r15 by NaCl code are for memory safety and the concern of returning from the stack is that you may be in the middle of a naclret pseudo instruction when you return from a signal.

My understanding is that r11 is set at the beginning of this pseudo instruction and is modified by r15.

Couldn't we check to see if we are in the middle of this sequence and just rewind the instruction pointer to be at the start of it?

Alternatively, consider the following. I believe that NaCl only uses %r15 for masking and only emits code that has a pattern set %r15 to the mask, mask %r11, jmp, etc. Could we detect if the next instruction would use %r15 for masking and if so rewind the IP by one instruction to where %r15 is set? We should then be able to safely use %r15, I think.

(I haven't thought through this carefully or re-examined the NaCl spec, so please view my suggestion with critical eye...)

jesings commented 1 year ago

While r15 is only used in NaCl code in pseudo instructions for memory safety to mask things, r11 is used as a general purpose register. r11 is also used in the naclret pseudo-instruction which clobbers it, but in the body of a function before the naclret it can be used arbitrarily. The naclcall and nacljmp pseudo instructions don't need r11 because they are given as a parameter a register value to jmp to/call. Thus we can't use either technique, as neither would be a complete solution to the issue. A lot of the micro benchmarks are kinds of workloads that would not benefit at all from 1 more register being available in untrusted, either because most of the program runtime would be within trusted (e.g. pipes), or because they are just doing memcpy/other simple operations which don't even use that many registers. I expect that in integer computation tasks things registers might be helpful but in those I would expect that one could use SIMD. What would be a relatively simple micro-benchmark that would use many registers?

JustinCappos commented 1 year ago

Does r15 get set before it is used each time or is it always assumed to have the correct value?

jesings commented 1 year ago

r15 is assumed to be a constant mask while in untrusted code and is never allowed to be modified except in trusted code, so we couldn't use it either.

jesings commented 1 year ago

As things stand receiving signals and returning from them works in untrusted 100% of the time and works in trusted 80% of the time. There remain two bugs. One of which which makes the program stall after the signal is received appears to be that the signal is received by the wrong thread. This happens maybe 10 percent of the time and I'm not sure exactly what's happening. @rennergade may have some insight. The other guy is when we receive signals in trusted while the trusted thread is trying to switch back to untrusted, we end up restoring the trusted state but in reality need to back up the program to set the untrusted state post syscall return to what we want. Thus we need to detect when we're switching syscalls. As far as I can tell the best way to do this is to unwind the stack and see if NaClSwitchToApp is on the call stack but this seems imperfect in case the gcc used to compile NaCl does something like frame pointer elision. I think it should be ok but if there's another way it's worth trying. I also considered checking the instruction pointer we are trying to return to and seeing if it is within one of a number of functions by taking the start and end address of these functions. This is more finicky and can go wrong. Finally there is a performance concern with storing the signal pending flag in a dash map in rust rather than in a thread local variable which would perform much better.

rennergade commented 1 year ago

Thanks @jesings this is awsome! Some responses:

  1. I ran my a test ~25 times and was not able to reproduce either. This first issue with the wrong threads is a bit puzzling. If you can send me a way to reproduce that that would be great. Maybe its a signal from RR?

  2. This one makes sense. Can we implement some sort of flag system so we know which one to change to at the beginning of NaClSwitch? is that possible?

  3. Yeah definitely agree with using a TL variable for this in nacl, think it will be much cleaner and definitely more performant.

jesings commented 1 year ago

So it seems both errors happen when the signal is caught in specific places in trusted--the loop error happens if the signal is received while in the code in trusted before the natp is populated with the user register values, and the seg fault error happens if the signal is caught in parts of trusted code that happen while restoring user register state (i.e. some registers are already restored). We can set a flag between the points where the natp is fully initialized and where we start to copy out so we know that if this flag is set when a signal is received then it's safe to proceed normally. However, if this flag is not set we need to do something different. In the case that the syuscall is caught in the beginning of natp copying, we probably need to restore the natp state from the registers dumped onto the user stack within the signal handler. In the case that the signal is caught on exit of the syscall then we just need to rerun the copying-out-to-user code. However, detecting whether we're in the entry or exit of a syscall is annoying. A simple flag wouldn't work because it takes instructions in trusted to set/unset that flag, and while that flag would be being set/unset there would be points in the code where the flag says we're not in entry when we are. We can try doing stack unwinding manually to see whether we have an enclosing function that signifies the end or start of the syscall. Or we can try to write the entry entirely in assembly and statically check the addresses of the function and check if our program counter is within the addresses of the syscall entry functions. There might be another approach but this is where things currently stand.

JustinCappos commented 1 year ago

Or we can try to write the entry entirely in assembly and statically check the addresses of the function and check if our program counter is within the addresses of the syscall entry functions.

This is what my mind immediately jumped to. In theory we could make sure these are placed in some known, reserved ranges to make this easier (i.e. checking a bitmask on the address).

(This seems to be a reasonable way to do so, but I'm not sure how this works with NaCl's toolchain: https://stackoverflow.com/questions/18778323/how-to-specify-a-memory-location-at-which-function-will-get-stored. You may also need to pad this or do some other trickery so that other things don't get stored there as well. )

rennergade commented 1 year ago

We should have control of the function placement so I don't think that part should be an issue, just wondering how difficult the assembly rewrite is going to be.

JustinCappos commented 1 year ago

Why does the handler have to be entirely in assembly? Why not just the address checking portion?

On Mon, Jul 10, 2023 at 9:49 PM Nicholas Renner @.***> wrote:

We should have control of the function placement so I don't think that part should be an issue, just wondering how difficult the assembly rewrite is going to be.

— Reply to this email directly, view it on GitHub https://github.com/Lind-Project/lind_project/issues/296#issuecomment-1629978940, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGROD7Z2PSOKNW5LYKBYRTXPSWJVANCNFSM6AAAAAAUXAIJOY . You are receiving this because you commented.Message ID: @.***>

jesings commented 1 year ago

The handler doesn't have to be all in assembly--there are two functions called on syscall entry before the trusted state is suitable to receive a signal normally. We can actually check the address bounds by just putting a label in a c function in inline assembly so it turns out we don't really need to rewrite things in assembly.

jesings commented 1 year ago

So two issues as far as I can tell remain, although we are up to about a 98 percent success rate. The first issue is with the GetTLSFastPath1 and 2 functions which have a completely separate path in and out of trusted code to normal syscalls. I need to do more inquiry into this but it should be solvable. The second issue is a race condition which I want some input on if anyone has. The syscall entry code sets the instruction pointer value to return to when the syscall returns. The signal code modifies this same value. We can check in a syscall if a signal was received already using a flag and decide not to modify this value (and instead modify a separate value), however whatever ways I can think of to do this result in a TOCTTOU error as signals can be received between the check and the set. I have a vague sense that maybe we could use the xchg instruction to allow us to set a value and receive an old value in a single instruction in order to test whether we need to run cleanup code, but this idea isn't fully solidified in my head yet. Does it seem reasonable? Does anyone have any other suggestions for how to avoid this TOCTTOU?

rennergade commented 1 year ago

I've unified @jesings changs with the syscall implementations that @AlpacaMax made in the Spring, as well as updating them with the latest develop and adding some glue for everything. The good news is that postgres now runs further, and the Reaper() server loop that we had been faking now runs correctly via signals instead of a hack job.

Unfortunately postgres still seems to hang. @RusherRG and I have tracked the bug down to SIGUSR2 signals not being properly caught from the postgres autovacuum. Here it becomes a headscratcher.

We've set up Lind's signal handling to always send signals, and for the NaCl signal handler SignalCatch to always receive and handle them. There we decide if we want to send them to a user program handler, or mask and resend later, return, or terminate.

However, we never catch SIGUSR2 via SignalCatch, which is deeply confusing. @Yaxuan-w helped me make user tests for SIGUSR2 which seem to work fine. Yet we don't receive them running the postgres/pgbench suite.

The other thing thats striking is it seems that all of the threads are stuck in rustposix's poll()/select() implementation when they are not receiving signals. I'm not sure if this is just a coincidence because postgres seems to heavily use poll/select. This did lead me to fix a bug where we weren't properly doing an EINTR when poll/select received signals, but still no dice on the larger bug.

IMHO, the only thing that would stop SignalCatch from receiving the SIGUSR2 signal is if it was actually masked by the host. The only places that could happen are sigaction (which is only called at setup, definitely not the source of the problem), or returning via the kernels sigreturn or Jonathan's added NaClSysSigreturn for returning from user handlers.

However, combing through these it seems like none of those are happening on the thread before the sigusr2 is sent, and its still not being received.

@RusherRG and I are going to continue to try to solve this one. If anyone has any ideas let us know.

rennergade commented 1 year ago

@RusherRG @Yaxuan-w and I fixed the above issues. The no signal problem was solved by properly resetting the signal mask on forks, and we fixed a concurrency issue by implementing semaphores as well as fixing the return values on error from glibc.

There is either one or two errors left thats preventing us from running the full stack. The first one is a known signal bug and was documented by @jesings before he left:

* One rare bug remains, when we are signaled in untrusted (whether after single stepping or not)
 * and the last untrusted instruction executed is an indirect call (or probably jmp), we end up
 * seg faulting somewhere down the line. I bet the stack is somehow somewhat wonky in this case
 * but sadly I have to leave this bug unsolved before I leave.
 */

This may or may not be causing the second bug which is a mysterious RustPOSIX panic! inside of the DashMap iter() routine, which doesn't really make sense. Since that one only happens when signals are enabled I'm going to try to fix the known bug first.

rennergade commented 1 year ago

This bug seems to panic w/ a stack address as the program counter, which is definitely wrong. I managed to stop when it catches the sigsegv and get a (very bizzarre) backtrace as well as the register values.

registers:

$5 = {rax = 0x18, rbx = 0x1f036a8, rcx = 0x1024c3f8, rdx = 0x10140450, rsi = 0x1024c3f8, rdi = 0x10140450, rbp = 0x619ffffee4a0, stack_ptr = 0x619ffffee478, r8 = 0xfffee466, r9 = 0x530f, r10 = 0x0, r11 = 0x619ffffee4a0, r12 = 0x0, r13 = 0x193, r14 = 0x619f01a95200, r15 = 0x619f00000000, prog_ctr = 0x619ffffee4a0, flags = 0x10306, cs = 0x33, ss = 0x0, ds = 0x0, es = 0x0, fs = 0x0, gs = 0x0, padding = 0x0}

backtrace:

Thread 17 (Thread 0x7ffff735c640 (LWP 296795) "sel_ldr"):
#0  SignalCatch (sig=11, info=0x7ffff7360af0, uc=0x7ffff73609c0) at src/trusted/service_runtime/linux/nacl_signal.c:686
#1  <signal handler called>
#2  0x0000619ffffee4a0 in ?? ()
#3  0x0000619f01aa1060 in ?? ()
#4  0x0000000100000010 in ?? ()
#5  0x1024c3f8f63ec0f4 in ?? ()
#6  0x2000000000000001 in ?? ()
#7  0x101404501012a8d0 in ?? ()
#8  0x0000619ffffee510 in ?? ()
#9  0x0000619f01ab3de0 in ?? ()
#10 0x20000000000001f2 in ?? ()
#11 0xfffee83000000001 in ?? ()
#12 0x1013e168fffee850 in ?? ()
#13 0x1012a8d0f63ead20 in ?? ()
#14 0x0100619f1024c3f8 in ?? ()
#15 0x000000001012ed48 in ?? ()
#16 0x0000619ffffee500 in ?? ()
#17 0x100737380142d9e0 in ?? ()
#18 0x1024c3f801774aa0 in ?? ()
#19 0x1012ed4800000001 in ?? ()
#20 0x0000619ffffee520 in ?? ()
#21 0x000000001012a8d0 in ?? ()
#22 0x0000619ffffee540 in ?? ()
#23 0x0000619f010e2240 in ?? ()
#24 0xfffee850fffee830 in ?? ()
#25 0x1012f6a01013e168 in ?? ()
#26 0xfffee57ffffee560 in ?? ()
#27 0x000000001012f6a0 in ?? ()
#28 0x0000619ffffee590 in ?? ()
#29 0x0000619f010e24e0 in ?? ()
#30 0x2000000000000001 in ?? ()
#31 0xfffee801fffeea10 in ?? ()
#32 0xfffee850fffee830 in ?? ()
#33 0xf63ead201013e168 in ?? ()
#34 0x000001f2f6df6e00 in ?? ()
#35 0xfffeea101012ed48 in ?? ()
#36 0x0000000100000000 in ?? ()
#37 0x00000000f63ead20 in ?? ()
#38 0x0000619ffffee950 in ?? ()
#39 0x0000619f010a5de0 in ?? ()
#40 0xfffee8a000000000 in ?? ()
#41 0xfffee860111d2001 in ?? ()
#42 0x0000619ffffee650 in ?? ()
#43 0x0000000001729b01 in ?? ()
#44 0x1013e028fffee501 in ?? ()
#45 0xf63e7ca0f63ead20 in ?? ()
#46 0x0000000000000034 in ?? ()
#47 0x100d544001774a70 in ?? ()
#48 0x0004000300020001 in ?? ()
#49 0x0008000700060005 in ?? ()
#50 0x000c000b000a0009 in ?? ()
#51 0x0010000f000e000d in ?? ()
#52 0x0014001300120011 in ?? ()
#53 0x0018001700160015 in ?? ()
#54 0x001c001b001a0019 in ?? ()
#55 0x0020001f001e001d in ?? ()
#56 0x0024002300220021 in ?? ()
#57 0x0028002700260025 in ?? ()
#58 0x002c002b002a0029 in ?? ()
#59 0x0030002f002e002d in ?? ()
#60 0x0034003300320031 in ?? ()
#61 0x0038003700360035 in ?? ()
#62 0x003c003b003a0039 in ?? ()
rennergade commented 1 year ago

I was able to create a minimal example for this bug. The example forks a child, registers a signal hanlder, and then loops continuously over an indirect call. The parent then loops signaling the child. Here is the code:

#include <errno.h>
#include <string.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h> 

int somefunc()
{
    return 0;

}

void handler(int signum)
{
     printf("Signal Handler Test!\n");
     fflush(stdout);
    return;
}

int main(void)
{
     struct sigaction new_action;
     struct sigaction old_action;

     pid_t pid = fork();

     if (pid == 0) {
               printf("Cage 2 pid: %ld\n", (long)getpid());
               printf("Handler function ptr %p\n", handler);
               fflush(stdout);

               new_action.sa_handler = handler;

               sigaction(SIGUSR2, &new_action, NULL);
               sigaction(SIGUSR2, NULL, &old_action);

               if (new_action.sa_handler == old_action.sa_handler) {
                    printf("Successfully changed the signal handler for Signal %d\n", SIGUSR1);
                    fflush(NULL);
               } else {
                    printf("Fail to correctly change the signal handler for Signal %d\n", SIGUSR1);
                    fflush(NULL);
               }

          int (*f)() = &somefunc;
          while (1) {
               f();
          }
     } else {
               // Cage 1
               sleep(2);
               printf("Killing Cage 2 using PID: %ld\n", (long)pid);
               fflush(stdout);

               while (1) {
                    kill(pid, SIGUSR2);
                    sleep(1);
               }
     }

}
rennergade commented 1 year ago

for reference this example disassembles to:


0000000001000620 <somefunc>:
 1000620:   55                      push   %rbp
 1000621:   48 89 e5                mov    %rsp,%rbp
 1000624:   b8 00 00 00 00          mov    $0x0,%eax
 1000629:   41 5b                   pop    %r11
 100062b:   44 89 dd                mov    %r11d,%ebp
 100062e:   4c 01 fd                add    %r15,%rbp
 1000631:   41 5b                   pop    %r11
 1000633:   41 83 e3 e0             and    $0xffffffe0,%r11d
 1000637:   4d 01 fb                add    %r15,%r11
 100063a:   41 ff e3                jmp    *%r11
 100063d:   0f 1f 00                nopl   (%rax)

0000000001000640 <handler>:
 1000640:   55                      push   %rbp
 1000641:   48 89 e5                mov    %rsp,%rbp
 1000644:   83 ec 10                sub    $0x10,%esp
 1000647:   4c 01 fc                add    %r15,%rsp
 100064a:   89 7d fc                mov    %edi,-0x4(%rbp)
 100064d:   bf 38 05 00 11          mov    $0x11000538,%edi
 1000652:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
 1000659:   00 00 
 100065b:   e8 e0 fa ff ff          call   1000140 <.plt+0xc0>
 1000660:   8b 05 9a f9 01 10       mov    0x1001f99a(%rip),%eax        # 11020000 <stdout@GLIBC_2.2.5>
 1000666:   89 c7                   mov    %eax,%edi
 1000668:   0f 1f 40 00             nopl   0x0(%rax)
 100066c:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 1000673:   0f 1f 84 00 00 00 00 
 100067a:   00 
 100067b:   e8 40 fd ff ff          call   10003c0 <.plt+0x340>
 1000680:   48 89 ec                mov    %rbp,%rsp
 1000683:   41 5b                   pop    %r11
 1000685:   44 89 dd                mov    %r11d,%ebp
 1000688:   4c 01 fd                add    %r15,%rbp
 100068b:   41 5b                   pop    %r11
 100068d:   41 83 e3 e0             and    $0xffffffe0,%r11d
 1000691:   4d 01 fb                add    %r15,%r11
 1000694:   41 ff e3                jmp    *%r11
 1000697:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
 100069e:   00 00 

00000000010006a0 <main>:
 10006a0:   55                      push   %rbp
 10006a1:   48 89 e5                mov    %rsp,%rbp
 10006a4:   81 ec 30 01 00 00       sub    $0x130,%esp
 10006aa:   4c 01 fc                add    %r15,%rsp
 10006ad:   66 66 66 66 66 2e 0f    data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 10006b4:   1f 84 00 00 00 00 00 
 10006bb:   e8 c0 fc ff ff          call   1000380 <.plt+0x300>
 10006c0:   89 45 f8                mov    %eax,-0x8(%rbp)
 10006c3:   83 7d f8 00             cmpl   $0x0,-0x8(%rbp)
 10006c7:   0f 85 f3 01 00 00       jne    10008c0 <main+0x220>
 10006cd:   66 66 66 66 66 2e 0f    data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 10006d4:   1f 84 00 00 00 00 00 
 10006db:   e8 60 fb ff ff          call   1000240 <.plt+0x1c0>
 10006e0:   ba 4d 05 00 11          mov    $0x1100054d,%edx
 10006e5:   89 d2                   mov    %edx,%edx
 10006e7:   89 c6                   mov    %eax,%esi
 10006e9:   89 d7                   mov    %edx,%edi
 10006eb:   b8 00 00 00 00          mov    $0x0,%eax
 10006f0:   66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
 10006f7:   00 00 00 00 
 10006fb:   e8 00 fa ff ff          call   1000100 <.plt+0x80>
 1000700:   b8 5e 05 00 11          mov    $0x1100055e,%eax
 1000705:   89 c0                   mov    %eax,%eax
 1000707:   be 40 06 00 01          mov    $0x1000640,%esi
 100070c:   89 c7                   mov    %eax,%edi
 100070e:   b8 00 00 00 00          mov    $0x0,%eax
 1000713:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
 100071a:   00 
 100071b:   e8 e0 f9 ff ff          call   1000100 <.plt+0x80>
 1000720:   8b 05 da f8 01 10       mov    0x1001f8da(%rip),%eax        # 11020000 <stdout@GLIBC_2.2.5>
 1000726:   89 c7                   mov    %eax,%edi
 1000728:   0f 1f 40 00             nopl   0x0(%rax)
 100072c:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 1000733:   0f 1f 84 00 00 00 00 
 100073a:   00 
 100073b:   e8 80 fc ff ff          call   10003c0 <.plt+0x340>
 1000740:   c7 85 60 ff ff ff 40    movl   $0x1000640,-0xa0(%rbp)
 1000747:   06 00 01 
 100074a:   8d 85 60 ff ff ff       lea    -0xa0(%rbp),%eax
 1000750:   89 c0                   mov    %eax,%eax
 1000752:   ba 00 00 00 00          mov    $0x0,%edx
 1000757:   89 c6                   mov    %eax,%esi
 1000759:   bf 0c 00 00 00          mov    $0xc,%edi
 100075e:   66 90                   xchg   %ax,%ax
 1000760:   66 66 66 2e 0f 1f 84    data16 data16 cs nopw 0x0(%rax,%rax,1)
 1000767:   00 00 00 00 00 
 100076c:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 1000773:   0f 1f 84 00 00 00 00 
 100077a:   00 
 100077b:   e8 c0 fb ff ff          call   1000340 <.plt+0x2c0>
 1000780:   8d 85 d0 fe ff ff       lea    -0x130(%rbp),%eax
 1000786:   89 c2                   mov    %eax,%edx
 1000788:   be 00 00 00 00          mov    $0x0,%esi
 100078d:   bf 0c 00 00 00          mov    $0xc,%edi
 1000792:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
 1000799:   00 00 
 100079b:   e8 a0 fb ff ff          call   1000340 <.plt+0x2c0>
 10007a0:   8b 95 60 ff ff ff       mov    -0xa0(%rbp),%edx
 10007a6:   8b 85 d0 fe ff ff       mov    -0x130(%rbp),%eax
 10007ac:   39 c2                   cmp    %eax,%edx
 10007ae:   75 70                   jne    1000820 <main+0x180>
 10007b0:   b8 78 05 00 11          mov    $0x11000578,%eax
 10007b5:   89 c0                   mov    %eax,%eax
 10007b7:   be 0a 00 00 00          mov    $0xa,%esi
 10007bc:   89 c7                   mov    %eax,%edi
 10007be:   66 90                   xchg   %ax,%ax
 10007c0:   b8 00 00 00 00          mov    $0x0,%eax
 10007c5:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
 10007cc:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 10007d3:   0f 1f 84 00 00 00 00 
 10007da:   00 
 10007db:   e8 20 f9 ff ff          call   1000100 <.plt+0x80>
 10007e0:   bf 00 00 00 00          mov    $0x0,%edi
 10007e5:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
 10007ec:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 10007f3:   0f 1f 84 00 00 00 00 
 10007fa:   00 
 10007fb:   e8 c0 fb ff ff          call   10003c0 <.plt+0x340>
 1000800:   eb 5e                   jmp    1000860 <main+0x1c0>
 1000802:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 1000809:   0f 1f 84 00 00 00 00 
 1000810:   00 
 1000811:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 1000818:   0f 1f 84 00 00 00 00 
 100081f:   00 
 1000820:   b8 b0 05 00 11          mov    $0x110005b0,%eax
 1000825:   89 c0                   mov    %eax,%eax
 1000827:   be 0a 00 00 00          mov    $0xa,%esi
 100082c:   89 c7                   mov    %eax,%edi
 100082e:   b8 00 00 00 00          mov    $0x0,%eax
 1000833:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
 100083a:   00 
 100083b:   e8 c0 f8 ff ff          call   1000100 <.plt+0x80>
 1000840:   bf 00 00 00 00          mov    $0x0,%edi
 1000845:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
 100084c:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 1000853:   0f 1f 84 00 00 00 00 
 100085a:   00 
 100085b:   e8 60 fb ff ff          call   10003c0 <.plt+0x340>
 1000860:   c7 45 fc 20 06 00 01    movl   $0x1000620,-0x4(%rbp)
 1000867:   66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
 100086e:   00 00 00 
 1000871:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 1000878:   0f 1f 84 00 00 00 00 
 100087f:   00 
 1000880:   8b 55 fc                mov    -0x4(%rbp),%edx
 1000883:   b8 00 00 00 00          mov    $0x0,%eax
 1000888:   90                      nop
 1000889:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 1000890:   0f 1f 84 00 00 00 00 
 1000897:   00 
 1000898:   83 e2 e0                and    $0xffffffe0,%edx
 100089b:   4c 01 fa                add    %r15,%rdx
 100089e:   ff d2                   call   *%rdx
 10008a0:   eb de                   jmp    1000880 <main+0x1e0>
 10008a2:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 10008a9:   0f 1f 84 00 00 00 00 
 10008b0:   00 
 10008b1:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 10008b8:   0f 1f 84 00 00 00 00 
 10008bf:   00 
 10008c0:   bf 02 00 00 00          mov    $0x2,%edi
 10008c5:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
 10008cc:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 10008d3:   0f 1f 84 00 00 00 00 
 10008da:   00 
 10008db:   e8 e0 f9 ff ff          call   10002c0 <.plt+0x240>
 10008e0:   b8 f0 05 00 11          mov    $0x110005f0,%eax
 10008e5:   89 c2                   mov    %eax,%edx
 10008e7:   8b 45 f8                mov    -0x8(%rbp),%eax
 10008ea:   89 c6                   mov    %eax,%esi
 10008ec:   89 d7                   mov    %edx,%edi
 10008ee:   b8 00 00 00 00          mov    $0x0,%eax
 10008f3:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
 10008fa:   00 
 10008fb:   e8 00 f8 ff ff          call   1000100 <.plt+0x80>
 1000900:   8b 05 fa f6 01 10       mov    0x1001f6fa(%rip),%eax        # 11020000 <stdout@GLIBC_2.2.5>
 1000906:   89 c7                   mov    %eax,%edi
 1000908:   0f 1f 40 00             nopl   0x0(%rax)
 100090c:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 1000913:   0f 1f 84 00 00 00 00 
 100091a:   00 
 100091b:   e8 a0 fa ff ff          call   10003c0 <.plt+0x340>
 1000920:   8b 45 f8                mov    -0x8(%rbp),%eax
 1000923:   be 0c 00 00 00          mov    $0xc,%esi
 1000928:   89 c7                   mov    %eax,%edi
 100092a:   66 90                   xchg   %ax,%ax
 100092c:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 1000933:   0f 1f 84 00 00 00 00 
 100093a:   00 
 100093b:   e8 c0 f9 ff ff          call   1000300 <.plt+0x280>
 1000940:   bf 01 00 00 00          mov    $0x1,%edi
 1000945:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
 100094c:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 1000953:   0f 1f 84 00 00 00 00 
 100095a:   00 
 100095b:   e8 60 f9 ff ff          call   10002c0 <.plt+0x240>
 1000960:   eb be                   jmp    1000920 <main+0x280>
 1000962:   90                      nop
rennergade commented 1 year ago

The backtrace of the segfault is:

#0  SignalCatch (sig=11, info=0x7ff43bb34af0, uc=0x7ff43bb349c0) at src/trusted/service_runtime/linux/nacl_signal.c:558
#1  <signal handler called>
#2  0x00005ccffffefe80 in ?? ()
#3  0x00005ccf010008a0 in ?? ()
#4  0x0000000001000640 in ?? ()
#5  0x0000000000000000 in ?? ()
@ [#5:SignalCatch()] p natp

which corresponds to:

 1000880:   8b 55 fc                mov    -0x4(%rbp),%edx
 1000883:   b8 00 00 00 00          mov    $0x0,%eax
 1000888:   90                      nop
 1000889:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 1000890:   0f 1f 84 00 00 00 00 
 1000897:   00 
 1000898:   83 e2 e0                and    $0xffffffe0,%edx
 100089b:   4c 01 fa                add    %r15,%rdx
 100089e:   ff d2                   call   *%rdx
 10008a0:   eb de                   jmp    1000880 <main+0x1e0>
 10008a2:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
 10008a9:   0f 1f 84 00 00 00 00 
rennergade commented 1 year ago

I printed out the register values while going through the signal handle previous to the segfault. Interestingly, rbp is the segfault address (on the stack) in both the registers on entry and exit, and the signal frame. This seems like it has to do with the bug but I need to understand calling conventions here a bit more to understand whats going on.

Registers upon previous signal catch entry - this is an untrusted signal
$3 = {rax = 0x0, rbx = 0x0, rcx = 0x1, rdx = 0x5ccf01000620, rsi = 0x0, rdi = 0x110e077c, rbp = 0x5ccffffefe80, stack_ptr = 0x5ccffffefd48, r8 = 0xffffffff, r9 = 0x0, r10 = 0x0, r11 = 0x5ccf010008a0, r12 = 0x0, r13 = 0x0, r14 = 0x5ccf01000880, r15 = 0x5ccf00000000, prog_ctr = 0x5ccf01000620, flags = 0x202, cs = 0x33, ss = 0x0, ds = 0x0, es = 0x0, fs = 0x0, gs = 0x0, padding = 0x0}

we execute this line:
      new_stack_ptr = regs->stack_ptr - NACL_STACK_RED_ZONE - 8; //the -8 to standardize things between trusted and untrusted

frame registers:
$5 = {rax = 0x0, rcx = 0x1, rdx = 0x5ccf01000620, rbx = 0x0, stack_ptr = 0x5ccffffefd48, rbp = 0x5ccffffefe80, rsi = 0x0, rdi = 0x110e077c, r8 = 0xffffffff, r9 = 0x0, r10 = 0x0, r11 = 0x5ccf010008a0, r12 = 0x0, r13 = 0x0, r14 = 0x5ccf01000880, r15 = 0x5ccf00000000, prog_ctr = 0x5ccf01000620, flags = 0x202}

regs on exit (same as on entry, unmodified)
$8 = {rax = 0x0, rbx = 0x0, rcx = 0x1, rdx = 0x5ccf01000620, rsi = 0x0, rdi = 0x110e077c, rbp = 0x5ccffffefe80, stack_ptr = 0x5ccffffefd48, r8 = 0xffffffff, r9 = 0x0, r10 = 0x0, r11 = 0x5ccf010008a0, r12 = 0x0, r13 = 0x0, r14 = 0x5ccf01000880, r15 = 0x5ccf00000000, prog_ctr = 0x5ccf01000620, flags = 0x202, cs = 0x33, ss = 0x0, ds = 0x0, es = 0x0, fs = 0x0, gs = 0x0, padding = 0x0}

natp user on exit
$9 = {rax = 0x0, rbx = 0x0, rcx = 0x0, rdx = 0x0, rbp = 0x5ccffffefe80, rsi = 0x0, rdi = 0xfffefbd0, rsp = 0x5ccffffefbc8, r8 = 0x0, r9 = 0x0, r10 = 0x0, r11 = 0x0, r12 = 0x0, r13 = 0x0, r14 = 0x0, r15 = 0x5ccf00000000, prog_ctr = 0x6eff00020bc0, new_prog_ctr = 0x5ccf01000640, sysret = 0x0, mxcsr = 0x1f80, sys_mxcsr = 0x1fa0, tls_idx = 0x2, fcw = 0x37f, sys_fcw = 0x37f, trusted_stack_ptr = 0x7ff43afaacc0, tls_value1 = 0x100425c0, tls_value2 = 0x0}
rennergade commented 1 year ago

I've found out a few more things.

  1. Importantly, this bug actually happens for indirect AND direct calls
  2. This bug happens when the signal is caught at the first address of the called function. Here the program counter is always 0x01000620 which is here.

0000000001000620 : 1000620: 55 push %rbp

yzhang71 commented 1 year ago

Here is other print out:

Cage 2 pid: 2
Handler function ptr 0x1000640
Successfully changed the signal handler for Signal 10
Killing Cage 2 using PID: 2
****Before adjustment**: regs->stack_ptr = 0x60d0fffefd50
**After adjustment**: new_stack_ptr = 0xfffefcc8**
Return Addr before: frame->return_addr = (nil)
Return Addr after: frame->return_addr = 0x60d000011f40
Signal Handler Test!
****Before adjustment**: regs->stack_ptr = 0x60d0fffefd50
**After adjustment:** new_stack_ptr = 0xfffefcc8**
Return Addr before: frame->return_addr = 0x60d000011f60
Return Addr after: frame->return_addr = 0x60d000011f40
Signal Handler Test!
****Before adjustment**: regs->stack_ptr = 0x60d0fffefd50
**After adjustment**: new_stack_ptr = 0xfffefcc8**
Return Addr before: frame->return_addr = 0x60d000011f60
Return Addr after: frame->return_addr = 0x60d000011f40
Signal Handler Test!
****Before adjustment**: regs->stack_ptr = 0x60d0fffefd48
**After adjustment**: new_stack_ptr = 0xfffefcc0**
Return Addr before: frame->return_addr = 0xffffffff
Return Addr after: frame->return_addr = 0x60d000011f40
Signal Handler Test!

Signal 11 from untrusted code: pc=60d0fffefe80

After the stack_ptr drops, the segfault occurs.

rennergade commented 1 year ago

It seems like the call pushing EIP at the beginning is somehow causing the stack to be off by two bytes on return?

When executing a near call, the processor pushes the value of the EIP register (which contains the offset of the instruction following the CALL instruction) onto the stack (for use later as a return-instruction pointer).

yzhang71 commented 1 year ago

Here is the new test code attached:

include

include

include

include

include <sys/types.h>

include

include

void print_registers() { unsigned long rax, rbx, rcx, rdx, rsp, rip, rbp, rsi, rdi;

asm volatile (
    "mov %%rax, %0\n"
    "mov %%rbx, %1\n"
    "mov %%rcx, %2\n"
    "mov %%rdx, %3\n"
    "mov %%rsp, %4\n"
    "mov %%rsi, %5\n"
    "mov %%rdi, %6\n"
    "call 1f\n"   // call the next instruction
    "1: pop %%rax\n"  // pop the return address (RIP) from the stack
    "mov %%rax, %7"
    : "=m" (rax), "=m" (rbx), "=m" (rcx), "=m" (rdx),
      "=m" (rsp), "=m" (rsi), "=m" (rdi), "=m" (rip)
    :
    : "rax", "rbx", "rcx", "rdx", "rsp", "rsi", "rdi"
);

printf("RAX: %lx\n", rax);
printf("RBX: %lx\n", rbx);
printf("RCX: %lx\n", rcx);
printf("RDX: %lx\n", rdx);
printf("RSP: %lx\n", rsp);
printf("RIP: %lx\n", rip);
printf("RSI: %lx\n", rsi);
printf("RDI: %lx\n", rdi);

}

int cnt = 0; int flag = 0; int somefunc() { if (flag) { printf("I can enter somefunc: %d\n", cnt); cnt += 1; print_registers(); // print values of registers fflush(stdout); } flag = 0; return 0;

}

void handler(int signum) {

 printf("Signal Handler Test!\n");
 fflush(stdout);
 flag = 1;
return;

}

int main(void) { struct sigaction new_action; struct sigaction old_action;

 pid_t pid = fork();

 if (pid == 0) {
           printf("Cage 2 pid: %ld\n", (long)getpid());
           printf("Handler function ptr %p\n", handler);
           fflush(stdout);

           new_action.sa_handler = handler;

           sigaction(SIGUSR2, &new_action, NULL);
           sigaction(SIGUSR2, NULL, &old_action);

           if (new_action.sa_handler == old_action.sa_handler) {
                printf("Successfully changed the signal handler for Signal %d\n", SIGUSR1);
                fflush(NULL);
           } else {
                printf("Fail to correctly change the signal handler for Signal %d\n", SIGUSR1);
                fflush(NULL);
           }

      int (*f)() = &somefunc;
      while (1) {
           sleep(1);
           f();
      }
 } else {
           // Cage 1
           sleep(2);
           printf("Killing Cage 2 using PID: %ld\n", (long)pid);
           fflush(stdout);

           while (1) {
                kill(pid, SIGUSR2);
                sleep(1);
           }
 }

}

One thing that I want to point out is, if we add sleep(1) before calling f(), the program runs perfectly fine.

yzhang71 commented 1 year ago

Here is the source code to print out current stack:

#include <errno.h>
#include <string.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h> 
#include <stdint.h>

int cnt = 0;
int flag = 0;
int somefunc()
{
    uintptr_t stack_var;  // Local variable to get an address on the stack
    uintptr_t* stack_ptr = &stack_var;
    size_t depth = 100;  // Number of stack entries to print
    size_t i;
    if (flag) {
          printf("I can enter somefunc: %d\n", cnt);
          cnt += 1;
          // print_registers(); // print values of registers

          printf("Stack contents:\n");
          for (i = 0; i < depth; ++i) {
               printf("[%lu] Address: %p, Value: 0x%lx\n", i, stack_ptr, *stack_ptr);
               stack_ptr--;  // Decrement to move to the next stack entry (for architectures where the stack grows downwards)
          }
          fflush(stdout);
    }
    flag = 0;
    return 0;

}

void handler(int signum)
{

     printf("Signal Handler Test!\n");
     fflush(stdout);
     flag = 1;
    return;
}

int main(void)
{
     // struct sigaction new_action = {0};
     // struct sigaction old_action = {0};

     struct sigaction new_action;
     struct sigaction old_action;

     pid_t pid = fork();

     if (pid == 0) {
               printf("Cage 2 pid: %ld\n", (long)getpid());
               printf("Handler function ptr %p\n", handler);
               fflush(stdout);

               new_action.sa_handler = handler;

               sigaction(SIGUSR2, &new_action, NULL);
               sigaction(SIGUSR2, NULL, &old_action);

               if (new_action.sa_handler == old_action.sa_handler) {
                    printf("Successfully changed the signal handler for Signal %d\n", SIGUSR1);
                    fflush(NULL);
               } else {
                    printf("Fail to correctly change the signal handler for Signal %d\n", SIGUSR1);
                    fflush(NULL);
               }

          int (*f)() = &somefunc;
          while (1) {
               // sleep(1);
               f();
          }
     } else {
               // Cage 1
               sleep(2);
               printf("Killing Cage 2 using PID: %ld\n", (long)pid);
               fflush(stdout);

               while (1) {
                    kill(pid, SIGUSR2);
                    sleep(1);
               }
     }
}
rennergade commented 1 year ago

Wanted to write up a bit summarizing the current state of this issue and possibly get some feedback.

Bug Segfault when catching signal and installing handler directly after call instruction (both direct and indirect). This segfault is triggered upon the return from the called function.

Order

  1. call to function "somefunc"
  2. catch signal with handler ("handlefunc") when PC is set to function ptr of somefunc (directly after call), install handler
  3. return to untrusted code and run handlefunc
  4. return to somefunc, attempt to return to normal return address
  5. Segfault

Background In case anyone needs a refresher on how this is all handled in x86 assembly, I think this writeup is pretty straightforward.

Essentially, the call instruction is pushing the next address onto the stack as a return address, then setting the program counter to the called function. Then the called function executes its "prologue" which pushes the base pointer onto the stack (to be restored later) and then sets the basepointer to the stack pointer to begin the function frame.

Problem 1: Comparing to Native I think the next thing we need to achieve is get a view of the register/stack state for a native example, but since this edge case is pretty specific I'm trying to figure out how to replicate it.

I think we may be able to do this:

Hopefully that works? If anyone else has any ideas that would be great

Potential Problem 2: Identifying Edge Case in NaCl I think regardless of the solution we come up with here, we'll need the NaCl signal mechanism to identify that we're executing this edge case. From there we can either 1) use the same sigtrap program counter advance we used to go to the next 32 byte boundary 2) Actually fix the edge case here. (2 is probably the better option as I'm not sure 1 will leave us enough 32 byte alignment options for signal returns)

All functions will start with a 32-byte aligned address and with the first two instructions being the prologue:

push   %rbp
mov    %rsp,%rbp

Can we reliably check if the pc is 32-byte aligned and those are the next two instructions to identify this case? I feel like it may be possible that this could be true at other places in the program? If so, is that a problem?

We're going to keep experimenting here, but @moyix @jesings @JustinCappos would love any input you guys might have

rennergade commented 1 year ago

So I've identified the problem and believe I have a potential solution though some of the logistics still need to be figured out.

When setting up the signal handler. We allocate an exception frame on the stack, so do so we need to calculate an appropriate new stack pointer address. In practice it looks like this:

new_stack_ptr = regs->stack_ptr - NACL_STACK_RED_ZONE - 8; //the -8 to standardize things between trusted and untrusted
  new_stack_ptr -= sizeof(struct NaClExceptionFrame) - NACL_STACK_PAD_BELOW_ALIGN;
  **new_stack_ptr = new_stack_ptr & ~NACL_STACK_ALIGN_MASK;**
  new_stack_ptr -= NACL_STACK_ARGS_SIZE;
  new_stack_ptr -= NACL_STACK_PAD_BELOW_ALIGN;

The NACL_STACK_ALIGN_MASK is used to ensure 16 byte alignment for the stack, which is necessary for SSE instructions to work correctly.

Until we encountered this bug/segfault we didn't come across any situations where the stack was not 16byte aligned while handling (or if we did they didn't cause an error) however here we push the return address onto the stack right before the signal is caught. This means 1. the stack is not 16-byte aligned (we just pushed 8 bytes), and 2. that address is necessary because it holds the address the PC needs to go to after the function call.

This mask operation is causing the program later to attempt to jmp to an address on the stack that where rsp is off by 8 bytes in the function @jesings created "NaClTrampolineRegRestore" (code below). The lea 136... insn is what restores the stack ptr which is off by the amount RED_ZONE + PAD_ALIGN (which is 136).

It's trivial to store the offset of the align mask to use to later restore correctly but actually adding that to the stack ptr at the end is difficult since all the registers are in use. To fix this we'll need to find a way to manipulate the amount of bytes that lea takes to take in account for the 8byte offset.

It's also possible (and probably likely, though we haven't seen it in practice ) that this bug could be extended to situations with a 4-byte offset. I'm still trying to understand the x86 conventions but seemingly it also could extend to all 1-15 byte offsets...

NaClTrampolineRegRestore:
# restore sigprocmask using custom syscall
# we must return to a 32 byte address,so pad with nops
.nops 23
    lea 130(%rip), %rax
    call *%rax # 128 + 2 where instruction is 2 bytes for this indirect call, this points to the address of the NaClSysSigmaskSigreturn trampoline entry
# we do this as an indirect call so that the call being at the end of a 32 byte section is easier
.align 32
    addl $64, %esp /* skip unused fields */
    addq %r15, %rsp
    popq %rax
    popq %rcx
    popq %rdx
    popq %rbx
    popq %rsi #we avoid restoring rsp in this manner, so pop to a scratch register to be clobbered
    popq %rsi #use a scratch register for rbp too
    movl %esi, %ebp
    addq %r15, %rbp
    popq %rsi
    popq %rdi
    popq %r8
    popq %r9
    popq %r10
    popq %r11
    popq %r12
    popq %r13
.align 32
    addl $16, %esp # skip restoring r14, r15
    addq %r15, %rsp
    popq %r14
    addl $24, %esp #skip flags for now, other unused fields
    addq %r15, %rsp
.align 32
#    andl $0xffffffe0, %r14d
    movl %r14d, %r14d
    addq %r15, %r14
    andl $0x7ff, (%rsp) #this is cleared by the system anyway, but can't hurt
    popfq #now restore flags from the duplicate location
    leaq 136(%rsp), %rsp
    jmp *%r14
rennergade commented 11 months ago

So the above problem has been fixed. Sadly I've re-encounted the bug in hashbrown::iter() that is seemingly hard to identify. It also only happens sometimes depending on compilation it seems.

I tried to figure out how to actually determine state when the bug hits for a while since its a huge application and I can't recreate the bug in a debugger. I finally figured out how to enable coredumps in docker and was able to get a coredump and print info in gdb, I've put the threaded back trace and register state into this document: https://docs.google.com/document/d/1Z7GrSlxQlp6Vr6iYTCKtZR4mqFUFsaKtsC1Thj1I6Rk/edit

rennergade commented 8 months ago

This was finally implemented with the mega-merge. Thank you to @jesings @AlpacaMax and all that helped here!