x86: Shadow stacks compatibility for interceptor

yjugl commented 7 months ago

This patch makes interceptor code compatible with Intel CET shadow stacks.

Make the on_leave_trampoline a legit address to return to from a CET-enabled intercepted function's point of view.
Ensure a proper call/ret discipline to not interfere with shadow stacks.

On Windows, this means that it will now be possible to intercept functions that live in CETCOMPAT modules when the user shadow stack mitigation is active (though not in strict mode).

yjugl commented 7 months ago

I have tests failing locally and I'm taking a look. It seems that INCSSPQ yields an illegal instruction if we're on a CPU that supports shadow stacks but we are not currently using shadow stacks, so I would have to check IA32_U_CET.SH_STK_EN before using that instruction. Edit: Oh, this code path is broken anyway. I'm not popping on_leave_trampoline but the current return address. This needs a rewrite.

yjugl commented 7 months ago

The current patch works on my machine (which has CET shadow stacks support) and still lets me intercept functions from CETCOMPAT modules in processes that use the user shadow stack mitigation on Windows. It is now ready for review.

yjugl commented 5 months ago

Yay, this is awesome! 🤩 Thanks a lot for doing this.

Since it took me so long to review due to a rabbit-hole that kept me busy for way too long, I went ahead and made the final tweaks needed to merge this (only a few minor tweaks). While doing that I noticed the CI acting up, so I'm going to get to the bottom of that and remove those debug commits before merging.

Hey, you're welcome! Thanks for reviewing and integrating the patch. Let me know if some points are still unclear. Actually there are two points I would like to emphasize here.

First, I took a somewhat arbitrary design decision. The question is essentially, do we want all users to run the same code always, although that includes instructions that are useless if the mitigation is not active? There are essentially three possibilities here:

(a) ensure proper call/ret discipline for all x86 users always ; (b) ensure proper call/ret discipline for all x86 users for which we detect that the CPU supports shadow stacks, even if the current process is not making active use of the mitigation ; (c) ensure proper call/ret discipline only if the current process makes active use of the mitigation.

I would say (a) is the best long-term solution from a maintenance and stability point of view because it guarantees that the code that users run takes exactly the same code paths that were run during tests no matter what. (c) would be the exact opposite, only running the extra instructions when necessary, at the cost that you may discover subtle bugs at some point because the tests ran without shadow stack support and the user runs Frida in a process that has them, or vice versa. (c) also requires support from the operating system to know if the mitigation is currently active, so it is a bit more work on the portability and maintenance side.

The current patch implements (b) which is some kind of a compromise based on the idea that if your CPU is recent enough to support shadow stacks, it's probably also fast enough to run the extra instructions without any significant performance cost and as time passes and people change their CPUs, (b) gets closer to (a) anyway. I would probably have gone for (a) if I owned the library, because I don't think these extra instructions will actually lead to a significant performance regression for anyone, but I went for (b) because I can't guarantee that. If you think (a) or (c) matches better with the goals for Frida, we could adapt the patch.

Second, we're replacing a ret with a lea rsp, [rsp+8]; jmp [rsp-8] in gum_emit_epilog under GUM_POINT_ENTER. I think that should be fine, but if you encounter subtle issues with some tools or processes, that would be something to double check. For example it would not be fine to do that in kernel mode, where an interrupt between the two instructions could end up replacing the value at rsp-8 without restoring it. It may be the case that some userland tools like profilers would also assume that the stack in not used at negative offsets from rsp; or that some programs would rely on that assumption for themselves when they suspend their own threads; I'm not 100% sure. We may want to use a pop rax; jmp rax instead but I wasn't sure if any register could be used 100% reliably here either, as we could be jumping to a function that has a weird calling convention. So I didn't really find a more satisfying solution.

oleavr commented 5 months ago

Hey, you're welcome! Thanks for reviewing and integrating the patch.

:raised_hands:

Let me know if some points are still unclear. Actually there are two points I would like to emphasize here.

First, I took a somewhat arbitrary design decision. The question is essentially, do we want all users to run the same code always, although that includes instructions that are useless if the mitigation is not active? There are essentially three possibilities here: (...)

I totally agree that a) is the ideal from a maintenance and stability point of view. But I think we should strive to reduce our overhead over time, so we can minimize our impact on timings, to be able to observe a system without changing its behavior more than necessary. So I think I would lean towards c). Given that Interceptor is used for a lot of things and these trampolines are common to all use-cases, it seems worth it to trade some simplicity for speed.

Second, we're replacing a ret with a lea rsp, [rsp+8]; jmp [rsp-8] in gum_emit_epilog under GUM_POINT_ENTER. I think that should be fine, but if you encounter subtle issues with some tools or processes, that would be something to double check. For example it would not be fine to do that in kernel mode, where an interrupt between the two instructions could end up replacing the value at rsp-8 without restoring it. It may be the case that some userland tools like profilers would also assume that the stack in not used at negative offsets from rsp; or that some programs would rely on that assumption for themselves when they suspend their own threads; I'm not 100% sure. We may want to use a pop rax; jmp rax instead but I wasn't sure if any register could be used 100% reliably here either, as we could be jumping to a function that has a weird calling convention. So I didn't really find a more satisfying solution.

Good point! I think this is fine for now. We currently ensure that we don't clobber any registers or CPU flags on x86, and I think it is important to retain this guarantee as Interceptor does indeed support instruction-level probes as well as targets with custom calling conventions.

yjugl commented 5 months ago

I totally agree that a) is the ideal from a maintenance and stability point of view. But I think we should strive to reduce our overhead over time, so we can minimize our impact on timings, to be able to observe a system without changing its behavior more than necessary. So I think I would lean towards c). Given that Interceptor is used for a lot of things and these trampolines are common to all use-cases, it seems worth it to trade some simplicity for speed.

In that case, let me expand a bit on (c). Another issue with (c) is the following: if a process is currently not using the mitigation, could the mitigation become active later in that same process? Because if that is possible, then (c) is dangerous: we could be generating code now that would not be correct anymore at a later point in the process.

My testing so far suggests that on Windows at least, either a process has the mitigation from the start, or it will never have it. That can be specified at process creation using UpdateProcThreadAttribute with PROCESS_CREATION_MITIGATION_POLICY2_CET_USER_SHADOW_STACKS_*. It does not seem possible to opt-in for the mitigation during the life of the process, which could use SetProcessMitigationPolicy with ProcessUserShadowStackPolicy but that gets rejected. It seems only GetProcessMitigationPolicy would be valid for that policy. If that is correct, then (c) could make sense on Windows. I don't know if that generalizes to other platforms however.

oleavr commented 5 months ago

In that case, let me expand a bit on (c). Another issue with (c) is the following: if a process is currently not using the mitigation, could the mitigation become active later in that same process?

Good point! It seems unlikely, but I don't know either.

If that is correct, then (c) could make sense on Windows. I don't know if that generalizes to other platforms however.

Cool! I suppose we could do that for Windows only initially, and expand on it later as we learn more about the other platforms' internals.

frida / frida-gum

x86: Shadow stacks compatibility for interceptor #791