Perform CR4 and CR0 pinning from hypervisors

kees commented 4 years ago

Similar to the recent CR4 and CR0 pinning done in the native_write_cr*() functions, it would be even better to have KVM (and other hypervisors) pin those bits for their guest kernels too.

pdxjohnny commented 4 years ago

Hi. I've been working on this here: https://github.com/torvalds/linux/compare/master...pdxjohnny:hc_harden almost ready to send out for review.

kees commented 4 years ago

Ah! Very cool. I wonder, though, if it would be better to make this entire a host based defense? i.e. do not involve the guest at all and just detect when special bits are set and keep them set? Or maybe, have a default set of bits and let the guest add more? (I'd like this defense to work "out of the box" for all guests without having the guest need to know to make a hypercall to gain the protection.)

pdxjohnny commented 4 years ago

Moving discussion here (CC: @redgecombe @kaccardi)

With regards to the Container as VM usecase. I noticed that ChromeOS does Linux support in a VM with a hardened config. Is this something that would be enabled there if implemented? We're taking this through review internally and was told an "ack from those [chromeos] folks would go a long way"

Ah! Very cool. I wonder, though, if it would be better to make this entire a host based defense? i.e. do not involve the guest at all and just detect when special bits are set and keep them set? Or maybe, have a default set of bits and let the guest add more? (I'd like this defence to work "out of the box" for all guests without having the guest need to know to make a hypercall to gain the protection.)

I like this idea. I'm not sure how we could convince upstream to take that though, so far when we'd thought about that we'd come to the conclusion that we don't know what other kernels people might be running under KVM that would do whacky things like flip those bits intentionally. In which case the KVM people might not be so keen on us breaking functionality for those.

But minimally I'd guess that patch would pretty much be:

int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
{
     unsigned long old_cr4 = kvm_read_cr4(vcpu);
     unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
                                X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;

     if (kvm_valid_cr4(vcpu, cr4))
             return 1;

     if (~cr4 & old_cr4 & (X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP))
             return 1;

    /* ... */

We could make some sort of IOCTL to have QEMU, etc. request that we run in a once-turned-on-never-turn-off mode. The downside to this is that now we have to go touch QMEU, firecracker, etc. Whereas the hypercall approach is pretty self contained within the kernel (aside from the one line QEMU cpuid bit patch). I'm not sure what the most upstreamably friendly approach his here though.

As it stands right now the patch is a hypercall that allows for pinning SMEP, SMAP, UMIP, and WP. The same bits you had already pinned. The initial approach had been to allow pinning of arbitrary bits. That was met with skepticism over the TS and MP bits in CR0. So I think it might be easiest to upstream if we first allow for only the pinning of the bits we already pin in software and then we could add the ability to pin others if we care to do that later in a different patch.

I'm also meeting with skepticism around KEXEC. I think I'll try to debug what's up with the early boot code clearing those bits to see if we can make this an always on option. If it works with KEXEC, then we probably don't need a CONFIG_ for it, since we can argue that it's functionally equivalent to existing pinning and therefore should be on by default.

pdxjohnny commented 4 years ago

Kexec is working!

pdxjohnny commented 4 years ago

Well shit, I spoke too soon. The patch I have to the early boot code works in KVM, but not on my physical host... If I can't get it working by mid next week I'll just keep the Kconfig option and make it !KEXEC for the public RFC.

CR0.WP and CR4.SMEP work, still trying to narrow down what's causing it not to boot when on the physical host (never makes it to first console prints to screen, haven't hooked up serial).

Setting CR4.UMIP or CR4.SMAP causes it not to boot. I think this is because of EFLAGS.AC. Either way it's wednesday and so I'm giving up on Kexec for now :/

Also, I finally ran across the code that's really supposed to be enabling SMAP / UMIP and I see that surprise surprise there is a well thought out existing process for enabling them. It was also pointed out that if we kexec to a new kernel the new kernel might be an older kernel that doesn't know it shouldn't turn off those bits. In which case it would blow up. So the boot time disablable kexec is probably the safest and most usable option long term.

pdxjohnny commented 4 years ago

Just an update. Almost done addressing all RFCv1 comments to send out RFCv2. One of the comments was that we need to ensure SMIs can't modify pinned bits. Which lead to the discovery that AMD and Intel SDMs specify that the CR* fields within SMRAM are readonly. So we shouldn't be able to modify them in the first place. Should be done with the patch to make sure they are readonly soon. It'll be the new first patch in the pv cr pinning patchset.

pdxjohnny commented 4 years ago

Update, PATCH sent: https://lkml.org/lkml/2020/6/17/921

pdxjohnny commented 3 years ago

Just an update on this. Intel has a review process that I'm gated by right now. I have the next version ready but haven't been able to get a review. https://github.com/pdxjohnny/linux/tree/patch_v2_internal_v6

pdxjohnny commented 2 years ago

If someone wants to take this and put a co-authored by on it that works for me. Otherwise I probably need another probably 6 months to get this rebased and out the door, provided there are no major issues with the rebase. I've been working on getting Intel proper CI/CD so that the review process can scale.

bwendling commented 2 years ago

I can give a shot at this.

pdxjohnny commented 2 years ago

@gwelymernans Want to have a meeting sometime I can go over the patchset as it was while ago and the test cases?

bwendling commented 2 years ago

@pdxjohnny That's a good idea. :-) Are you available Wednesday? My schedule's fairly open then or later in the week.

l0kod commented 2 years ago

I want to help too!

pdxjohnny commented 2 years ago

Meeting invite sent for 10-11 AM PDT. Please propose a new time if that doesn't work for either of you. I am free most afternoons. I'll try to have instructions on how to run the tests. If either of you want a tar of the filesystem on the SUT I had I can tar it up, but I'd need you to give me a network location to dump it to. It's an at this point old fedora image with the git repos and userspace dependencies installed.

Meeting Link: https://meet.google.com/gys-epwh-avc

If you both could please read over coverletter and the patchset beforehand that would be great too, as it's dense at some points and we'll get the most out of our meeting it you've at least skimmed it.

l0kod commented 2 years ago

Great! I would prefer to start 2 or 1 hour earlier if it works for you (I'm UTC+1).

pdxjohnny commented 2 years ago

I can start as early as 8 AM PDT

pdxjohnny commented 2 years ago

Or we can do before. 6-7 AM PDT, but I have a hard conflict 7-8 AM PDT every day. I originally made it two hours but if you both read that stuff beforehand we should be able to do it in an hour.

pdxjohnny commented 2 years ago

Also, please make sure you have a machine ready for this. I'm pretty sure anything within the past 10 years will work so long as it has virtualization. Ideally one of you would have an Intel and one of you would have an AMD then we could validate the AMD support and get you both spun up all at once. I think if you have a machine remotely accessible via AMT or BMC you could do it too, you just need enough to help you re-image the OS and interact with the BIOS menu on the host.

Proposed agenda:

Introductions
Q&A
Get existing code at old point in git history running

l0kod commented 2 years ago

Correction: 10am PDT will be perfect for me Wednesday.

rpedgeco commented 2 years ago

Something related came up recently with the kernel IBT (Indirect Branch Tracking) patchset, which is not upstream but looks to be getting close. Kernel IBT enforcement wants to wants to have its CR4 bit pinned, but it caused problems with kexec. See this thread for some more details: https://lore.kernel.org/lkml/eed8902f21ba9e5f93562432f6b5920137860a98.camel@intel.com/

The working solution is to unset CR0.WP before CR4.CET in the old kexec-ing kernel, inlined with no pinning check. I was thinking it could cause further problems for KVM pinning, so just wanted to point it out here. Please feel free to ping me if you have questions.

pdxjohnny commented 2 years ago

kvm/next for target for next patchset
run.sh runs the integration tests
test setup
- Machine in front of you with virtuatialization support
https://web.archive.org/web/20191105205408/http://events19.linuxfoundation.org/wp-content/uploads/2017/12/Improving-KVM-x86-Nested-Virtualization-Liran-Alon-Oracle.pdf
https://github.com/pdxjohnny/kvm-unit-tests

pdxjohnny commented 2 years ago

Sorry guys I got dropped from the call and have to join another call

For the triple fault, it might be related to the setup which I copied from other tests
- I am not sure what the ACPI stuff happens
- https://github.com/l0kod/linux/commit/cb0148133b4e607b0a9872398dbf934730a09497#diff-b9503de1ae72a27304b70f3f192f86c875cfd66e05bc555eb974df55c2efa270R77-R83

graph TD
  run_tests[Run all the tests]

  subgraph L0[L0 - host]
     kvm_unit_tests
     qemu[QEMU]
     kexec_tools[kexec-tools]
  end

  subgraph L1[L2 - guest]
    subgraph L1_fs[L1 - guest - filesystem]
      guest_qemu[QEMU]
      qemu_kexec_tools[kexec-tools]
    end
  end

  subgraph L2[L2 - guest of the guest]
    L2_fs
  end

  run_tests -->|QEMU = L0_qemu| kvm_unit_tests

  qemu -->|copy| guest_qemu
  kexec_tools-->|copy| qemu_kexec_tools

  guest_qemu -->|virtio9p| L2_fs
  qemu_kexec_tools-->|virtio9p| L2_fs

Always ensure we are using QEMU with patches from series
- https://github.com/pdxjohnny/qemu/commits/41356f28cad1f54d84cb81ce8bfdbbfef29ad68e
- https://github.com/pdxjohnny/qemu/commit/b74bacdf244154059e937454365f1dcc2bfc8b97
- https://github.com/pdxjohnny/qemu/commit/41356f28cad1f54d84cb81ce8bfdbbfef29ad68e
- kvm-unit-tests need to be run with the version of QEMU you built from these patches
- https://github.com/pdxjohnny/protecting_cr4_host_setup/blob/92c4238e02bb5fe9f60572137a4ec6a2ddcd9450/test.sh#L62
- Need to copy QEMU into L1 and then L2 in integration tests passes through L1 fs via 9p
Ensure kexec-tools
L2 tests
- https://github.com/pdxjohnny/protecting_cr4_host_setup/blob/master/chroot/home/johnsa1/run.sh
- https://github.com/pdxjohnny/protecting_cr4_host_setup/blob/92c4238e02bb5fe9f60572137a4ec6a2ddcd9450/test.sh#L124-L149

l0kod commented 2 years ago

The conflict that may be the cause of the issue in my initial rebase is in arch/x86/kvm/emulate.c where post_leave_smm() was patched: https://github.com/l0kod/linux/commit/6b49d7b80fb9e64ac3ae43b03e0a5ac9daa90e89#diff-50a0b0edc05bf5218443999c4f0c82b96115995a68dca707080331ff3190b937

isanbard commented 2 years ago

FYI, @l0kod, I had to apply this patch to your tree to get it to compile without this feature.

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 86f66d8a7817..44c183bad720 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -168,7 +168,9 @@ static inline void __init kvm_paravirt_cr_pinning_init(void)
 }

 static inline void kvm_setup_paravirt_cr_pinning(unsigned long cr0_pinned_bits,
-                                                unsigned long cr4_pinned_bits)
+                                                unsigned long cr0_pinned_mask,
+                                                unsigned long cr4_pinned_bits,
+                                                unsigned long cr4_pinned_mask)
 {
 }
 #endif

l0kod commented 2 years ago

That is indeed an issue but it doesn't come from me. :) Actually we didn't notice this because we compile with CONFIG_KVM_GUEST, and you should too. ;)

l0kod commented 1 year ago

We are working on a new set of KVM features to harden the kernel, and we took some of the CR pinning ideas but followed a much simpler path (i.e. without VMM/IOCTL support): https://lore.kernel.org/all/20230505152046.6575-6-mic@digikod.net/ See the cover letter for a detailed explanation of Heki: https://lore.kernel.org/all/20230505152046.6575-1-mic@digikod.net/

@pdxjohnny, your patch series could still be useful in the future, let's see how it goes. :wink:

Feel free to join the discussion on the mailing list, there is a lot to do and we're looking for contributors!

pdxjohnny commented 1 year ago

Woohoo!

We had avoided the hypercall route in order to support live migration

KSPP / linux

Perform CR4 and CR0 pinning from hypervisors #19