canonical / tdx

Intel confidential computing - TDX
GNU General Public License v3.0

VMs stop working after 2 weeks #149

Closed: diegoara96 closed this issue 1 week ago

diegoara96 commented 2 months ago

All of our VMs that were more than two weeks old stopped working at the same time. We can still create new VMs, and those work, but the old ones do not work at all. The log output we get is as follows:

BdsDxe: starting Boot0003 "Ubuntu" from HD(15,GPT,EE4033E9-4675-41F5-A94B-87EE37FDFD03,0x2800,0x35000)/\EFI\ubuntu\shimx64.efi
Loading Linux 6.8.0-35-generic ...
Loading initial ramdisk ...
EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path

I don't know if there are any more log files that I can give you.

dmesg output

[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0-1004-intel root=/dev/mapper/ubuntu--vg-ubuntu--lv ro kvm_intel.tdx=1 nohibernate
[    0.862225] Kernel command line: BOOT_IMAGE=/vmlinuz-6.8.0-1004-intel root=/dev/mapper/ubuntu--vg-ubuntu--lv ro kvm_intel.tdx=1 nohibernate
[    1.759810] virt/tdx: BIOS enabled: private KeyID range [32, 64)
[    1.759812] virt/tdx: Disable ACPI S3. Turn off TDX in the BIOS to use ACPI S3.
[    7.416891] virt/tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 5, build_date 20240129, build_num 698
[    7.416895] virt/tdx: CMR: [0x100000, 0x77800000)
[    7.416897] virt/tdx: CMR: [0x100000000, 0x3ffe000000)
[    7.416898] virt/tdx: CMR: [0x4080000000, 0x8000000000)
[    8.749485] virt/tdx: 2084844 KB allocated for PAMT
[    8.749490] virt/tdx: module initialized
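
For readers who want to sanity-check the memory layout in the dmesg output above, here is a minimal sketch that parses the three `CMR` lines and sums their sizes. It assumes the bracket notation `[start, end)` is a half-open address range, which is how the lines read. The helper names `cmr_ranges` and `total_bytes` are hypothetical and exist only for this illustration.

```python
import re

# CMR lines copied verbatim from the dmesg output above.
DMESG = """\
[    7.416895] virt/tdx: CMR: [0x100000, 0x77800000)
[    7.416897] virt/tdx: CMR: [0x100000000, 0x3ffe000000)
[    7.416898] virt/tdx: CMR: [0x4080000000, 0x8000000000)
"""

# Matches "CMR: [0x<start>, 0x<end>)" and captures both addresses.
CMR_RE = re.compile(r"CMR: \[(0x[0-9a-f]+), (0x[0-9a-f]+)\)")


def cmr_ranges(text):
    """Return a list of (start, end) integer pairs, one per CMR line."""
    return [(int(start, 16), int(end, 16)) for start, end in CMR_RE.findall(text)]


def total_bytes(ranges):
    """Sum the sizes of half-open [start, end) ranges (assumed semantics)."""
    return sum(end - start for start, end in ranges)


ranges = cmr_ranges(DMESG)
total = total_bytes(ranges)
print(f"{len(ranges)} CMRs, {total / 2**30:.1f} GiB in total")
```

This only restates what the log already reports and does not point to a cause for the failure.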

syncronize-issues-to-jira[bot] commented 2 months ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/PEK-769.

This message was autogenerated

diegoara96 commented 2 months ago

journalctl output

jun 24 11:07:17 tee-fhaas kernel: ------------[ cut here ]------------
jun 24 11:07:17 tee-fhaas kernel: WARNING: CPU: 76 PID: 2842 at arch/x86/kvm/vmx/tdx.c:275 __tdx_reclaim_page+0xac/0xe0 [kvm_intel]
jun 24 11:07:17 tee-fhaas kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrac>
jun 24 11:07:17 tee-fhaas kernel: CPU: 76 PID: 2842 Comm: vhost-2807 Tainted: G        W          6.8.0-1004-intel #11-Ubuntu
jun 24 11:07:17 tee-fhaas kernel: Hardware name: Dell Inc. PowerEdge R760/09XV41, BIOS 2.2.7 05/13/2024
jun 24 11:07:17 tee-fhaas kernel: RIP: 0010:__tdx_reclaim_page+0xac/0xe0 [kvm_intel]
jun 24 11:07:17 tee-fhaas kernel: Code: 48 8b 55 d0 65 48 2b 14 25 28 00 00 00 75 3c 48 83 c4 70 5b 41 5c 41 5d 41 5e 41 5f 5d 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc <0f> >
jun 24 11:07:17 tee-fhaas kernel: RSP: 0018:ff408c2824d7f660 EFLAGS: 00010282
jun 24 11:07:17 tee-fhaas kernel: RAX: c000030000000001 RBX: 0000000000000001 RCX: 0000000000000000
jun 24 11:07:17 tee-fhaas kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
jun 24 11:07:17 tee-fhaas kernel: RBP: ff408c2824d7f6f8 R08: 0000000000000000 R09: 0000000000000000
jun 24 11:07:17 tee-fhaas kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000289e01000
jun 24 11:07:17 tee-fhaas kernel: R13: 8000020000000001 R14: 8000020000000080 R15: ff408c2824d7f660
jun 24 11:07:17 tee-fhaas kernel: FS:  0000000000000000(0000) GS:ff1e960b7f500000(0000) knlGS:0000000000000000
jun 24 11:07:17 tee-fhaas kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jun 24 11:07:17 tee-fhaas kernel: CR2: 0000000000000000 CR3: 0000003b0be3c006 CR4: 0000000000f73ef0
jun 24 11:07:17 tee-fhaas kernel: PKRU: 55555554
jun 24 11:07:17 tee-fhaas kernel: Call Trace:
jun 24 11:07:17 tee-fhaas kernel:  <TASK>
jun 24 11:07:17 tee-fhaas kernel:  ? show_regs+0x6d/0x80
jun 24 11:07:17 tee-fhaas kernel:  ? __warn+0x89/0x160
jun 24 11:07:17 tee-fhaas kernel:  ? __tdx_reclaim_page+0xac/0xe0 [kvm_intel]
jun 24 11:07:17 tee-fhaas kernel:  ? report_bug+0x17e/0x1b0
jun 24 11:07:17 tee-fhaas kernel:  ? handle_bug+0x51/0xa0
jun 24 11:07:17 tee-fhaas kernel:  ? exc_invalid_op+0x18/0x80
jun 24 11:07:17 tee-fhaas kernel:  ? asm_exc_invalid_op+0x1b/0x20
jun 24 11:07:17 tee-fhaas kernel:  ? __tdx_reclaim_page+0xac/0xe0 [kvm_intel]
jun 24 11:07:17 tee-fhaas kernel:  tdx_sept_drop_private_spte+0x26f/0x2f0 [kvm_intel]
jun 24 11:07:17 tee-fhaas kernel:  tdx_sept_remove_private_spte+0x3f/0x50 [kvm_intel]
jun 24 11:07:17 tee-fhaas kernel:  handle_removed_private_spte+0x1b4/0x260 [kvm]
jun 24 11:07:17 tee-fhaas kernel:  handle_changed_spte+0x36c/0x850 [kvm]
jun 24 11:07:17 tee-fhaas kernel:  handle_removed_pt+0x1b1/0x340 [kvm]
jun 24 11:07:17 tee-fhaas kernel:  handle_changed_spte+0x5e2/0x850 [kvm]
jun 24 11:07:17 tee-fhaas kernel:  handle_removed_pt+0x1b1/0x340 [kvm]
jun 24 11:07:17 tee-fhaas kernel:  handle_changed_spte+0x5e2/0x850 [kvm]
jun 24 11:07:17 tee-fhaas kernel:  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
jun 24 11:07:17 tee-fhaas kernel:  tdp_mmu_set_spte+0x111/0x240 [kvm]
jun 24 11:07:17 tee-fhaas kernel:  __tdp_mmu_zap_root+0x1ee/0x210 [kvm]
jun 24 11:07:17 tee-fhaas kernel:  kvm_tdp_mmu_zap_all+0x3e/0x90 [kvm]
jun 24 11:07:17 tee-fhaas kernel:  kvm_arch_flush_shadow_all+0x103/0x110 [kvm]
jun 24 11:07:17 tee-fhaas kernel:  kvm_mmu_notifier_release+0x2f/0x60 [kvm]
jun 24 11:07:17 tee-fhaas kernel:  __mmu_notifier_release+0x7b/0x200
jun 24 11:07:17 tee-fhaas kernel:  ? sched_clock_noinstr+0x9/0x10
jun 24 11:07:17 tee-fhaas kernel:  exit_mmap+0x3a2/0x3e0
jun 24 11:07:17 tee-fhaas kernel:  __mmput+0x41/0x140
jun 24 11:07:17 tee-fhaas kernel:  mmput+0x31/0x40
jun 24 11:07:17 tee-fhaas kernel:  exit_mm+0xbe/0x130
jun 24 11:07:17 tee-fhaas kernel:  do_exit+0x273/0x530
jun 24 11:07:17 tee-fhaas kernel:  vhost_task_fn+0xc6/0xd0
jun 24 11:07:17 tee-fhaas kernel:  ? __pfx_vhost_task_fn+0x10/0x10
jun 24 11:07:17 tee-fhaas kernel:  ret_from_fork+0x44/0x70
jun 24 11:07:17 tee-fhaas kernel:  ? __pfx_vhost_task_fn+0x10/0x10
jun 24 11:07:17 tee-fhaas kernel:  ret_from_fork_asm+0x1b/0x30
jun 24 11:07:17 tee-fhaas kernel: RIP: 0033:0x0
jun 24 11:07:17 tee-fhaas kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
jun 24 11:07:17 tee-fhaas kernel: RSP: 002b:0000000000000000 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
jun 24 11:07:17 tee-fhaas kernel: RAX: 0000000000000000 RBX: 0000000000000023 RCX: 00007bb5a3d24ded
jun 24 11:07:17 tee-fhaas kernel: RDX: 0000000000000000 RSI: 000000000000af01 RDI: 0000000000000023
jun 24 11:07:17 tee-fhaas kernel: RBP: 00007ffeddd48c90 R08: 00007ffeddd48d80 R09: 0000000000000000
jun 24 11:07:17 tee-fhaas kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00007ffeddd48d80
jun 24 11:07:17 tee-fhaas kernel: R13: 000056341da3d0f8 R14: 0000000000000000 R15: 0000563421e8c788
jun 24 11:07:17 tee-fhaas kernel:  </TASK>
jun 24 11:07:17 tee-fhaas kernel: ---[ end trace 0000000000000000 ]---
internal error: QEMU unexpectedly closed the monitor (vm='td_guest-controlkube-cc4bb21a-1e62-492f-b132-de26d869d8cb'): 2024-06-24T qemu-system-x86_64: can't open backing store /var/lib/libvirt/qemu/ram/10-td_guest-controlkube/pc.ram for guest RAM: Permission denied

The last error (can't open backing store) appears because that file does not exist.
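
Since the QEMU message reports "Permission denied" while the file itself is missing, it may help to check the file and its parent directory separately. The following is a rough sketch, not a confirmed diagnosis. The path is taken from the log line above, the `diagnose` helper is hypothetical, and the result reflects the permissions of whichever user runs the script, which may differ from the user QEMU runs as.

```python
import os

# Path copied from the QEMU error message above.
PATH = "/var/lib/libvirt/qemu/ram/10-td_guest-controlkube/pc.ram"


def diagnose(path):
    """Report whether the file exists and whether its parent directory
    exists and is writable/searchable for the current user."""
    parent = os.path.dirname(path) or "."
    return {
        "file_exists": os.path.exists(path),
        "parent_exists": os.path.isdir(parent),
        "parent_writable": os.access(parent, os.W_OK | os.X_OK),
    }


for key, value in diagnose(PATH).items():
    print(f"{key}: {value}")
```

If the parent directory is missing or not writable, a "Permission denied" error for a file that does not exist yet would be consistent with that, but nothing in this thread confirms it.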

diegoara96 commented 2 months ago

Hi @hector-cao, were you able to look into this? Any news?

hector-cao commented 2 months ago

@diegoara96 Hello, we are still working internally to figure out the issue. Please rest assured that we take it seriously. I will keep you posted as soon as we have news.

bktan8 commented 1 month ago

@diegoara96 - can you tell me if you had all of the attestation components (DCAP, ITA) installed as well?

diegoara96 commented 1 month ago

@bktan8 When the problem happened, we were using release 2.0 with the default configuration, and the attestation packages were installed with the setup-attestation-guest.sh script.

diegoara96 commented 1 week ago

We have not experienced this error again, so I think this issue can be closed.

bktan8 commented 1 week ago

We can't reproduce it either. Thanks @diegoara96!