AMDESE / AMDSEV

AMD Secure Encrypted Virtualization

SEV-SNP VM randomly crashes #245

Open danko-miladinovic opened 2 days ago

danko-miladinovic commented 2 days ago

Hi everyone,

I booted a VM using SEV-SNP, Linux kernel 6.11 and QEMU 9.1.50. The QEMU command is:

$QEMU_BIN \
  -enable-kvm \
  -machine q35 \
  -cpu EPYC-v4 \
  -smp 4,maxcpus=16 \
  -m 25G,slots=5,maxmem=30G \
  -netdev user,id=vmnic \
  -device virtio-net-pci,disable-legacy=on,iommu_platform=true,netdev=vmnic,romfile= \
  -device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=13 \
  -machine confidential-guest-support=sev0,memory-backend=ram1 \
  -bios ./OVMF.fd \
  -object memory-backend-memfd,id=ram1,size=25G,share=true,prealloc=false \
  -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,kernel-hashes=on \
  -kernel ./bzImage \
  -append "earlyprintk=serial console=ttyS0 rootfstype=ramfs" \
  -initrd ./rootfs.cpio.gz \
  -nographic \
  -monitor pty

Most of the time the VM boots fine, but in some cases I get an error like this one:

error: kvm run failed Invalid argument
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00800f12
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00009b00
SS =0000 00000000 0000ffff 00009300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT= 00000000 0000ffff
IDT= 00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=c5 5a 08 2d 00 00 00 00 00 00 00 00 00 00 00 00 56 54 46 00 <0f> 20 c0 a8 01 74 05 e9 2c ff ff ff

The guest kernel is 6.6. Does anybody know what the potential issue might be?

Thank you.

Kind regards, Danko

tlendacky commented 2 days ago

Do you have any other output just before the VM fails? Or is it immediate that it fails?

danko-miladinovic commented 1 day ago

Thank you for the fast response. I tested it again and saw a kvm run error like the one above, but this time it happened while I was installing a Python package.

This is the error this time:

error: kvm run failed Invalid argument
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00800f12
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000b004 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=1
ES =0000 00000000 0000ffff 00009300
CS =f000 00800000 0000ffff 00009b00
SS =0000 00000000 0000ffff 00009300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT= 00000000 0000ffff
IDT= 00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

The OVMF that is being used is AmdSev from the edk2-stable202408 tag.

danko-miladinovic commented 1 day ago

In the dmesg log on the host I see:

[255267.410514] SEV-SNP: RMPUPDATE failed for PFN 56700, pg_level: 1, ret: 2
[255267.410540] SEV-SNP: PFN 0x56700 unassigned, dumping non-zero entries in 2M PFN region: [0x56600 - 0x56800]
[255267.410543] SEV-SNP: PFN: 0x56600, [0x80000000000007fd - 0x0000000000000000]
[255267.410545] SEV-SNP: PFN: 0x56601, [0x8000000000000005 - 0x0000000000000000]
[255267.410547] SEV-SNP: PFN: 0x56602, [0x8000000000000005 - 0x0000000000000000]
[255267.410548] SEV-SNP: PFN: 0x56603, [0x8000000000000005 - 0x0000000000000000]
[255267.410550] SEV-SNP: PFN: 0x56604, [0x8000000000000005 - 0x0000000000000000]
[255267.410551] SEV-SNP: PFN: 0x56605, [0x8000000000000005 - 0x0000000000000000]
[255267.410553] SEV-SNP: PFN: 0x56606, [0x8000000000000005 - 0x0000000000000000]
[255267.410554] SEV-SNP: PFN: 0x56607, [0x8000000000000005 - 0x0000000000000000]
[255267.410555] SEV-SNP: PFN: 0x56608, [0x8000000000000005 - 0x0000000000000000]
[255267.410557] SEV-SNP: PFN: 0x56609, [0x8000000000000005 - 0x0000000000000000]
[255267.410558] SEV-SNP: PFN: 0x5660a, [0x8000000000000005 - 0x0000000000000000]
[255267.410560] SEV-SNP: PFN: 0x5660b, [0x8000000000000005 - 0x0000000000000000]
[255267.410561] SEV-SNP: PFN: 0x5660c, [0x8000000000000005 - 0x0000000000000000]
[255267.410562] SEV-SNP: PFN: 0x5660d, [0x8000000000000005 - 0x0000000000000000]
[255267.410564] SEV-SNP: PFN: 0x5660e, [0x8000000000000005 - 0x0000000000000000]
[255267.410565] SEV-SNP: PFN: 0x5660f, [0x8000000000000005 - 0x0000000000000000]
[255267.410567] SEV-SNP: PFN: 0x56610, [0x8000000000000005 - 0x0000000000000000]
[255267.410568] SEV-SNP: PFN: 0x56611, [0x8000000000000005 - 0x0000000000000000]
[255267.410569] SEV-SNP: PFN: 0x56612, [0x8000000000000005 - 0x0000000000000000]
[255267.410571] SEV-SNP: PFN: 0x56613, [0x8000000000000005 - 0x0000000000000000]

....

[255903.430340] SEV-SNP: PFN: 0x567e9, [0x0000000000000004 - 0x0000000000000000]
[255903.430342] SEV-SNP: PFN: 0x567ea, [0x0000000000000004 - 0x0000000000000000]
[255903.430344] SEV-SNP: PFN: 0x567eb, [0x0000000000000004 - 0x0000000000000000]
[255903.430346] SEV-SNP: PFN: 0x567ec, [0x0000000000000004 - 0x0000000000000000]
[255903.430348] SEV-SNP: PFN: 0x567ed, [0x0000000000000004 - 0x0000000000000000]
[255903.430350] SEV-SNP: PFN: 0x567ee, [0x0000000000000004 - 0x0000000000000000]
[255903.430352] SEV-SNP: PFN: 0x567ef, [0x0000000000000004 - 0x0000000000000000]
[255903.430354] SEV-SNP: PFN: 0x567f0, [0x0000000000000004 - 0x0000000000000000]
[255903.430356] SEV-SNP: PFN: 0x567f1, [0x0000000000000004 - 0x0000000000000000]
[255903.430358] SEV-SNP: PFN: 0x567f2, [0x0000000000000004 - 0x0000000000000000]
[255903.430360] SEV-SNP: PFN: 0x567f3, [0x0000000000000004 - 0x0000000000000000]
[255903.430362] SEV-SNP: PFN: 0x567f4, [0x0000000000000004 - 0x0000000000000000]
[255903.430364] SEV-SNP: PFN: 0x567f5, [0x0000000000000004 - 0x0000000000000000]
[255903.430366] SEV-SNP: PFN: 0x567f6, [0x0000000000000004 - 0x0000000000000000]
[255903.430368] SEV-SNP: PFN: 0x567f7, [0x0000000000000004 - 0x0000000000000000]
[255903.430370] SEV-SNP: PFN: 0x567f8, [0x0000000000000004 - 0x0000000000000000]
[255903.430372] SEV-SNP: PFN: 0x567f9, [0x0000000000000004 - 0x0000000000000000]
[255903.430374] SEV-SNP: PFN: 0x567fa, [0x0000000000000004 - 0x0000000000000000]
[255903.430376] SEV-SNP: PFN: 0x567fb, [0x0000000000000004 - 0x0000000000000000]
[255903.430378] SEV-SNP: PFN: 0x567fc, [0x0000000000000004 - 0x0000000000000000]
[255903.430380] SEV-SNP: PFN: 0x567fd, [0x0000000000000004 - 0x0000000000000000]
[255903.430382] SEV-SNP: PFN: 0x567fe, [0x0000000000000004 - 0x0000000000000000]
[255903.430384] SEV-SNP: PFN: 0x567ff, [0x0000000000000004 - 0x0000000000000000]
[255903.430433]  sev_gmem_prepare+0x11e/0x2b0 [kvm_amd]
[255903.430996] kvm_amd: SEV: Failed to update RMP entry: GFN 899f PFN 56701 level 1 error -14

and

[252872.779419] Call Trace:
[252872.779421]  <TASK>
[252872.779426]  show_stack+0x49/0x60
[252872.779432]  dump_stack_lvl+0x5f/0x90
[252872.779437]  dump_stack+0x10/0x18
[252872.779440]  rmpupdate.cold+0x37/0x3c
[252872.779444]  rmp_make_private+0x3f/0x70
[252872.779451]  sev_gmem_prepare+0x11e/0x2b0 [kvm_amd]
[252872.779462]  kvm_arch_gmem_prepare+0x17/0x30 [kvm]
[252872.779517]  kvm_gmem_prepare_folio+0x18a/0x200 [kvm]
[252872.779547]  ? __kvm_gmem_get_pfn+0xcf/0x170 [kvm]
[252872.779575]  kvm_gmem_get_pfn+0xde/0x110 [kvm]
[252872.779602]  __kvm_faultin_pfn+0x156/0x4b0 [kvm]
[252872.779641]  kvm_faultin_pfn+0x110/0x2b0 [kvm]
[252872.779672]  kvm_tdp_page_fault+0x8e/0xe0 [kvm]
[252872.779702]  kvm_mmu_do_page_fault+0x24f/0x280 [kvm]
[252872.779733]  kvm_mmu_page_fault+0x8d/0x240 [kvm]
[252872.779761]  npf_interception+0xb9/0x190 [kvm_amd]
[252872.779767]  ? __pfx_npf_interception+0x10/0x10 [kvm_amd]
[252872.779773]  svm_invoke_exit_handler+0x15f/0x1c0 [kvm_amd]
[252872.779779]  svm_handle_exit+0x17d/0x220 [kvm_amd]
[252872.779785]  ? svm_vcpu_run+0x3db/0x860 [kvm_amd]
[252872.779790]  vcpu_enter_guest+0x7ba/0x1000 [kvm]
[252872.779827]  ? apic_has_pending_timer+0x3e/0x80 [kvm]
[252872.779864]  vcpu_run+0x49/0x2a0 [kvm]
[252872.779896]  kvm_arch_vcpu_ioctl_run+0x121/0x4a0 [kvm]
[252872.779924]  ? do_madvise+0x218/0x470
[252872.779929]  kvm_vcpu_ioctl+0x249/0x890 [kvm]
[252872.779957]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.779961]  ? syscall_exit_to_user_mode+0x4e/0x250
[252872.779965]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.779968]  __x64_sys_ioctl+0xa3/0xf0
[252872.779973]  x64_sys_call+0x121b/0x22b0
[252872.779976]  do_syscall_64+0x7e/0x170
[252872.779981]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.779984]  ? vfs_fallocate+0x140/0x390
[252872.779987]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.779990]  ? __x64_sys_fallocate+0x71/0xa0
[252872.779993]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.779995]  ? syscall_exit_to_user_mode+0x4e/0x250
[252872.779998]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780001]  ? do_syscall_64+0x8a/0x170
[252872.780003]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780006]  ? kvm_vcpu_ioctl+0x1b8/0x890 [kvm]
[252872.780033]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780036]  ? kvm_vm_ioctl+0x893/0xb50 [kvm]
[252872.780063]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780067]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780069]  ? fire_user_return_notifiers+0x37/0x70
[252872.780073]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780076]  ? syscall_exit_to_user_mode+0x4e/0x250
[252872.780079]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780081]  ? do_syscall_64+0x8a/0x170
[252872.780084]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780087]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780090]  ? __x64_sys_ioctl+0xbb/0xf0
[252872.780093]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780095]  ? syscall_exit_to_user_mode+0x4e/0x250
[252872.780099]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780101]  ? do_syscall_64+0x8a/0x170
[252872.780103]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780106]  ? do_syscall_64+0x8a/0x170
[252872.780108]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780110]  ? irqentry_exit_to_user_mode+0x43/0x250
[252872.780113]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780116]  ? irqentry_exit+0x43/0x50
[252872.780118]  ? srso_alias_return_thunk+0x5/0xfbef5
[252872.780121]  ? sysvec_apic_timer_interrupt+0x57/0xc0
[252872.780124]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[252872.780126] RIP: 0033:0x78a3e8d24ded
[252872.780130] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[252872.780132] RSP: 002b:0000789da3bff6e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[252872.780136] RAX: ffffffffffffffda RBX: 0000622d56f2e1e0 RCX: 000078a3e8d24ded
[252872.780138] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000010
[252872.780139] RBP: 0000789da3bff730 R08: 0000000000000000 R09: 0000000000000000
[252872.780141] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[252872.780142] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[252872.780147]  </TASK>
[252872.780159] kvm_amd: SEV: Failed to update RMP entry: GFN 692ec PFN 56701 level 1 error -14
[252872.780177] gmem: Failed to prepare folio for index 692ec GFN 692ec PFN 56701 error -22.
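
The numbers in these excerpts can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes 4 KiB pages, so 512 PFNs per 2 MiB region (which matches the `[0x56600 - 0x56800]` range the kernel printed), and it treats the negative return values as ordinary Linux errno codes. The helper names `region_2m` and `describe_errno` are made up for this example.

```python
import errno
import os

PFNS_PER_2M = 512  # assumption: 2 MiB region / 4 KiB pages, as the log range suggests


def region_2m(pfn: int) -> tuple:
    """Return the [start, end) PFN range of the 2 MiB region containing pfn."""
    start = pfn & ~(PFNS_PER_2M - 1)
    return start, start + PFNS_PER_2M


def describe_errno(ret: int) -> str:
    """Turn a negative kernel return value such as -14 into 'EFAULT (...)'."""
    code = -ret
    return f"{errno.errorcode[code]} ({os.strerror(code)})"


start, end = region_2m(0x56700)
print(f"PFN 0x56700 lies in [{start:#x} - {end:#x}]")
print("error -14:", describe_errno(-14))
print("error -22:", describe_errno(-22))
```

On a Linux host, -22 decodes to EINVAL, whose message is "Invalid argument". That is consistent with the "kvm run failed Invalid argument" text QEMU printed, although the sketch itself does not prove the two are the same failure.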

tlendacky commented 1 day ago

> [255267.410571] SEV-SNP: PFN: 0x56613, [0x8000000000000005 - 0x0000000000000000]

Is there a line like this in dmesg for PFN 0x56700?

Can you paste the output of: dmesg | grep RMP

Does the kernel you are running have the following fix(es):

danko-miladinovic commented 1 day ago

Thanks. One stupid question. I should look for these changes on the host kernel, right? These fixes are to be done on the host kernel, right?

There is some problem on the server: other messages are clogging up the dmesg log, so I no longer see any log messages regarding PFN 0x56700. I do not have those fixes. I only saw them today. Will they potentially fix my problem?

tlendacky commented 1 day ago

> Thanks. One stupid question. I should look for these changes on the host kernel, right? These fixes are to be done on the host kernel, right?

Yes

> There is some problem on the server: other messages are clogging up the dmesg log, so I no longer see any log messages regarding PFN 0x56700. I do not have those fixes. I only saw them today. Will they potentially fix my problem?

Yes, assuming that the PFN in question was in the last 1MB of the 2MB range that encompasses the end of the RMP table. Patches 400fea4b9651 and d6d85ac15cce reserved the memory in order for kexec to work properly. However, without 88a921aa3c6b, that reserved memory was still made available for allocation. Since it is viewed as reserved in the e820 table, the SNP support will pass the address and range to SNP_INIT to make it hypervisor fixed, which is why the RMPUPDATE fails for any page allocated in that range. Patch 88a921aa3c6b prevents that memory from being allocatable.

But if you don't have any of those patches on the host, I'm not sure what is occurring.
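
The condition described above (a PFN that falls between the end of the RMP table and the next 2 MiB boundary) can be checked by hand once the RMP table's physical range is known, for example from the `dmesg | grep RMP` output requested earlier. The following is a minimal sketch, not kernel code: the function name `in_rmp_tail` is hypothetical, and the example address is invented purely to exercise the arithmetic. It is not taken from the reporter's machine.

```python
PAGE_SHIFT = 12            # assumption: 4 KiB pages
SZ_2M = 2 * 1024 * 1024


def in_rmp_tail(pfn: int, rmp_end: int) -> bool:
    """True if pfn lies between the end of the RMP table and the next 2 MiB boundary.

    rmp_end is the physical address of the table's last byte (inclusive).
    """
    first_free = rmp_end + 1
    boundary = (first_free + SZ_2M - 1) & ~(SZ_2M - 1)  # round up to 2 MiB
    addr = pfn << PAGE_SHIFT
    return first_free <= addr < boundary


# Invented example: a table that ends 1 MiB into a 2 MiB region.
EXAMPLE_RMP_END = 0x1000FFFFF
print(in_rmp_tail(0x100100, EXAMPLE_RMP_END))  # first page after the table
print(in_rmp_tail(0x100200, EXAMPLE_RMP_END))  # first page of the next 2 MiB region
```

If the table already ends on a 2 MiB boundary, the range is empty and the function returns False for every PFN, which would point away from this particular explanation.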

danko-miladinovic commented 1 day ago

Sorry, the patches:

mdroth commented 1 day ago

The snp-host-latest kernel tree already includes that patch, so a new build should pick it up automatically:

https://github.com/AMDESE/linux/commit/718de610c49f35c32a7891b9b1c853dd037880c0