AMDESE / AMDSEV

AMD Secure Encrypted Virtualization

Linux APIs to allocate the shared memory with SEV-SNP #109

Open kpadwal opened 1 year ago

kpadwal commented 1 year ago

In the case of SEV guests, if we are not using SWIOTLB, we can allocate shared pages through the Linux API dma_alloc_coherent(). Similarly, for SEV-SNP, how can we allocate shared (unencrypted) pages?

tlendacky commented 1 year ago

It depends on what you want to do. If you're in the kernel, you can just do a kmalloc() followed by a set_memory_decrypted(). Before freeing that memory, though, you need to perform a set_memory_encrypted().
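For illustration, a minimal sketch of that sequence in guest kernel code (the helper names, the one-page size, and the error handling are illustrative assumptions, not an API from this thread; allocating in whole-page units keeps the attribute change from affecting unrelated objects sharing a page):

#include <linux/slab.h>
#include <linux/set_memory.h>

/* Illustrative sketch: allocate one page and share it with the host. */
static void *snp_alloc_shared_page(void)
{
        /* A PAGE_SIZE kmalloc() is page-aligned, so the whole page is ours. */
        void *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);

        if (!buf)
                return NULL;

        /*
         * Clear the C-bit for the page; for SNP this also issues the
         * page state change to make the page shared in the RMP.
         */
        if (set_memory_decrypted((unsigned long)buf, 1)) {
                kfree(buf);
                return NULL;
        }

        return buf;
}

static void snp_free_shared_page(void *buf)
{
        /* Restore the encrypted attribute before returning the page. */
        if (!set_memory_encrypted((unsigned long)buf, 1))
                kfree(buf);
        /* On failure, leak the page rather than free a still-shared page. */
}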

But, dma_alloc_coherent() works the same on SEV, SEV-ES and SEV-SNP.
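A corresponding hedged sketch of the dma_alloc_coherent() path (dev, the helper names, and the one-page size are illustrative assumptions):

#include <linux/dma-mapping.h>

/* dev is the struct device of the PCI function doing DMA. */
static void *alloc_dma_buf(struct device *dev, dma_addr_t *dma_handle)
{
        /*
         * In an SEV/SEV-ES/SEV-SNP guest, force_dma_unencrypted() makes
         * the DMA layer hand back memory that is already mapped shared
         * (unencrypted), so the device can DMA to/from it directly.
         */
        return dma_alloc_coherent(dev, PAGE_SIZE, dma_handle, GFP_KERNEL);
}

static void free_dma_buf(struct device *dev, void *cpu_addr, dma_addr_t dma_handle)
{
        dma_free_coherent(dev, PAGE_SIZE, cpu_addr, dma_handle);
}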

kpadwal commented 1 year ago

Thanks for the quick response.

> But, dma_alloc_coherent() works the same on SEV, SEV-ES and SEV-SNP.

I want to use dma_alloc_coherent(), but I have observed that force_dma_unencrypted() checks for CC_ATTR_GUEST_MEM_ENCRYPT (SEV and SEV-ES) and not for CC_ATTR_GUEST_SEV_SNP (SEV-SNP).

https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/mem_encrypt.c#L17

/*
 * Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED
 */
bool force_dma_unencrypted(struct device *dev)
{

tlendacky commented 1 year ago

> Thanks for the quick response.
>
> > But, dma_alloc_coherent() works the same on SEV, SEV-ES and SEV-SNP.
>
> I want to use dma_alloc_coherent(), but I have observed that force_dma_unencrypted() checks for CC_ATTR_GUEST_MEM_ENCRYPT (SEV and SEV-ES) and not for CC_ATTR_GUEST_SEV_SNP (SEV-SNP).

CC_ATTR_GUEST_MEM_ENCRYPT => sev_status & MSR_AMD64_SEV_ENABLED, which is valid for SEV, SEV-ES and SEV-SNP.

https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/mem_encrypt.c#L17

/*
 * Override for DMA direct allocation check - ARCH_HAS_FORCE_DMA_UNENCRYPTED
 */
bool force_dma_unencrypted(struct device *dev)
{
	/*
	 * For SEV, all DMA must be to unencrypted addresses.
	 */
	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
		return true;

Also, set_memory_decrypted() validates the platform attribute CC_ATTR_MEM_ENCRYPT, which represents SME, SEV, and SEV-ES.

https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/pat/set_memory.c#L2077

CC_ATTR_MEM_ENCRYPT => sme_me_mask, which is non-zero for SME, SEV, SEV-ES and SEV-SNP.
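For reference, this is roughly how those attributes resolve on AMD; a trimmed paraphrase of amd_cc_platform_has() from arch/x86/coco/core.c in kernels of this era (check your exact tree, the code varies by version):

/*
 * Trimmed paraphrase of amd_cc_platform_has(). Every SNP guest also has
 * MSR_AMD64_SEV_ENABLED set in sev_status, so both of the attributes the
 * DMA and set_memory paths test cover SNP as well.
 */
static bool amd_cc_platform_has(enum cc_attr attr)
{
	switch (attr) {
	case CC_ATTR_MEM_ENCRYPT:		/* SME, SEV, SEV-ES, SEV-SNP */
		return sme_me_mask;
	case CC_ATTR_GUEST_MEM_ENCRYPT:		/* SEV, SEV-ES, SEV-SNP */
		return sev_status & MSR_AMD64_SEV_ENABLED;
	case CC_ATTR_GUEST_SEV_SNP:		/* SEV-SNP only */
		return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
	default:
		return false;
	}
}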

kpadwal commented 1 year ago

I tried to allocate memory through dma_alloc_coherent(), but I am seeing an RMP_PAGE_FAULT with SEV-SNP. It looks like the page is owned by the AMD-SP and in a pre-guest state? I was expecting that if we allocate the memory through dma_alloc_coherent(), it would be hypervisor-owned.

[Oct19 23:39] vfio-pci 0000:c1:00.0: AMD-Vi: Event logged [RMP_PAGE_FAULT vmg_tag=0x0000, gpa=0x743ca784, flags_rmp=0x0000, flags=0x0020]
[  +0.000012] BUG: unable to handle page fault for address: ffff96fcc01ea000
[  +0.000002] #PF: supervisor write access in kernel mode
[  +0.000003] #PF: error_code(0x80000003) - RMP violation
[  +0.000001] PGD 16bc201067 P4D 16bc201067 PUD 100226063 PMD 100228063 PTE 80000001001ea163
[  +0.000005] SEV-SNP: RMPEntry paddr 0x1001ea000 [assigned=1 immutable=1 pagesize=0 gpa=0x2000000001 asid=0 vmsa=0 validated=0]
tlendacky commented 1 year ago

So the dma_alloc_coherent() was performed in the SNP guest and the returned DMA address was supplied to the device? Do you have any logging of or around that? I don't know how much help I can be without a full picture of what is occurring and how.

kpadwal commented 1 year ago

Sorry for the delayed reply. I don't have it right now, I will gather more details and will share them.

kpadwal commented 1 year ago

I have set up an SEV-SNP-enabled host and guest on Ubuntu 20.04 and am trying to initialize a PCI device. For that, I allocated memory through dma_alloc_coherent() in the SNP guest and mapped that memory to the PCI device. When the device tries to update that memory, we see the RMP_PAGE_FAULT on the host.

I have debugged the RMP page entry; it looks like it is a shared page (assigned=0 and the other fields are zero).

I also observed that the kernel is reporting a different RMPEntry after the fault (RMPEntry paddr 0x1001ea000).

Guest log: dma_alloc_coherent() returned the following address:

[  +0.000736] virt_addr = ffff89f2b40b8000
[  +0.000001] phys_addr = 740b8000
[  +0.000005] dma_addr = 740b8000

The snippet of the Host Dmesg log:

[  +0.017276] [KP][snp_handle_page_state_change] In gpa = 0x740b8000 op = 2
[  +0.000003] [KP][snp_make_page_shared] In gpa = 0x740b8000
[  +0.000002] [KP][snp_make_page_shared] test1
[  +0.000001] SEV-SNP: [KP][rmp_make_shared] In pfn = 0x1f52b8
[  +0.000001] SEV-SNP: [KP][rmpupdate] In pfn =0x1f52b8
[  +0.000003] SEV-SNP: [KP][rmpupdate] Out ret = 0
[  +0.000001] [KP][snp_handle_page_state_change] Out Updated gpa = 0x740b9000 op = 2 pfn 1f52b8 level 1
[  +0.000001] SEV-SNP: [KP][dump_rmpentry] In pfn = 0x1f52b8
[  +0.000001] SEV-SNP: RMPEntry [KP][dump_rmpentry] paddr 0x1f52b8000 [assigned=0 immutable=0 pagesize=0 gpa=0x0 asid=0 vmsa=0 validated=0]

[  +0.000398] vfio-pci 0000:c1:00.0: AMD-Vi: Event logged [RMP_PAGE_FAULT vmg_tag=0x0000, gpa=0x740b8784, flags_rmp=0x0000, flags=0x0020]
[  +0.000013] BUG: unable to handle page fault for address: ffff993fc01ea000
[  +0.000003] #PF: supervisor write access in kernel mode
[  +0.000003] #PF: error_code(0x80000003) - RMP violation
[  +0.000002] PGD 15e1401067 P4D 15e1401067 PUD 100226063 PMD 100228063 PTE 80000001001ea163
[  +0.000006] SEV-SNP: [KP][dump_rmpentry] In pfn = 0x1001ea
[  +0.000003] SEV-SNP: RMPEntry paddr 0x1001ea000 [assigned=1 immutable=1 pagesize=0 gpa=0x2000000001 asid=0 vmsa=0 validated=0]
[  +0.000005] Oops: 0003 [#1] SMP NOPTI

tlendacky commented 1 year ago

It almost looks like the GPA to SPA is not programmed correctly for the IOMMU?

The host/hypervisor thinks that GPA 0x740b8000 is at SPA 0x1f52b8000, and so 0x740b8784 would be 0x1f52b8784, but the IOMMU is saying that GPA 0x740b8784 has an SPA of 0x1001ea784? What does the Oops show for a stack trace?

kpadwal commented 1 year ago

Thanks for the quick response.

Is the device driver expected to do any setup for the IOMMU? I would expect the SEV-SNP patches to have already taken care of that, so the device driver should not need to do anything?

What more can I debug from my side?

[  +0.000398] vfio-pci 0000:c1:00.0: AMD-Vi: Event logged [RMP_PAGE_FAULT vmg_tag=0x0000, gpa=0x740b8784, flags_rmp=0x0000, flags=0x0020]
[  +0.000013] BUG: unable to handle page fault for address: ffff993fc01ea000
[  +0.000003] #PF: supervisor write access in kernel mode
[  +0.000003] #PF: error_code(0x80000003) - RMP violation
[  +0.000002] PGD 15e1401067 P4D 15e1401067 PUD 100226063 PMD 100228063 PTE 80000001001ea163
[  +0.000006] SEV-SNP: [KP][dump_rmpentry] In pfn = 0x1001ea
[  +0.000003] SEV-SNP: RMPEntry paddr 0x1001ea000 [assigned=1 immutable=1 pagesize=0 gpa=0x2000000001 asid=0 vmsa=0 validated=0]
[  +0.000005] Oops: 0003 [#1] SMP NOPTI
[  +0.000003] CPU: 2 PID: 229 Comm: irq/26-AMD-Vi Not tainted 5.19.0-rc6-snp-host-d9bd54fea4d2 #7
[  +0.000004] Hardware name: ASRockRack 1U1G-MILAN/N/ROMED8-NL, BIOS L3.12C 07/19/2022
[  +0.000003] RIP: 0010:amd_iommu_int_thread+0x3b3/0x720
[  +0.000008] Code: 40 48 85 ff 0f 84 9e 01 00 00 48 83 c7 40 48 c7 c6 70 98 ed 95 e8 cd b9 e7 ff 85 c0 0f 85 22 6e 45 00 4c 89 e7 e8 bd 68 eb ff <48> c7 03 00 00 00 00 48 c7 43 08 00 00 00 00 8b 45 b0 83 c0 10 25
[  +0.000003] RSP: 0018:ffffb0efc8813e10 EFLAGS: 00010206
[  +0.000003] RAX: 0000000000000005 RBX: ffff993fc01ea000 RCX: 0000000000000000
[  +0.000003] RDX: ffff995e8f2ac440 RSI: ffff995e8f2a0560 RDI: ffff993fc9e16108
[  +0.000002] RBP: ffffb0efc8813e80 R08: ffff995e8f2a0560 R09: ffffb0efc8813b88
[  +0.000001] R10: 000000006bb70a98 R11: 000000006bb70b00 R12: 0000000000000000
[  +0.000002] R13: 000000000000c100 R14: ffff993fc9e16000 R15: 0000000000000020
[  +0.000003] FS: 0000000000000000(0000) GS:ffff995e8f280000(0000) knlGS:0000000000000000
[  +0.000003] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: ffff993fc01ea000 CR3: 000000012714c004 CR4: 0000000000770ee0
[  +0.000003] PKRU: 55555554
[  +0.000001] Call Trace:
[  +0.000002]  <TASK>
[  +0.000004]  ? irq_finalize_oneshot.part.0+0xf0/0xf0
[  +0.000006]  irq_thread_fn+0x28/0x60
[  +0.000004]  irq_thread+0xe6/0x1a0
[  +0.000003]  ? irq_forced_thread_fn+0x90/0x90
[  +0.000003]  ? irq_thread_check_affinity+0xf0/0xf0
[  +0.000004]  kthread+0xcf/0xf0
[  +0.000005]  ? kthread_complete_and_exit+0x20/0x20
[  +0.000003]  ret_from_fork+0x22/0x30
[  +0.000006]  </TASK>
[  +0.000001] Modules linked in: intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd ipmi_ssif binfmt_misc kvm nls_iso8859_1 rapl wmi_bmof efi_pstore joydev input_leds cdc_ether usbnet mii ccp k10temp acpi_ipmi ipmi_si mac_hid sch_fq_codel ipmi_devintf ipmi_msghandler nfsd auth_rpcgss msr nfs_acl parport_pc lockd grace ppdev lp parport sunrpc ip_tables x_tables autofs4 mlx5_ib ib_uverbs ib_core hid_generic usbhid hid ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd mlx5_core ahci nvme pci_hyperv_intf drm mlxfw libahci psample nvme_core i2c_piix4 tls wmi
[  +0.000077] CR2: ffff993fc01ea000
[  +0.000002] ---[ end trace 0000000000000000 ]---
[  +0.149323] RIP: 0010:amd_iommu_int_thread+0x3b3/0x720
[  +0.000014] Code: 40 48 85 ff 0f 84 9e 01 00 00 48 83 c7 40 48 c7 c6 70 98 ed 95 e8 cd b9 e7 ff 85 c0 0f 85 22 6e 45 00 4c 89 e7 e8 bd 68 eb ff <48> c7 03 00 00 00 00 48 c7 43 08 00 00 00 00 8b 45 b0 83 c0 10 25
[  +0.000007] RSP: 0018:ffffb0efc8813e10 EFLAGS: 00010206
[  +0.000006] RAX: 0000000000000005 RBX: ffff993fc01ea000 RCX: 0000000000000000
[  +0.000004] RDX: ffff995e8f2ac440 RSI: ffff995e8f2a0560 RDI: ffff993fc9e16108
[  +0.000004] RBP: ffffb0efc8813e80 R08: ffff995e8f2a0560 R09: ffffb0efc8813b88
[  +0.000004] R10: 000000006bb70a98 R11: 000000006bb70b00 R12: 0000000000000000
[  +0.000004] R13: 000000000000c100 R14: ffff993fc9e16000 R15: 0000000000000020
[  +0.000004] FS: 0000000000000000(0000) GS:ffff995e8f280000(0000) knlGS:0000000000000000
[  +0.000005] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000004] CR2: ffff993fc01ea000 CR3: 000000012714c004 CR4: 0000000000770ee0
[  +0.000003] PKRU: 55555554
[  +0.000068] BUG: unable to handle page fault for address: ffffb0efc8814d30
[  +0.000004] #PF: supervisor write access in kernel mode
[  +0.000003] #PF: error_code(0x0002) - not-present page
[  +0.000002] PGD 100000067 P4D 100000067 PUD 1001dc067 PMD 11297c067 PTE 0

[  +0.000007] Oops: 0002 [#2] SMP NOPTI
[  +0.000004] CPU: 2 PID: 229 Comm: irq/26-AMD-Vi Tainted: G D 5.19.0-rc6-snp-host-d9bd54fea4d2 #7
[  +0.000005] Hardware name: ASRockRack 1U1G-MILAN/N/ROMED8-NL, BIOS L3.12C 07/19/2022
[  +0.000002] RIP: 0010:mutex_lock+0x1e/0x40
[  +0.000006] Code: c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc e8 6d da ff ff 31 c0 65 48 8b 14 25 c0 fb 01 00 <f0> 49 0f b1 14 24 75 04 41 5c 5d c3 4c 89 e7 e8 ae ff ff ff 41 5c
[  +0.000004] RSP: 0018:ffffb0efc8813e38 EFLAGS: 00010246
[  +0.000003] RAX: 0000000000000000 RBX: ffffb0efc8813f28 RCX: 0000000000000000
[  +0.000003] RDX: ffff993fd1661940 RSI: ffffb0efc8815468 RDI: ffffb0efc8814d30
[  +0.000002] RBP: ffffb0efc8813e40 R08: ffffffff964669e0 R09: 00000000000003c0
[  +0.000003] R10: 000000006bb71db0 R11: 000000006bb71e18 R12: ffffb0efc8814d30
[  +0.000003] R13: ffffb0efc8814d30 R14: 0000000000000001 R15: 0000000000000046
[  +0.000002] FS: 0000000000000000(0000) GS:ffff995e8f280000(0000) knlGS:0000000000000000
[  +0.000003] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000003] CR2: ffffb0efc8814d30 CR3: 000000012714c004 CR4: 0000000000770ee0
[  +0.000002] PKRU: 55555554
[  +0.000002] Call Trace:
[  +0.000002]  <TASK>
[  +0.000003]  perf_event_exit_task+0x28/0x230
[  +0.000006]  ? complete+0x4c/0x60
[  +0.000006]  do_exit+0x353/0xb00
[  +0.000005]  ? task_work_run+0x6e/0xa0
[  +0.000005]  do_exit+0x343/0xb00
[  +0.000004]  make_task_dead+0x5a/0x60
[  +0.000004]  rewind_stack_and_make_dead+0x17/0x20
[  +0.000007]  </TASK>
[  +0.000001] Modules linked in: intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd ipmi_ssif binfmt_misc kvm nls_iso8859_1 rapl wmi_bmof efi_pstore joydev input_leds cdc_ether usbnet mii ccp k10temp acpi_ipmi ipmi_si mac_hid sch_fq_codel ipmi_devintf ipmi_msghandler nfsd auth_rpcgss msr nfs_acl parport_pc lockd grace ppdev lp parport sunrpc ip_tables x_tables autofs4 mlx5_ib ib_uverbs ib_core hid_generic usbhid hid ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd mlx5_core ahci nvme pci_hyperv_intf drm mlxfw libahci psample nvme_core i2c_piix4 tls wmi
[  +0.000074] CR2: ffffb0efc8814d30
[  +0.000002] ---[ end trace 0000000000000000 ]---
[  +0.797893] RIP: 0010:amd_iommu_int_thread+0x3b3/0x720
[  +0.000008] Code: 40 48 85 ff 0f 84 9e 01 00 00 48 83 c7 40 48 c7 c6 70 98 ed 95 e8 cd b9 e7 ff 85 c0 0f 85 22 6e 45 00 4c 89 e7 e8 bd 68 eb ff <48> c7 03 00 00 00 00 48 c7 43 08 00 00 00 00 8b 45 b0 83 c0 10 25
[  +0.000003] RSP: 0018:ffffb0efc8813e10 EFLAGS: 00010206
[  +0.000003] RAX: 0000000000000005 RBX: ffff993fc01ea000 RCX: 0000000000000000
[  +0.000002] RDX: ffff995e8f2ac440 RSI: ffff995e8f2a0560 RDI: ffff993fc9e16108
[  +0.000002] RBP: ffffb0efc8813e80 R08: ffff995e8f2a0560 R09: ffffb0efc8813b88
[  +0.000002] R10: 000000006bb70a98 R11: 000000006bb70b00 R12: 0000000000000000
[  +0.000002] R13: 000000000000c100 R14: ffff993fc9e16000 R15: 0000000000000020
[  +0.000002] FS: 0000000000000000(0000) GS:ffff995e8f280000(0000) knlGS:0000000000000000
[  +0.000001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: ffffb0efc8814d30 CR3: 000000012714c004 CR4: 0000000000770ee0
[  +0.000001] PKRU: 55555554
[  +0.000001] Fixing recursive fault but reboot is needed!

ashkalra commented 1 year ago

Once SNP is initialized, the IOMMU only does RMP enforcement, i.e., it checks that all DMA I/O happens to/from shared memory (there is no additional setup needed on the IOMMU for SNP).

[  +0.000006] SEV-SNP: [KP][dump_rmpentry] In pfn = 0x1001ea
[  +0.000003] SEV-SNP: RMPEntry paddr 0x1001ea000 [assigned=1 immutable=1 pagesize=0 gpa=0x2000000001 asid=0 vmsa=0 validated=0]

I don't understand why the host RMP #PF handler is dumping (above) an incorrect entry with regard to the IOMMU RMP #PF event. The RMP #PF handler indicates a Firmware page state, but then the gpa is non-zero, so... More importantly, it does not match the RMP entry you set up above for DMA I/O:

[  +0.000001] SEV-SNP: [KP][dump_rmpentry] In pfn = 0x1f52b8
[  +0.000001] SEV-SNP: RMPEntry [KP][dump_rmpentry] paddr 0x1f52b8000 [assigned=0 immutable=0 pagesize=0 gpa=0x0 asid=0 vmsa=0 validated=0]

Also, the IOMMU RMP fault event does not match this (it would be logged when the IOMMU page table walk is done):

[  +0.000398] vfio-pci 0000:c1:00.0: AMD-Vi: Event logged [RMP_PAGE_FAULT vmg_tag=0x0000, gpa=0x740b8784, flags_rmp=0x0000, flags=0x0020]

For additional debugging, can you add the following helper to your code and call it from the host RMP #PF handler in case of an RMP #PF violation:

+void dump_all_rmpentry(void)
+{
+       struct rmpentry *e;
+       unsigned long pfn;
+
+       for (pfn = 0; pfn < 0xFFFFFFFFF; pfn++) {
+               unsigned long paddr = pfn << PAGE_SHIFT;
+
+               e = rmptable_entry(paddr);
+               if (IS_ERR(e))
+                       break;
+
+               pr_info("RMPEntry paddr 0x%llx [assigned=%d immutable=%d pagesize=%d gpa=0x%lx"
+                                       " asid=%d vmsa=%d validated=%d]\n", pfn << PAGE_SHIFT,
+                                       rmpentry_assigned(e), rmpentry_immutable(e), rmpentry_pagesize(e),
+                                       rmpentry_gpa(e), rmpentry_asid(e), rmpentry_vmsa(e),
+                                       rmpentry_validated(e));
+       }
+}
+
kpadwal commented 1 year ago

What is the way to dump the IOMMU table?

ashkalra commented 1 year ago

There is no IOMMU table; the IOMMU references the RMP table.

ashkalra commented 1 year ago

I assume you are not referring to the IOMMU page tables, which are used for translating IOVAs to SPAs.

After this translation, the IOMMU references the RMP table (using the SPA) for page-ownership checks and other RMP checks.

It is also possible that the IOMMU page tables are incorrectly programmed here, i.e., the translation from IOVA to SPA may be incorrect.

kpadwal commented 1 year ago

Thanks, I am debugging the same. Do you have any more pointers for debugging it?

tlendacky commented 1 year ago

We believe the host #PF is a bug in the IOMMU driver. Please try the following patch on the host side. It should prevent the host crash; we still need to understand why the I/O RMP page fault is occurring, but hopefully this will keep things alive long enough to debug further.

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 59f9607b34bc..171cb4bc48a0 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -672,7 +672,8 @@ static void iommu_print_event(struct amd_iommu *iommu, void *__evt)
            event[0], event[1], event[2], event[3]);
    }

-   memset(__evt, 0, 4 * sizeof(u32));
+   if (!amd_iommu_snp_en)
+       memset(__evt, 0, 4 * sizeof(u32));
 }

 static void iommu_poll_events(struct amd_iommu *iommu)
@@ -744,7 +745,8 @@ static void iommu_poll_ppr_log(struct amd_iommu *iommu)
         * To detect the hardware bug we need to clear the entry
         * back to zero.
         */
-       raw[0] = raw[1] = 0UL;
+       if (!amd_iommu_snp_en)
+           raw[0] = raw[1] = 0UL;

        /* Update head pointer of hardware ring-buffer */
        head = (head + PPR_ENTRY_SIZE) % PPR_LOG_SIZE;
ashkalra commented 1 year ago

Also, for debugging/tracing IOVA to SPA translations, please enable kernel's IOMMU tracepoint events:

# echo 0 > /sys/kernel/tracing/tracing_on
# echo > /sys/kernel/tracing/trace
# echo 1 > /sys/kernel/tracing/events/iommu/enable
# echo 1 > /sys/kernel/tracing/tracing_on

(dumping trace log)
# cat /sys/kernel/tracing/trace

This should assist in tracing IOVA to SPA mapping calls.

But this will produce a lot of trace log; we can probably add a filter for the "/sys/kernel/tracing/events/iommu/map" event.
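For example, something like this should restrict tracing to the map events (assuming the standard tracefs layout, where each event directory has its own enable file):

# echo 0 > /sys/kernel/tracing/events/iommu/enable
# echo 1 > /sys/kernel/tracing/events/iommu/map/enable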
ashkalra commented 1 year ago

Along with the fix above for the host RMP page fault, there is currently a hack needed to force the IOMMU to map 4K pages, which fixes the IOMMU RMP page fault (attached below). Please test with both patches to verify that they fix both the host and IOMMU RMP page faults and that you are able to do PCIe device pass-through to an SNP guest.

ashkalra commented 1 year ago
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 5b1019dab328..ec317e7c348a 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -275,7 +275,7 @@
  *
  * 512GB Pages are not supported due to a hardware bug
  */
-#define AMD_IOMMU_PGSIZES      ((~0xFFFUL) & ~(2ULL << 38))
+#define AMD_IOMMU_PGSIZES      (PAGE_SIZE)
 /* Bit value definition for dte irq remapping fields*/
 #define DTE_IRQ_PHYS_ADDR_MASK (((1ULL << 45)-1) << 6)
kpadwal commented 1 year ago

Thanks, I will validate the patches and update the result.

kpadwal commented 1 year ago

Thanks, @ashkalra, with these two patches, I don't see the RMP page fault and I am able to initialize my PCI device driver.

Do we have a plan to submit these two patches to this GitHub tree branch (sev-snp-iommu-avic_5.19-rc6_v4)?

ashkalra commented 1 year ago

That's good to know.

Yes, we definitely want to submit these two patches to sev-snp-iommu-avic_5.19-rc6_v4, but as we discussed in the call yesterday, we are still discussing internally what the best fix for the IOMMU RMP page fault issue is, and we will push the patches by next week. Thanks.

dcui commented 1 year ago

Hi @tlendacky @ashkalra , I'm curious about the two patches you made on Nov 21, Nov 28, 2022. Do they fix or work around a bug in the Linux AMD IOMMU software driver or in the IOMMU hardware?

tlendacky commented 1 year ago

> Hi @tlendacky @ashkalra , I'm curious about the two patches you made on Nov 21, Nov 28, 2022. Do they fix or work around a bug in the Linux AMD IOMMU software driver or in the IOMMU hardware?

Those patches fix bugs in the IOMMU driver and SNP host/hypervisor support.

kpadwal commented 1 year ago

Hi @ashkalra / @tlendacky, I would like to know whether these patches have been merged into the branch. If they haven't, could you please share the plan for getting them checked in?

ashkalra commented 1 year ago

Hi @kpadwal, we believe that this issue is implicitly resolved for SNP with UPM (unmapped private memory) support. The reason is that 2M IOMMU mappings will only ever exist for THP pages. With UPM, shared memory is allocated via malloc()/mmap() and private memory is allocated separately via memfd. Since shared THPs are allocated separately from private memory, those shared THPs can never overlap with private pages unless one of the sub-pages of that shared THP is converted to private, which will never happen with UPM.

But there is another issue: the IOMMU mappings are set up statically when qemu's memory listener calls vfio_iommu_map, and after that they are not unmapped/re-mapped. So during PSC handling, when DMA buffers are transitioned from private->shared, new PFN mappings will be set up in the NPT (since shared pages are allocated from a separate memory pool), but the static IOMMU mappings will now be stale, as they are not getting updated/remapped. To fix this, we have currently implemented a solution that uses QEMU's RamDiscardManager.

So, you should be able to use the current UPM branch https://github.com/AMDESE/linux/tree/upmv10-host-snp-v8-rfc for host SNP with UPM. The QEMU RamDiscardManager support is still under testing and needs to be merged into the QEMU tree being used for UPM.

kpadwal commented 1 year ago

Thanks for the information, we will verify that.