AMDESE / AMDSEV

AMD Secure Encrypted Virtualization

Memory performance drop on the latest SNP kernel #225

Open mmisono opened 3 months ago

mmisono commented 3 months ago

I found that the latest kernel (or OVMF?) has some performance issues compared to the 6.6 kernel. For example, on an SNP guest, (1) Intel MLC reports ~40 ns longer latency, and (2) RAMspeed (Integer Add) shows only 64% of the previous performance. A normal VM does not have this issue.

I use the same guest (Linux 6.8) for the tests, and the VM has 8 vCPUs and 64 GB of memory. I also find that the latest host kernel and QEMU version take much more time to boot an SNP guest, especially in the preallocation case. I'm not sure whether that is related to this issue.
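
For reference, the guest is launched roughly like this (a sketch using the AMDSEV repo's launch-qemu.sh helper; the exact flag names are assumed and may differ in your checkout):

# launch an 8-vCPU / 64 GB SNP guest (sketch; helper script and flag names assumed)
./launch-qemu.sh -hda guest.qcow2 -sev-snp -smp 8 -mem 65536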

6.10-rc7 (kvm-next commit 332d2c1d7)

# ./mlc

Intel(R) Memory Latency Checker - v3.11a
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0         169.3

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      73182.9
3:1 Reads-Writes :      105536.8
2:1 Reads-Writes :      113937.9
1:1 Reads-Writes :      113943.7
Stream-triad like:      92945.8
# phoronix-test-suite run pts/ramspeed # choose to run Add/Integer

RAMspeed SMP 3.5.0: 
    pts/ramspeed-1.4.3 [Type: Add - Benchmark: Integer]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 2 Minutes [12:00 UTC]
        Started Run 1 @ 11:58:33
        Started Run 2 @ 11:59:10
        Started Run 3 @ 11:59:47
        Started Run 4 @ 12:00:28 *
        Started Run 5 @ 12:01:11 *
        Started Run 6 @ 12:01:54 *
        Started Run 7 @ 12:02:36 *
        Started Run 8 @ 12:03:20 *
        Started Run 9 @ 12:04:06 *
        Started Run 10 @ 12:04:50 *
        Started Run 11 @ 12:05:34 *
        Started Run 12 @ 12:06:20 *

    Type: Add - Benchmark: Integer:
        56815.39
        52502.91
        50039.92
        44957.92
        46092.53
        45719.11
        45475.91
        42174.28
        44775.08
        43696.13
        42649.23
        41598.12

    Average: 46374.71 MB/s
    Deviation: 9.80%
    Samples: 12

I use the current snp-latest branches of QEMU and OVMF. I also see similar performance on the 6.9 kernel (the current snp-host-latest). I could not find a 6.8 kernel to test yet.
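
For reference, the host components are built roughly as follows (a sketch assuming the AMDSEV repo's build.sh helper on the snp-latest branch; adjust to your setup):

# build the SNP host kernel, OVMF, and QEMU packages (sketch; helper and options assumed)
git clone -b snp-latest https://github.com/AMDESE/AMDSEV.git
cd AMDSEV
./build.sh --package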

6.6 kernel

Intel(R) Memory Latency Checker - v3.11a
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0         132.1

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      75637.0
3:1 Reads-Writes :      107827.4
2:1 Reads-Writes :      115660.3
1:1 Reads-Writes :      114199.8
Stream-triad like:      96093.4
RAMspeed SMP 3.5.0:
    pts/ramspeed-1.4.3 [Type: Add - Benchmark: Integer]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 7 Minutes [11:47 UTC]
        Started Run 1 @ 11:41:30
        Started Run 2 @ 11:42:01
        Started Run 3 @ 11:42:32

    Type: Add - Benchmark: Integer:
        73743.37
        73565.26
        72851.9

    Average: 73386.84 MB/s
    Deviation: 0.64%

Environment

mmisono commented 3 months ago

I found this text in the README:

The latest upstream version guest_memfd (which the SNP KVM support relies on) no longer supports 2MB hugepages for backing guests. There are discussions on how best to re-enable this support, but in the meantime SNP guests will utilize 4K pages for private guest memory. Please keep this in mind for any performance-related testing/observations.

I think this explains the boot time increase. But the latency and memory bandwidth drops still seem too big to me.

mmisono commented 3 months ago

I found a 6.8 kernel to test. The results look sane.

6.8 kernel

Intel(R) Memory Latency Checker - v3.11a
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0         132.8

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      74972.8
3:1 Reads-Writes :      107786.3
2:1 Reads-Writes :      115812.5
1:1 Reads-Writes :      114065.0
Stream-triad like:      96318.2
RAMspeed SMP 3.5.0:
    pts/ramspeed-1.4.3 [Type: Add - Benchmark: Integer]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 2 Minutes [14:26 UTC]
        Started Run 1 @ 14:24:27
        Started Run 2 @ 14:24:58
        Started Run 3 @ 14:25:28

    Type: Add - Benchmark: Integer:
        73552.51
        74612.37
        72918.95

    Average: 73694.61 MB/s
    Deviation: 1.16%

mdroth commented 3 months ago

The snp-latest/snp-host-latest branches have an in-development patch to enable support for 2MB THP pages. This patch is not yet upstream in 6.11-rc1, so you're essentially comparing 4K guest performance to 2MB guest performance, which I think explains much of the performance delta. Hopefully with 6.12 the THP support will be upstream as well. You can try applying the following patch on top of 6.11-rc1 to confirm whether or not this accounts for the differences you are seeing versus earlier versions of the snp-host-latest kernel used by this repo:

https://github.com/mdroth/linux/commit/d641eb88f61b57fe9a4522ea8eb1865fcb727d6e https://github.com/mdroth/linux/commits/snp-thp-611rc1/

Additionally, the QEMU patches that went upstream now default to setting all guest memory private via the KVM_SET_MEMORY_ATTRIBUTES KVM ioctl. This allows optimizations for avoiding lots of KVM_SET_MEMORY_ATTRIBUTES calls during guest run-time. The downside is that initializing that data structure that tracks the state of all those pages will take a long time for a larger guest. This has already been discussed with maintainers and we have plans on optimizing that data structure to reduce the guest startup time for these cases. It's less clear when this optimization will be implemented since it potentially overlaps with other developments, but that's also tentatively aimed at 6.12.
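
If you want to observe this from the host side, one rough way is to trace the ioctls QEMU issues while the guest is being set up (a sketch; it assumes your strace is new enough to decode this ioctl name, otherwise it only prints the raw request number, and <your-qemu-launch-command> stands for however you normally start the guest):

# time the KVM_SET_MEMORY_ATTRIBUTES ioctl(s) issued during guest setup (sketch)
# -T prints the time spent in each syscall, -f follows QEMU's threads/children
strace -f -T -e trace=ioctl -o /tmp/qemu-ioctls.log <your-qemu-launch-command>
grep KVM_SET_MEMORY_ATTRIBUTES /tmp/qemu-ioctls.log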

It's great to have some analysis of the upstream code, but for the above reasons we'd recommend sticking with snp-host-latest branch for any sort of performance comparisons until these other bits are upstream.

mdroth commented 3 months ago

I forgot to mention that with that THP patch applied, guests will still default to 4K pages, as they would with current upstream. To enable THP, you need to set a KVM module parameter beforehand:

# enable 2MB THP pages
echo 1 >/sys/module/kvm/parameters/gmem_2m_enabled
# disable 2MB THP pages
echo 0 >/sys/module/kvm/parameters/gmem_2m_enabled

This is mainly to allow for easier performance comparisons and testing while the support is still in development. In the future that module parameter will likely no longer be needed.
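
If you want the setting to persist across reboots, a standard module option should also work (a sketch; it assumes kvm is built as a loadable module, and if it is built in you would pass kvm.gmem_2m_enabled=1 on the kernel command line instead):

# persist the experimental 2MB gmem setting across reboots (sketch; file name is arbitrary)
echo "options kvm gmem_2m_enabled=1" > /etc/modprobe.d/kvm-gmem.conf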

mmisono commented 3 months ago

@mdroth Thanks for the info. Unfortunately, the 6.11 kernel does not work in my environment for now because ZFS has not caught up with it yet, but I'll try later.

The latest upstream version guest_memfd (which the SNP KVM support relies on) no longer supports 2MB hugepages for backing guests. There are discussions on how best to re-enable this support

Could you point me to the discussion thread if you know it? I was wondering what the reason behind the design is.

It's great to have some analysis of the upstream code, but for the above reasons we'd recommend sticking with snp-host-latest branch for any sort of performance comparisons until these other bits are upstream.

By the way, just to make sure: I observe this performance issue on the current snp-latest (6.9.0-rc7) but not on the previous 6.8. I think I'll use the 6.8 kernel for the time being.

mdroth commented 3 months ago

Even with the current 6.9-based snp-host-latest, you still need to explicitly enable 2MB hugepage support:

# enable 2MB THP pages
echo 1 >/sys/module/kvm/parameters/gmem_2m_enabled
# disable 2MB THP pages
echo 0 >/sys/module/kvm/parameters/gmem_2m_enabled

This was done so it matches upstream behavior by default, while still allowing THP to be enabled as an experimental performance option.
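
As a quick sanity check before launching the guest, you can confirm the parameter is present and enabled (a sketch):

# should print 1 once the experimental 2MB gmem support is enabled
cat /sys/module/kvm/parameters/gmem_2m_enabled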

Most of the original discussion behind temporarily backing out the THP support happened in this thread:

https://lore.kernel.org/kvm/20231027182217.3615211-18-seanjc@google.com/#r

The THP patch I mentioned (https://github.com/mdroth/linux/commit/d641eb88f61b57fe9a4522ea8eb1865fcb727d6e) addresses some of the issues brought up in that discussion. I'm not aware of any other discussions around THP other than the ones we've had during the weekly PUCK (Periodic Upstream Call for KVM) call with the KVM maintainers. The KVM maintainers are aware of the approach we are taking in that patch, and that we will be pushing to get something similar upstream.

mmisono commented 3 months ago

I applied the patch (https://github.com/mdroth/linux/commit/d641eb88f61b57fe9a4522ea8eb1865fcb727d6e.patch) to the current kvm-next (332d2c1d7), and enabled /sys/module/kvm/parameters/gmem_2m_enabled.
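
Roughly, the steps were as follows (a sketch; paths and the patch file name are from my setup):

# apply the 2MB THP patch on top of kvm-next (332d2c1d7) and enable the parameter (sketch)
cd linux
curl -LO https://github.com/mdroth/linux/commit/d641eb88f61b57fe9a4522ea8eb1865fcb727d6e.patch
git am d641eb88f61b57fe9a4522ea8eb1865fcb727d6e.patch
# rebuild and install the kernel, reboot into it, then:
echo 1 > /sys/module/kvm/parameters/gmem_2m_enabled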

Now I get performance similar to before.

Set the number of hugepages to 4000
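
For reference, the hugepages were reserved roughly like this (a sketch; the exact method used is assumed):

# reserve 4000 hugepages on the host (sketch; a persistent setup could use sysctl or a boot parameter instead)
echo 4000 > /proc/sys/vm/nr_hugepages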

Intel(R) Memory Latency Checker - v3.11a
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0         132.2

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      75534.6
3:1 Reads-Writes :      107652.5
2:1 Reads-Writes :      115554.7
1:1 Reads-Writes :      113854.6
Stream-triad like:      95657.4
RAMspeed SMP 3.5.0:
    pts/ramspeed-1.4.3 [Type: Add - Benchmark: Integer]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 2 Minutes [16:36 UTC]
        Started Run 1 @ 16:35:13
        Started Run 2 @ 16:35:44
        Started Run 3 @ 16:36:14

    Type: Add - Benchmark: Integer:
        73924.06
        71808.61
        72051.35

    Average: 72594.67 MB/s
    Deviation: 1.59%

Enabling gmem_2m_enabled also seems to reduce the boot time.

mmisono commented 3 months ago

Hi @mdroth

Thank you very much for the pointer to the thread. This is really useful. I also observed an increase in boot time on the latest kernel even with gmem_2m_enabled=1. For example, booting an SNP VM with 256 GB of memory and 8 vCPUs takes 21 seconds on the 6.10 kernel but only 10 seconds on the 6.8 kernel (with prealloc=off). And I observed that most of the time is spent in QEMU before the vCPUs start. I think this text explains the main reason:

Additionally, the QEMU patches that went upstream now default to setting all guest memory private via the KVM_SET_MEMORY_ATTRIBUTES KVM ioctl. This allows optimizations for avoiding lots of KVM_SET_MEMORY_ATTRIBUTES calls during guest run-time. The downside is that initializing that data structure that tracks the state of all those pages will take a long time for a larger guest. This has already been discussed with maintainers and we have plans on optimizing that data structure to reduce the guest startup time for these cases. It's less clear when this optimization will be implemented since it potentially overlaps with other developments, but that's also tentatively aimed at 6.12.

Do you also know of a discussion thread regarding this? Also, I was wondering how much having many KVM_SET_MEMORY_ATTRIBUTES calls affects performance during runtime.