Closed: markrattray closed this issue 1 year ago
I have tried disabling tdp_mmu, and the ISO still freezes quickly during boot:
https://github.com/canonical/lxd/issues/11520
The server does have a NIC bond enabled, which the macvlan interfaces are using, so I will try disabling that and see if it helps.
Disabling the NIC bond didn't help.
Well, I've discovered that specifying CPU passthrough in raw.qemu gets around this issue: I'm actually able to get to the Install Windows screen, and all the way through to a working VM.
Here is some more info from the physical host:
somehost:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
Stepping: 4
BogoMIPS: 4800.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 1.3 MiB (40 instances)
L1i: 1.3 MiB (40 instances)
L2: 40 MiB (40 instances)
L3: 55 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
Vulnerabilities:
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Meltdown: Mitigation; PTI
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Retbleed: Mitigation; IBRS
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
tdp_mmu disablement is still needed
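To confirm what the host is currently doing, the kvm module parameter can be inspected directly (a minimal check, assuming the kvm module is loaded):

```shell
# Prints Y when the TDP MMU is enabled, N when it has been disabled
cat /sys/module/kvm/parameters/tdp_mmu
```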
What did you set raw.qemu to to get it to work?
Hi,
Each WS2022 instance on the Dell R740xd with the Intel 6148 CPU requires raw.qemu: -cpu host to stop the ISO (original or distrobuilder) or a new instance from an image freezing a few seconds into boot.
This isn't needed for the WS2019 ISO on this host; I'm able to deploy a WS2019 instance via ISO (distrobuilder) without it.
Older hosts with Intel E5-2680 v2 do not need this with WS2022 instances.
For any of our physical hosts so far, the Intel PET flag needs to be disabled to stop the frequent random freezing, using the modprobe method:
https://pve.proxmox.com/wiki/Upgrade_from_6.x_to_7.0#Older_Hardware_and_New_5.15_Kernel
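For reference, the modprobe method from that wiki page looks roughly like this (a sketch; the .conf file name is arbitrary, any file under /etc/modprobe.d/ works):

```shell
# Runtime: disable the TDP MMU until the next reboot
echo N | sudo tee /sys/module/kvm/parameters/tdp_mmu

# Persistent: set the kvm module option so it survives reboots
echo "options kvm tdp_mmu=N" | sudo tee /etc/modprobe.d/kvm-tdp.conf

# Rebuild the initramfs so the option applies at early boot
sudo update-initramfs -u
```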
Do you think it's possible to set this per instance using raw.qemu, rather than on the host? I've not been able to find anything, and what I tried was incorrect and the VM wouldn't start.
I don't know whether this is needed for WS2019 because I'm doing a clean migration to new WS2022 instances, with new domains and conventions.
The Proxmox forum has a lot more posts on WS2022, and your link to one of their posts via another GitHub LXD issue led me to the tdp_mmu work-around.
What do you mean by "per-instance", as raw.qemu is a per-instance setting?
In the instance config.
Yes, you can do that using lxc config set <instance> raw.qemu=...
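For example (the instance name win2022 is just a placeholder):

```shell
# Append extra QEMU command-line arguments for this instance only
lxc config set win2022 raw.qemu="-cpu host"

# The change takes effect the next time the VM starts
lxc restart win2022
```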
Sorry, my question was whether the tdp_mmu disablement could be done via raw.qemu as well. I did try:
raw.qemu: -cpu host,kvm.tdp_mmu=N
and also:
raw.qemu: -cpu host,kvm.tdp_mmu=off
but the VMs would not start with the kvm.tdp_mmu key.
Looks like this tdp_mmu issue will be fixed in kernel 6.2, according to this:
https://gitlab.com/qemu-project/qemu/-/issues/1198
but there is more to it, as I also need raw.qemu: -cpu host on another processor.
My understanding is that LXD uses the equivalent of -cpu host by default anyway, so it's odd that you're needing to pass it explicitly.
I suspect it's one of the extensions that is being disabled by doing that.
Can you try setting lxc config set <instance> migration.stateful=true and see if that helps, as that disables some of the extensions. Possibly hv_passthrough or topoext.
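As a minimal sketch of that suggestion (win2022 is a placeholder instance name):

```shell
# Enable stateful migration support, which also disables some CPU extensions
lxc config set win2022 migration.stateful=true

# Verify the setting took effect
lxc config get win2022 migration.stateful
```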
LXD currently doesn't support custom CPU flags.
Well only by raw.qemu ;)
Since the raw.qemu flags are appended, will raw.qemu="-cpu host,kvm.tdp_mmu=N" override the fixed -cpu host flag? I don't know how QEMU handles duplicate flags.
Trying your suggestions @tomponline
@tomponline already better with the ISO. Got to the first screen where you choose the language and input. Couldn't get this far without that -cpu host setting. Now installing as well.
@tomponline that worked, WS2022 desktop installed and now running Windows Update to put some stress on it.
I will try an existing LXD image which has been working since finding raw.qemu: -cpu host.
@monstermunchkin I'm only setting raw.qemu: -cpu host to get the WS2022 ISO/instances to boot; without it they freeze a few seconds into the boot process on this CPU. The kvm.tdp_mmu append was an experiment to see whether I could disable the feature/flag at the instance level instead of the host level. I think I'm way off, but it was worth a try.
@tomponline all seems fine with the setting migration.stateful=true on both ISO and instance.
I think I'm beginning to understand what's going on.
Apparently the tdp_mmu feature relates to Intel EPT, not PET as I had got mixed up with.
ept (lowercase) is listed twice in the Flags section of the lscpu output above.
So I tried setting raw.qemu: -cpu host,ept=off and got this when trying to start the VM:
qemu-system-x86_64: can't apply global host-x86_64-cpu.ept=off: Property 'host-x86_64-cpu.ept' not found
So host-x86_64-cpu is the QEMU CPU, and by setting CPU passthrough (-cpu host) Windows is getting the correct set of flags. Unfortunately this does not solve the EPT feature crashes that come later when the VM is up. I gather that what Thomas suggested disables a set of QEMU CPU flags (properties) that are the problem in this scenario, and allows the VM to boot as well.
For the tdp_mmu / EPT feature, this is more tricky because I am now passing the host CPU flags, which contain the EPT ones, and would need the QEMU CPU to disable these features. TBH the kernel 6.2 option seems to be the best one for this, because the current 5.15 and 5.19 ones don't help: https://gitlab.com/qemu-project/qemu/-/issues/1198
Is there anything left to do on this issue or can it be closed now?
Hi,
Well, I'm not certain, sorry...
On the one hand, I see that you good dev folks have the impression that -cpu host is being passed through, but perhaps it's not, because on this CPU I cannot get WS2022 to boot without specifying it, or your migration.stateful suggestion, in the instance config.
On the other, I see that these could be QEMU and kernel bugs that need to be dealt with upstream.
It depends really on whether you want to lift some stones to see whether this will come back to bite you somewhere else more important, e.g. with a big customer.
For me, I have the two workarounds we've discovered, so my need is catered for. Up to you.
Hope you all have a great weekend!
Yes it is passed through here:
https://github.com/canonical/lxd/blob/main/lxd/instance/drivers/driver_qemu.go#L1396-L1413
Given that setting migration.stateful fixes it, it's likely to be this one that was causing problems:
https://github.com/canonical/lxd/blob/main/lxd/instance/drivers/driver_qemu.go#L1401
Thanks for the info. So if I understand that correctly, the migration.stateful feature only disables the hv_time flag?
Regardless, this can be closed now.
Yes, that's correct, so perhaps your manually passing the host CPU flag also wiped it out.
I've got similar issues with running Windows on my CPU (Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz).
I found this helped:
echo N | sudo tee /sys/module/kvm/parameters/tdp_mmu
Good afternoon. Thanks for the info. Just repeating to collate, in case someone's reading only this part:
raw.qemu -cpu host or your instance config suggestion migration.stateful to get the WS2022 ISO or instance to boot.
Agreed, this is what I have to use too.
Good morning @tomponline
I see that kernel 6.2 has been released to Ubuntu 22.04 yesterday which might make this TDP workaround redundant. I cannot test in the next week because I'm taking a break but will plan to try thereafter.
Hope you have a great weekend!
Good morning @tomponline
On all LXD hosts I have upgraded to kernel 6.2 via the 22.04 HWE stack and re-enabled tdp_mmu. It's only been a day, so a bit soon to tell, although normally we'd have seen some WS2022 VMs fatally stopping by now.
CPU flags are still an issue on the Dell R740xd host with the Intel 6148 CPU, so one of the following instance configs is still required:
raw.qemu -cpu host
migration.stateful
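Either workaround can be applied per instance (win2022 is a placeholder instance name):

```shell
# Option 1: explicit CPU passthrough via extra QEMU arguments
lxc config set win2022 raw.qemu="-cpu host"

# Option 2: stateful migration mode, which drops the problematic extension
lxc config set win2022 migration.stateful=true
```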
Required information
Issue description
Linux VMs, even a desktop one, are fine.
Freshly rebuilt and repurposed Dell R740xd host dedicated to LXD, now running a couple of Ubuntu containers just fine.
Tried to deploy a Windows Server 2022 VM using an image imported from another server, but it freezes during startup.
Thinking that the image had been corrupted during transit, I tried to create a completely new image by booting off a new ISO downloaded on-site directly from Microsoft, but within a few seconds of the ISO booting it just freezes as well. I have also generated an ISO via distrobuilder, with the same freeze result.
Installed linux-image-5.15.0-76-generic on the host and booted into it, but no difference. This kernel version is currently working with another host and its Windows VMs.
Steps to reproduce
lxc init ws2022std-image-template --empty --vm -c limits.cpu=4 -c limits.memory=6GiB -c security.secureboot=false -d root,size=60GiB -d winiso,boot.priority=10,source=/iso/ws2022std-lxd.iso,type=disk
lxc start ws2022std-image-template
lxc list
status will show error
Information to attach
dmesg
lxc info NAME --show-log
lxc config show NAME --expanded
lxc monitor (while reproducing the issue): lxd-issues_xxxx_lxc-monitor.txt