simondeziel opened 1 month ago
Since Linux kernel 4.0, `MAX_TAP_QUEUES` has been set to `DEFAULT_MAX_NUM_RSS_QUEUES`, which is capped at 256 (see here). This limit is unchanged as of kernel 6.5. In situations where we want more than 256 vCPUs, I'd suggest using the maximum number of queues (i.e. 256) and dropping the `unix.IFF_ONE_QUEUE` flag to allow multiple vCPUs to share queues. The flags for `unix.TUNSETIFF` would then look like `flags := unix.IFF_TAP | unix.IFF_NO_PI | unix.IFF_MULTI_QUEUE | unix.IFF_VNET_HDR`. We should then have:
```go
configureQueues := func(cpuCount int) int {
	// Number of queues matches the number of vCPUs, with a minimum of two
	// and a maximum of 256 (the kernel's MAX_TAP_QUEUES cap).
	queueCount := cpuCount
	if queueCount < 2 {
		queueCount = 2
	}

	if queueCount > 256 {
		queueCount = 256
	}

	// Number of vectors is number of queues * 2 (RX/TX) + 2 (config/control MSI-X).
	vectors := 2*queueCount + 2
	if vectors > 0 {
		qemuDev["mq"] = "on"
		if shared.ValueInSlice(busName, []string{"pcie", "pci"}) {
			qemuDev["vectors"] = strconv.Itoa(vectors)
		}
	}

	return queueCount
}
```
and
```go
devFile := func(cpus int) (*os.File, error) {
	revert := revert.New()
	defer revert.Fail()

	f, err := os.OpenFile("/dev/net/tun", os.O_RDWR, 0)
	if err != nil {
		return nil, err
	}

	revert.Add(func() { _ = f.Close() })

	ifr, err := unix.NewIfreq(nicName)
	if err != nil {
		return nil, fmt.Errorf("Error creating new ifreq for %q: %w", nicName, err)
	}

	// Beyond 256 vCPUs, drop IFF_ONE_QUEUE so that vCPUs can share queues.
	if cpus > 256 {
		ifr.SetUint16(unix.IFF_TAP | unix.IFF_NO_PI | unix.IFF_MULTI_QUEUE | unix.IFF_VNET_HDR)
	} else {
		ifr.SetUint16(unix.IFF_TAP | unix.IFF_NO_PI | unix.IFF_ONE_QUEUE | unix.IFF_MULTI_QUEUE | unix.IFF_VNET_HDR)
	}

	// Point the file handle at the requested NIC interface.
	err = unix.IoctlIfreq(int(f.Fd()), unix.TUNSETIFF, ifr)
	if err != nil {
		return nil, fmt.Errorf("Error getting TAP file handle for %q: %w", nicName, err)
	}

	revert.Success()
	return f, nil
}
```
Maybe it's a naive solution. Maybe it doesn't even work (I don't have that kind of hardware for testing, ah). But maybe it's worth a try... @simondeziel @mihalicyn thoughts?
> Maybe it's a naive solution. Maybe it doesn't even work (I don't have that kind of hardware for testing, ah).

`tf-reserve hoodin` should get you such HW ;)
@simondeziel @mihalicyn OK, after trying that (`lxc config set v1 limits.cpu=257 && lxc start v1`), the VM seems to boot and is running according to LXD (I get no error and the instance is marked as RUNNING). But the truth is that the underlying QEMU emulation has crashed: trying to exec a command in the instance always returns `Error: LXD VM agent isn't currently running`. After investigating the instance's qemu.log, I realized it crashed with a very cryptic message:
```
KVM internal error. Suberror: 1
extra data[0]: 0x0000000000000000
extra data[1]: 0x0000000000000400
extra data[2]: 0x0000000100000014
extra data[3]: 0x00000000000b0000
extra data[4]: 0x0000000000000000
extra data[5]: 0x0000000000000000
emulation failure
RAX=0000000000000000 RBX=0000000000069aba RCX=0000000000000100 RDX=0000000000000011
RSI=0000000000087000 RDI=000000003ea38810 RBP=00000011ffffffe8 RSP=00000011ffffffb0
R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
RIP=00000000000b0000 RFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
FS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
GS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
TR =0000 0000000000000000 0000ffff 00008300 DPL=0 Reserved
GDT= 000000003f1dc000 00000047
IDT= 000000003eaf9018 00000fff
CR0=80000013 CR2=0000000000000000 CR3=000000003f401000 CR4=00000020
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d00
Code=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <ff> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
```
`lxc monitor` did not really help either. It actually shows that everything is OK:
```
Scheduler: network: tap10b7607f has been added: updating network priorities
DEBUG [2024-05-27T11:36:40Z] Starting device device=root instance=v1 instanceType=virtual-machine project=default type=disk
DEBUG [2024-05-27T11:36:40Z] UpdateInstanceBackupFile started driver=zfs instance=v1 pool=new_default project=default
DEBUG [2024-05-27T11:36:40Z] UpdateInstanceBackupFile finished driver=zfs instance=v1 pool=new_default project=default
DEBUG [2024-05-27T11:36:40Z] Skipping unmount as in use driver=zfs pool=new_default refCount=1 volName=v1
DEBUG [2024-05-27T11:36:40Z] QMP monitor started path=/var/snap/lxd/common/lxd/logs/v1/qemu.monitor
DEBUG [2024-05-27T11:36:57Z] Scheduler: virtual-machine v1 started: re-balancing
INFO [2024-05-27T11:36:57Z] Action: instance-restarted, Source: /1.0/instances/v1
DEBUG [2024-05-27T11:36:57Z] Start finished instance=v1 instanceType=virtual-machine project=default stateful=false
DEBUG [2024-05-27T11:36:57Z] onStop hook finished instance=v1 instanceType=virtual-machine project=default target=reboot
DEBUG [2024-05-27T11:36:57Z] Instance operation lock finished action=restart err="<nil>" instance=v1 project=default reusable=false
```
@tomponline do you have any thoughts on how to proceed?
As this isn't a roadmap item and is not currently assigned to a milestone, I would suggest leaving this issue for now and tackling the high-priority roadmap items and any unresolved milestone bugs first. Unless you have been specifically asked to work on the issue by your manager, of course.
I'll leave it then as this was just out of curiosity.
Looks like a bug in QEMU. At some point we should check what happens with this on newer versions and report/fix it.
In https://github.com/canonical/lxd/issues/13186#issuecomment-2014815449, it was identified that at 257 vCPUs the VM fails to start because LXD cannot set up the bridged NIC using `unix.TUNSETIFF`. This 256 vCPU maximum is below QEMU's own limit of 288 vCPUs.