simondeziel opened 1 month ago
Since Linux kernel 4.0, `MAX_TAP_QUEUES` has been set to `DEFAULT_MAX_NUM_RSS_QUEUES`, which is capped at 256 (see here). This limit is unchanged as of kernel 6.5. In situations where we want more than 256 vCPUs, I'd suggest using the maximum number of queues (i.e. 256) and dropping the `unix.IFF_ONE_QUEUE` flag to allow multiple vCPUs to share queues. The flags for `unix.TUNSETIFF` would then look like `flags := unix.IFF_TAP | unix.IFF_NO_PI | unix.IFF_MULTI_QUEUE | unix.IFF_VNET_HDR`. We should then have:
```go
configureQueues := func(cpuCount int) int {
	// Number of queues matches the number of vCPUs, with a minimum of two
	// and a maximum of 256 (the kernel's MAX_TAP_QUEUES cap).
	queueCount := cpuCount
	if queueCount < 2 {
		queueCount = 2
	}

	if queueCount > 256 {
		queueCount = 256
	}

	// Number of vectors is number of queues * 2 (RX/TX) + 2 (config/control MSI-X).
	vectors := 2*queueCount + 2
	if vectors > 0 {
		qemuDev["mq"] = "on"
		if shared.ValueInSlice(busName, []string{"pcie", "pci"}) {
			qemuDev["vectors"] = strconv.Itoa(vectors)
		}
	}

	return queueCount
}
```
and
```go
devFile := func(cpus int) (*os.File, error) {
	revert := revert.New()
	defer revert.Fail()

	f, err := os.OpenFile("/dev/net/tun", os.O_RDWR, 0)
	if err != nil {
		return nil, err
	}

	revert.Add(func() { _ = f.Close() })

	ifr, err := unix.NewIfreq(nicName)
	if err != nil {
		return nil, fmt.Errorf("Error creating new ifreq for %q: %w", nicName, err)
	}

	// Beyond 256 vCPUs, drop IFF_ONE_QUEUE so that vCPUs can share queues.
	if cpus > 256 {
		ifr.SetUint16(unix.IFF_TAP | unix.IFF_NO_PI | unix.IFF_MULTI_QUEUE | unix.IFF_VNET_HDR)
	} else {
		ifr.SetUint16(unix.IFF_TAP | unix.IFF_NO_PI | unix.IFF_ONE_QUEUE | unix.IFF_MULTI_QUEUE | unix.IFF_VNET_HDR)
	}

	// Point the file handle at the requested NIC interface.
	err = unix.IoctlIfreq(int(f.Fd()), unix.TUNSETIFF, ifr)
	if err != nil {
		return nil, fmt.Errorf("Error getting TAP file handle for %q: %w", nicName, err)
	}

	revert.Success()
	return f, nil
}
```
Maybe it's a naive solution. Maybe it doesn't even work (I don't have that kind of hardware for testing, ah). But maybe it's worth a try... @simondeziel @mihalicyn thoughts?
> Maybe it's a naive solution. Maybe it doesn't even work (I don't have that kind of hardware for testing, ah).

`tf-reserve hoodin` should get you such HW ;)
@simondeziel @mihalicyn OK, after trying that (`lxc config set v1 limits.cpu=257 && lxc start v1`), the VM seems to boot and is running according to LXD (I get no error and the instance is marked as RUNNING). But the truth is that the underlying QEMU emulation has crashed: trying to exec a command in the instance always returns `Error: LXD VM agent isn't currently running`. After investigating the instance's qemu.log, I realized it crashed with a very cryptic message:
```
KVM internal error. Suberror: 1
extra data[0]: 0x0000000000000000
extra data[1]: 0x0000000000000400
extra data[2]: 0x0000000100000014
extra data[3]: 0x00000000000b0000
extra data[4]: 0x0000000000000000
extra data[5]: 0x0000000000000000
emulation failure
RAX=0000000000000000 RBX=0000000000069aba RCX=0000000000000100 RDX=0000000000000011
RSI=0000000000087000 RDI=000000003ea38810 RBP=00000011ffffffe8 RSP=00000011ffffffb0
R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
RIP=00000000000b0000 RFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
FS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
GS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
TR =0000 0000000000000000 0000ffff 00008300 DPL=0 Reserved
GDT= 000000003f1dc000 00000047
IDT= 000000003eaf9018 00000fff
CR0=80000013 CR2=0000000000000000 CR3=000000003f401000 CR4=00000020
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d00
Code=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <ff> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
```
`lxc monitor` did not really help either. It actually shows that everything is OK:
```
Scheduler: network: tap10b7607f has been added: updating network priorities
DEBUG [2024-05-27T11:36:40Z] Starting device device=root instance=v1 instanceType=virtual-machine project=default type=disk
DEBUG [2024-05-27T11:36:40Z] UpdateInstanceBackupFile started driver=zfs instance=v1 pool=new_default project=default
DEBUG [2024-05-27T11:36:40Z] UpdateInstanceBackupFile finished driver=zfs instance=v1 pool=new_default project=default
DEBUG [2024-05-27T11:36:40Z] Skipping unmount as in use driver=zfs pool=new_default refCount=1 volName=v1
DEBUG [2024-05-27T11:36:40Z] QMP monitor started path=/var/snap/lxd/common/lxd/logs/v1/qemu.monitor
DEBUG [2024-05-27T11:36:57Z] Scheduler: virtual-machine v1 started: re-balancing
INFO [2024-05-27T11:36:57Z] Action: instance-restarted, Source: /1.0/instances/v1
DEBUG [2024-05-27T11:36:57Z] Start finished instance=v1 instanceType=virtual-machine project=default stateful=false
DEBUG [2024-05-27T11:36:57Z] onStop hook finished instance=v1 instanceType=virtual-machine project=default target=reboot
DEBUG [2024-05-27T11:36:57Z] Instance operation lock finished action=restart err="<nil>" instance=v1 project=default reusable=false
```
@tomponline do you have any thoughts on how to proceed?
As this isn't a roadmap item and is not currently assigned to a milestone, I would suggest leaving this issue for now and tackling the high-priority roadmap items and any unresolved milestone bugs first. Unless you have been specifically asked to work on the issue by your manager, of course.
I'll leave it then as this was just out of curiosity.
Looks like a bug in QEMU. At some point we should check what happens with this on newer versions and report/fix it.
In https://github.com/canonical/lxd/issues/13186#issuecomment-2014815449, it was identified that at 257 vCPUs the VM fails to start because LXD cannot set up the bridged NIC using `unix.TUNSETIFF`. This 256 vCPU maximum is below QEMU's own limit of 288 vCPUs.