canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

NIC acceleration configuration method prevents the use of bonds and VF-LAG #12233

Open fnordahl opened 1 year ago

fnordahl commented 1 year ago

LXD supports configuring NIC acceleration for cards that support switchdev mode and OVS hardware offload.

However, the current method of discovering which PF to allocate resources from prevents putting the PFs in a bond and making use of the VF-LAG feature: https://github.com/canonical/lxd/blob/227bc5cd75ebf2ba7b6c881f209ed5c2e6640f9b/lxd/network/network_utils_sriov.go#L385-L387

The current functionality is documented here: https://github.com/canonical/lxd/blob/227bc5cd75ebf2ba7b6c881f209ed5c2e6640f9b/doc/reference/devices_nic.md?plain=1#L209-L216
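For context, a minimal standalone sketch of what that bridge-based discovery amounts to (this is a paraphrase for illustration, not the actual LXD code; the only assumptions are `ovs-vsctl list-ports` and the standard SR-IOV sysfs layout):

```go
// pf-from-bridge: sketch of the current discovery approach, where the PF to
// allocate VFs from is found by looking at which physical port is attached
// to the OVN integration bridge. Not the actual LXD implementation.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// pfFromBridge returns the first port on the given OVS bridge that exposes
// SR-IOV VFs, mirroring the "PF must be a port of br-int" assumption.
func pfFromBridge(bridge string) (string, error) {
	out, err := exec.Command("ovs-vsctl", "list-ports", bridge).Output()
	if err != nil {
		return "", err
	}
	for _, port := range strings.Fields(string(out)) {
		// A PF advertises its VF capacity via sysfs.
		if _, err := os.Stat("/sys/class/net/" + port + "/device/sriov_totalvfs"); err == nil {
			return port, nil
		}
	}
	return "", fmt.Errorf("no SR-IOV capable port found on bridge %q", bridge)
}

func main() {
	pf, err := pfFromBridge("br-int")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("would allocate VFs from:", pf)
}
```

Once the PF is enslaved to a bond it can no longer be a port of br-int, so a lookup of this shape comes up empty.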

We need to replace this with some sort of configuration option that can be set per target when in a cluster.

At a high level, the expected way to use this would be as follows.

The netplan configuration could be expressed like this:

```yaml
network:
    bonds:
        bond0:
            interfaces:
            - enp9s0f0np0
            - enp9s0f1np1
            macaddress: 08:c0:eb:81:6b:78
            parameters:
                mode: 802.3ad
                ...
    bridges:
        br-bond0:
            addresses:
            - 192.0.2.10/24
            interfaces:
            - bond0
            macaddress: 08:c0:eb:81:6b:78
            openvswitch: {}
    ethernets:
        enp9s0f0np0:
            match:
                macaddress: 08:c0:eb:81:6b:78
            set-name: enp9s0f0np0
            virtual-function-count: 32
            embedded-switch-mode: switchdev
            delay-virtual-functions-rebind: true
        enp9s0f1np1:
            match:
                macaddress: 08:c0:eb:81:6b:79
            set-name: enp9s0f1np1
            virtual-function-count: 32
            embedded-switch-mode: switchdev
            delay-virtual-functions-rebind: true
```
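With a configuration like the above, the PFs can no longer be found via br-int; they would instead have to be resolved from the bond. A rough sketch of what that resolution would have to do (assuming only the standard bonding sysfs layout):

```go
// pfs-from-bond: sketch of resolving the PFs behind a bond via sysfs, which
// is roughly the lookup LXD would need instead of inspecting br-int ports.
package main

import (
	"fmt"
	"os"
	"strings"
)

// pfsFromBond returns the bond's slave interfaces that expose SR-IOV VFs.
func pfsFromBond(bond string) ([]string, error) {
	data, err := os.ReadFile("/sys/class/net/" + bond + "/bonding/slaves")
	if err != nil {
		return nil, err
	}
	var pfs []string
	for _, slave := range strings.Fields(string(data)) {
		if _, err := os.Stat("/sys/class/net/" + slave + "/device/sriov_totalvfs"); err == nil {
			pfs = append(pfs, slave)
		}
	}
	return pfs, nil
}

func main() {
	pfs, err := pfsFromBond("bond0")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("PFs behind bond0:", pfs) // e.g. [enp9s0f0np0 enp9s0f1np1]
}
```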
tomponline commented 1 year ago

> We need to replace this with some sort of configuration option that can be set per target when in a cluster.

Could you elaborate a bit on how you envisage this working? Instance NICs only ever start on a single cluster member/host at a time, although they can be migrated between hosts.

I'm not quite following what is changing in the circumstance you describe?

Could you give an example of a current accelerated LXD NIC device and highlight which parts are introducing the issue?

Thanks

fnordahl commented 1 year ago

> We need to replace this with some sort of configuration option that can be set per target when in a cluster.
>
> Could you elaborate a bit on how you envisage this working? Instance NICs only ever start on a single cluster member/host at a time, although they can be migrated between hosts.
>
> I'm not quite following what is changing in the circumstance you describe?
>
> Could you give an example of a current accelerated LXD NIC device and highlight which parts are introducing the issue?

This is the part that causes the issue: https://github.com/canonical/lxd/blob/227bc5cd75ebf2ba7b6c881f209ed5c2e6640f9b/doc/reference/devices_nic.md?plain=1#L214

enp129s0f0np0 can't both be part of bond0 and be added to br-int at the same time.

So the root of the problem is that LXD expects the user to put the PF into br-int, and uses that membership to identify which PF to allocate VFs from.

To exemplify further: what if you wanted to use resources from both PFs? Would you put both enp129s0f0np0 and enp129s0f1np1 into the same bridge?

Incidentally, because the default configuration for br-int sets fail_mode: secure, you may avoid a network loop, but had you done that with any other bridge it would probably not go well.

PF selection needs to move somewhere else.

tomponline commented 1 year ago

> To exemplify further: what if you wanted to use resources from both PFs? Would you put both enp129s0f0np0 and enp129s0f1np1 into the same bridge?

Yes I vaguely recall that was the original thinking.

What do you think should change in LXD? Are you thinking of a NIC device config setting that specifies the acceleration.parent, or something like that?

fnordahl commented 1 year ago

> What do you think should change in LXD? Are you thinking of a NIC device config setting that specifies the acceleration.parent, or something like that?

I assume you are referring to the profile/instance configuration now, and something like acceleration.parent makes sense.

I wonder what it should refer to though. Individual nodes of a cluster may not have the exact same physical configuration, so the parent interface name may differ from host to host.

For the bond case, I guess one could call the bond whatever, so we could mandate the operator use a uniform name.

For the non-bond case though it might be more difficult.

I see that many parts of the LXD documentation refer to machine-specific commands: https://github.com/canonical/lxd/blob/227bc5cd75ebf2ba7b6c881f209ed5c2e6640f9b/doc/howto/network_ovn_setup.md?plain=1#L131

Would one way be to create some type of network that maps per-machine interface names to a "physical network", and then refer to that in acceleration.parent?

tomponline commented 1 year ago

Ah, in that case it would likely need to be part of a member-specific config on the ovn network itself, or perhaps the uplink network's configuration:

https://documentation.ubuntu.com/lxd/en/latest/reference/network_ovn/#configuration-options
https://documentation.ubuntu.com/lxd/en/latest/reference/network_physical/#configuration-options

This could then be used by the ovn NIC device when starting up.
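To make the idea concrete, here is a rough sketch of how the NIC start-up path could resolve the parent from such a member-specific key. The key name acceleration.parent and the flat config layout are assumptions for illustration, not an existing LXD option:

```go
// resolve-acceleration-parent: sketch of picking the PF/bond to allocate VFs
// from based on a hypothetical per-member network config key, rather than by
// inspecting which port sits in br-int. Key names are illustrative only.
package main

import "fmt"

// resolveAccelerationParent looks up the parent for the local cluster member,
// falling back to a cluster-wide default if no member-specific value is set.
func resolveAccelerationParent(netConfig map[string]string, member string) (string, error) {
	if parent, ok := netConfig["acceleration.parent."+member]; ok {
		return parent, nil
	}
	if parent, ok := netConfig["acceleration.parent"]; ok {
		return parent, nil
	}
	return "", fmt.Errorf("no acceleration parent configured for member %q", member)
}

func main() {
	cfg := map[string]string{
		"acceleration.parent":       "bond0",         // hypothetical cluster-wide default
		"acceleration.parent.node3": "enp129s0f0np0", // hypothetical member-specific override
	}
	for _, member := range []string{"node1", "node3"} {
		parent, _ := resolveAccelerationParent(cfg, member)
		fmt.Printf("%s -> allocate VFs from %s\n", member, parent)
	}
}
```

This keeps PF selection out of the bridge topology entirely, so the PFs are free to sit in a bond for VF-LAG.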

tomponline commented 1 year ago

@fnordahl I've marked this as blocked, as we no longer have access to hardware to develop/test a fix for this.

If this is something you could help us with that would be appreciated.

tomponline commented 1 year ago

@fnordahl as discussed in the meeting, we'll get a partner cloud set up to work on this issue.