canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Cannot override a single device config key #9728

Closed: Kramerican closed this issue 2 years ago

Kramerican commented 2 years ago

LXD: v4.2
Host: Ubuntu Focal
Container: Ubuntu Focal

Applying a new network limit to a container by editing the profile has zero effect. Restarting the container with lxc stop/start/restart does not apply the new ingress/egress limit.

Reapplying the profile does not work either.

Changing to an entirely different profile also has no effect.

Creating a new container with the profile applies the new network limit.
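
Concretely, what we run looks roughly like this (profile and container names here are placeholders):

lxc profile device set someprofile eth0 limits.ingress 200Mbit
lxc profile device set someprofile eth0 limits.egress 200Mbit
lxc restart somecontainer                                    # limits unchanged
lxc profile assign somecontainer someotherprofile            # still unchanged
lxc launch ubuntu:focal newcontainer --profile someprofile   # new limits apply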

It is as if the network ingress/egress limits get "locked in" when the container is created. We are naming the network interface with volatile.eth0.host_name - could that be responsible?

I seem to remember seeing this working without a problem in earlier versions of LXD. Has something changed here?

Lots more details can be provided upon request; I'll leave this issue barebones for now in case this is by design or I am missing something obvious here.

stgraber commented 2 years ago

What does the 'lxc config show --expanded' output for the instance look like?

Kramerican commented 2 years ago

I can mention that I see this on v4.19 and v4.2, which is what we have in production here. I see the same behavior across many hosts.

architecture: x86_64
config:
  boot.autostart: "false"
  image.description: Ubuntu Focal LEMP PHP 8.0
  limits.cpu: "10"
  limits.cpu.allowance: 10%
  limits.kernel.memlock: "67108864"
  limits.memory: 5GB
  limits.memory.enforce: hard
  linux.kernel_modules: ip_tables,ip6_tables
  security.nesting: "true"
  volatile.base_image: 4c4e30cd2126fe01cbc63e2438fd28710f93262d7671a5f19968458b12a2f56c
  volatile.eth0.host_name: veXcheckmila
  volatile.eth0.hwaddr: 00:16:3e:89:7d:65
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
devices:
  eth0:
    host_name: veXcheckmila
    limits.egress: 800Mbit
    limits.ingress: 800Mbit
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    limits.read: 4000iops
    limits.write: 1000iops
    path: /
    pool: lxd
    size: 50GB
    type: disk
  tun:
    path: /dev/net/tun
    type: unix-char
ephemeral: false
profiles:
- largeprofile
stateful: false
description: ""
tomponline commented 2 years ago

Can you show the same command without '--expanded' flag?

tomponline commented 2 years ago

You should not modify 'volatile.eth0.host_name'; instead, set 'host_name' on the device config.

Kramerican commented 2 years ago

@tomponline the way this is set is with:

lxc config device set containername eth0 host_name=somename

So, we are not setting that key explicitly and are indeed just setting host_name on the device.

without --expanded:

architecture: x86_64
config:
  image.description: Ubuntu Focal LEMP PHP 8.0
  security.nesting: "true"
  volatile.base_image: 4c4e30cd2126fe01cbc63e2438fd28710f93262d7671a5f19968458b12a2f56c
  volatile.eth0.host_name: veXcheckmila
  volatile.eth0.hwaddr: 00:16:3e:89:7d:65
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
devices:
  eth0:
    host_name: veXcheckmila
    limits.egress: 800Mbit
    limits.ingress: 800Mbit
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
ephemeral: false
profiles:
- largeprofile
stateful: false
description: ""
stgraber commented 2 years ago

What does the 'tc qdisc show' output look like?

tomponline commented 2 years ago

This shows the limit is applied in the instance config, and not from the profile, so it will override any profile config.

To my knowledge you cannot partially override a device's config; it's all or nothing.
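
As an illustration (hypothetical instance name 'c1'): overriding any single key copies the whole device definition into the instance config, which then masks later profile edits:

lxc config device override c1 eth0 host_name=veXc1
lxc config show c1    # eth0 now appears in full in the local config, not just host_name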

webdock-io commented 2 years ago

@tomponline OK, so what you are saying is that because we set the host name individually on a container, that "locks in" the eth0 config and makes it so that profiles cannot change the limits? This should make things clearer. This is LXD v4.21:

# created a brand new profile and set some stuff
# lxc profile show testbandwidth
config:
  boot.autostart: "false"
  limits.cpu: "4"
  limits.cpu.allowance: 5%
  limits.kernel.memlock: "67108864"
  limits.memory: 2GB
  limits.memory.enforce: hard
  linux.kernel_modules: ip_tables,ip6_tables
description: ""
devices:
  eth0:
    limits.egress: 50Mbit
    limits.ingress: 50Mbit
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    limits.read: 4000iops
    limits.write: 1000iops
    path: /
    pool: lxd
    size: 25GB
    type: disk
name: testbandwidth
used_by: []

# Create a new container, set the veth hostname
lxc launch ubuntu:focal testcontainer --profile testbandwidth
lxc config device override testcontainer eth0 host_name=myvethname

# config show - note how the ingress and egress limits are listed here as well, which is what I believe you are talking about in your reply
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20211129)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20211129"
  image.type: squashfs
  image.version: "20.04"
  volatile.apply_template: create
  volatile.base_image: a8402324842148ccfcbacbc69bf251baa9703916593089f0609e8d45e3185bff
  volatile.eth0.hwaddr: 00:16:3e:c9:96:63
  volatile.idmap.base: "0"
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
devices:
  eth0:
    host_name: myvethname
    limits.egress: 50Mbit
    limits.ingress: 50Mbit
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
ephemeral: false
profiles:
- testbandwidth
stateful: false
description: ""

And indeed, after setting a new profile, the config shows the same - which makes sense, sort of - except I'd really expect the override to set just the host_name and not all the other eth0 settings:

...
devices:
  eth0:
    host_name: myvethname
    limits.egress: 50Mbit
    limits.ingress: 50Mbit
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
ephemeral: false
profiles:
- largeprofile
...

But, this is a bit of a catch-22 for me. We need to set the host name on a container-by-container basis, and by touching that we "lock in" any other settings for eth0, making it impossible for a profile change to alter anything already set. Hmm...

What would be a better way of doing this, if any, if I want to be able to just change profiles for a container? Would I need to unset the eth0 config and then apply the profile, or something?
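
Something like this is what I have in mind (untested sketch, using the names from above):

lxc config device remove testcontainer eth0                           # drop the local override so the profile's eth0 applies again
lxc config device override testcontainer eth0 host_name=myvethname    # re-override, snapshotting the new profile values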

I guess this is not a bug after all, as there is presumably some underlying reason why overriding a single key causes all the other interface settings to be transferred from the profile to the instance config?

tomponline commented 2 years ago

But, this is a bit of a catch-22 for me. We need to set the host name on a container-by-container basis, and by touching that we "lock in" any other settings for eth0, making it impossible for a profile change to alter anything already set. Hmm...

Yes, that is correct; at this time the config for a particular device is treated as a single config set and cannot be partially overridden.

See:

https://github.com/lxc/lxd/blob/04007503de66d4fc1aaa788667972f7978ce2aaf/lxd/db/profiles.go#L331-L351

@stgraber is there a particular reason why device config cannot be partially overridden? Is this perhaps due to complexities around having different device types of the same name in the profile and the overridden config?

webdock-io commented 2 years ago

@tomponline Thank you for confirming.

@stgraber It would be exceedingly great if it were possible to just override a few select properties of a device.

An "unset" / clearing option for device property overrides would be quite excellent as well.

As this is something which is not possible at the moment - how do I make sure that after a profile change, my container gets the new device settings applied? I have a hard time figuring out a sane approach here looking at the CLI options. I will experiment with a few things, but if you have a suggestion at this time I'm all ears.

stgraber commented 2 years ago

Hey there,

It's indeed quite intentional. I remember spending a few days/weeks way at the beginning of LXD thinking about that. @hallyn and @tych0 may remember me rambling about all of that while walking through the streets of Ghent, Belgium :)

I was basically debating a few models:

Some problematic situations for that last case which I had in mind:

In practice, our current approach has been working well enough, though we've certainly had some edge cases that we've had to work around. The initial one was the persistence of the MAC address: setting the hwaddr property would have required each instance to have an instance-local device, which was quite impractical, so we introduced the volatile config key space to have LXD store data in there. This has since been expanded a fair bit to store all kinds of temporary data with various levels of persistence.

Coming back to this issue, maybe we should focus on the problem you're actually trying to solve. What are you using the host_name for exactly?

In general, pinning the host_name to any value is problematic as we can't be 100% sure that any given name would be freed up properly by the kernel in a timely manner. So even if it was possible to reliably set that name somehow, you'd most likely get instances failing to restart every so often.

Maybe we can make the way we generate the host veth name configurable, so that you can choose for it to be based on the MAC or use a different prefix/template. This would be fine to put in a profile and may satisfy your needs.

webdock-io commented 2 years ago

@stgraber Thank you for the thorough explanation which makes a lot of sense. I understand the challenges here and the rationale.

To loop back to my use case and what we can do ... Now, this is a design which some of our network boffins came up with, so I cannot speak to the absolute specifics or whether there is "a better way", but essentially this is all to facilitate targeting an interface on our bridge in order to allow iptables to do IP anti-spoof filtering.

So, what we do is name the network device with host_name, which means we can then have iptables rules on the host firewall such as (UFW syntax):

-I ufw-before-forward -m physdev --physdev-in veXexample --physdev-is-bridged ! -s 45.148.28.248/32 -j DROP
-I ufw-before-forward -m physdev --physdev-in veXexample --physdev-is-bridged -m mac ! --mac-source 00:16:3e:b0:5d:07 -j DROP

These rules would then apply to the container named "example" where we've named the interface with lxc config device set example eth0 host_name=veXexample

There is an internal push here of which I am aware to shift all this to our hardware devices in the datacenter(s) so this may all become moot in the future, but for now this is how we are handling this.

In any case - we need this naming of the NIC in order to facilitate this filtering. This has the side effect, as discussed, that any ingress/egress limits set in the first profile applied to the container will be "set in stone". Later, if the client changes their profile, they will not receive the new port speed as defined in the profile.

All I really need at this point is to solve this issue so our clients get what they are paying for, as this is a real bug in our platform at this time :)

I haven't done much testing, but the approach I am favoring at this point is simply to read out the ingress/egress limits as defined in the new profile I want to apply to a container. Once I have those numbers, I simply call lxc and override them explicitly for the container.

That doesn't solve the scenario where we might want to change the ingress/egress for a given profile for some reason and want that to apply to all containers on that profile... Here I'm guessing I'll have to script my way out of it and iterate through every container on that profile and set the new limits.
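
Something along these lines is what I'm imagining (untested sketch; it assumes each container already has a local eth0 override, as ours do because of host_name):

#!/bin/sh
# Re-apply a profile's NIC limits to every container using that profile
PROFILE=largeprofile
ING=$(lxc profile device get "$PROFILE" eth0 limits.ingress)
EGR=$(lxc profile device get "$PROFILE" eth0 limits.egress)
for c in $(lxc list --format csv -c n); do
    # The profiles list in 'lxc config show' output contains a "- largeprofile" line
    if lxc config show "$c" | grep -qx -- "- $PROFILE"; then
        lxc config device set "$c" eth0 limits.ingress "$ING"
        lxc config device set "$c" eth0 limits.egress "$EGR"
    fi
done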

Thoughts? :)

tomponline commented 2 years ago

@webdock-io LXD has built-in support for filtering by MAC/IP using the NIC device security.mac_filtering and security.ipv{n}_filtering settings (see https://linuxcontainers.org/lxd/docs/master/instances/#nic-bridged). However, as you're using an external bridge as the NIC parent, you would need to tell LXD (via device-level config) what IPs are allowed by setting ipv{n}.address on the NIC - which would require overriding the profile's NIC config again (introducing the same issue as specifying the host_name).
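
For example (hypothetical instance name, reusing the IP from your rules), that would look something like:

lxc config device set c1 eth0 security.mac_filtering true
lxc config device set c1 eth0 security.ipv4_filtering true
lxc config device set c1 eth0 ipv4.address 45.148.28.248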

As an aside, is your IP filtering also covering spoofed IPs embedded within an ARP or NDP advert to prevent an instance from advertising it has an IP it shouldn't?

Here I'm guessing I'll have to script my way out of it and iterate through every container on that profile and set the new limits.

That would certainly be an approach as the instance config specifies which profiles have been applied to it.

webdock-io commented 2 years ago

@tomponline

I think way back when my network team set up this filtering and our bridges they were not even aware that LXD could manage ip spoofing and MAC filtering for us if we used the LXD managed bridge. Are you telling me that if we use the managed LXD Bridge it will also take care of embedded ARP/NDP spoofs? If so, what methods do you use to accomplish this?

I believe we use ebtables in much the same way as iptables for ARP, but I'm not sure we are doing anything for IPv6 NDP.

This is probably veering into "general questions and comments" territory best suited for the forums, but I'll ask them here anyway for completeness:

Looking at the documentation for nic-bridged, it seems to me you can only add a single IP address to the device config if you want a static IP? We need to be able to assign multiple IPs and ranges, so we are injecting static network config directly into the container, which provides a lot of flexibility (granted, with somewhat more overhead than if we just had the bridge manage IPs).

Anyway, as you pointed out, even if we can assign multiple IPs or ranges we'd still have the issue with this overriding the profile and causing complexities in managing network device settings in a scriptable/dynamic fashion.

To return to the solutions suggested by @stgraber

In general, pinning the host_name to any value is problematic as we can't be 100% sure that any given name would be freed up properly by the kernel in a timely manner. So even if it was possible to reliably set that name somehow, you'd most likely get instances failing to restart every so often.

Maybe we can make the way we generate the host veth name configurable, so that you can choose for it to be based on the MAC or use a different prefix/template. This would be fine to put in a profile and may satisfy your needs.

I can mention that we have very strict validation on instance names and they are guaranteed to be unique and follow a format (and length) which satisfies the system requirements. We have never seen an instance fail to restart due to an issue with setting the host_name - this is across dozens of hosts, thousands of containers over a period of a couple of years. So in our case this does not seem to be an issue.

It would be quite excellent to get some mechanism whereby we can set the host name (or reliably predict it from some other attribute) without it causing the device override. However, I have made workarounds for the current issue in our CLI tools already, so they will now correctly reapply any ingress/egress limits. So my immediate issue is solved by some additional code on my end - but other users may still encounter this behavior in other contexts. I leave it up to you to decide if this is a priority or not :)

Thank you for your input and attention on this issue, greatly appreciated.

stgraber commented 2 years ago

I think way back when my network team set up this filtering and our bridges they were not even aware that LXD could manage ip spoofing and MAC filtering for us if we used the LXD managed bridge. Are you telling me that if we use the managed LXD Bridge it will also take care of embedded ARP/NDP spoofs? If so, what methods do you use to accomplish this?

I believe we use ebtables in much the same way as iptables for ARP, but I'm not sure we are doing anything for IPv6 NDP.

We do it with some very very long ebtables rules ;)

https://github.com/lxc/lxd/blob/master/lxd/firewall/drivers/drivers_xtables.go#L1003 is where most of that is generated.
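
To give a flavour of the shape (this is not LXD's exact generated rule set, just the general form of an ARP anti-spoof rule, reusing the interface and IP from the examples above):

ebtables -t filter -A FORWARD -p ARP -i veXexample --arp-ip-src ! 45.148.28.248 -j DROP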

Looking at the documentation for nic-bridged, it seems to me you can only add a single IP address to the device config if you want a static IP? We need to be able to assign multiple IPs and ranges, so we are injecting static network config directly into the container, which provides a lot of flexibility (granted, with somewhat more overhead than if we just had the bridge manage IPs).

Additional addresses and subnets are usually handled separately with ipv4.routes or ipv4.routes.external (when dealing with dynamic routing and such). I'm not sure how those behave with the filtering options though, as you'd likely need the ipv4/ipv6 filters to be extended to cover that, which I don't remember seeing.
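
For example (hypothetical instance name and placeholder subnet), an extra range would normally be attached with something like:

lxc config device set c1 eth0 ipv4.routes 203.0.113.0/28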

@tomponline is that something we're doing and I forgot or do we need to fix that?

I can mention that we have very strict validation on instance names and they are guaranteed to be unique and follow a format (and length) which satisfies the system requirements. We have never seen an instance fail to restart due to an issue with setting the host_name - this is across dozens of hosts, thousands of containers over a period of a couple of years. So in our case this does not seem to be an issue.

You got lucky or your containers aren't doing anything too exciting. Over the years, we've seen a LOT of bugs in the Linux network stack which cause reference counting issues. When that happens, the container's network namespace won't get deleted until the last references are gone (which is problematic in case of circular dependencies); until that happens, all interfaces and addresses which belong to the namespace will still exist, causing potential conflicts.

To be fair, you need to do some slightly less common stuff in those containers to trigger that most of the time, like running nested containers on a bridge, having a bunch of firewall rules, having a bunch of virtual interfaces, ...

It would be quite excellent to get some mechanism whereby we can set the host name (or reliably predict it from some other attribute) without it causing the device override. However, I have made workarounds for the current issue in our CLI tools already, so they will now correctly reapply any ingress/egress limits. So my immediate issue is solved by some additional code on my end - but other users may still encounter this behavior in other contexts. I leave it up to you to decide if this is a priority or not :)

Having an instances.network.host_name global config option that can have 'random' or 'mac' as its value is something that should be pretty cheap to do and would take care of such issues. By default we'd generate random host interface names as we do today, but if it's set to 'mac', then we'd use lxd001122334455, which just happens to fit in the 15 bytes we have.
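
Hypothetical usage, assuming the option lands under that name (the final name may differ when implemented):

lxc config set instances.network.host_name mac
# host-side interface for hwaddr 00:16:3e:c9:96:63 would then be named lxd00163ec99663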

webdock-io commented 2 years ago

@stgraber Thank you for the link to the filtering rule generation - I will have my team review this and see if there is something in there we can grab to enhance our own filtering. It seems the NDP rules are potentially not something we are doing at the moment, so this is helpful.

As to issues with network namespacing: the only thing that comes to mind is that we do, from time to time, run into this ol' chestnut:

https://github.com/moby/moby/issues/5618
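
For reference, the kernel log line that issue tracks is of the form:

unregister_netdevice: waiting for lo to become free. Usage count = 1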

This has become rare in the past year or so, but we saw it recently on a fully up-to-date system on the latest HWE kernel, so it is indeed still an issue - though I have the feeling some fixes must have been made, as it's become very rare where before it would happen on almost a bi-weekly basis. Whenever this crops up, we are forced to do a system reboot to recover.

Is this the type of issue you are referring to, or something else? :)

stgraber commented 2 years ago

Yep, that kernel logging message is indeed the user-visible symptom of a network namespace that's not going away, and depending on what's keeping it around, it can lead to duplicate addresses on the network, interface name conflicts, ...

tomponline commented 2 years ago

Additional addresses and subnets are usually handled separately with ipv4.routes or ipv4.routes.external (when dealing with dynamic routing and such). I'm not sure how those behave with the filtering options though, as you'd likely need the ipv4/ipv6 filters to be extended to cover that, which I don't remember seeing.

@tomponline is that something we're doing and I forgot or do we need to fix that?

No, that's not handled at the moment. Shall we create an issue to track adding that?

tomponline commented 2 years ago

Are you telling me that if we use the managed LXD Bridge it will also take care of embedded ARP/NDP spoofs? If so, what methods do you use to accomplish this?

Yes, we use iptables/ebtables for doing this for ARP/NDP with xtables driver (see https://github.com/lxc/lxd/blob/master/lxd/firewall/drivers/drivers_xtables.go#L1085-L1092, https://github.com/lxc/lxd/blob/master/lxd/firewall/drivers/drivers_xtables.go#L1111-L1116 and https://github.com/lxc/lxd/blob/master/lxd/firewall/drivers/drivers_xtables.go#L1021-L1026), and for the nftables driver see https://github.com/lxc/lxd/blob/master/lxd/firewall/drivers/drivers_nftables_templates.go#L181-L184 and https://github.com/lxc/lxd/blob/master/lxd/firewall/drivers/drivers_nftables_templates.go#L206

stgraber commented 2 years ago

No, that's not handled at the moment. Shall we create an issue to track adding that?

Yeah, I think we should, feels like a bug.

tomponline commented 2 years ago

@stgraber can this be closed?

stgraber commented 2 years ago

We should open an issue to introduce instances.network.host_name, then we can close this one I think.

tomponline commented 2 years ago

https://github.com/lxc/lxd/issues/10036

tomponline commented 2 years ago

FYI the host-side MAC-derived interface naming has been added in https://github.com/lxc/lxd/pull/10212