canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

security.privileged + ubuntu-daily:noble doesn't work - systemd services fail to start #12967

Closed peat-psuwit closed 3 weeks ago

peat-psuwit commented 8 months ago

Required information

Issue description

Setting security.privileged=true on an ubuntu-daily:noble container makes it "fail to start". By that I mean that many services do not start, including systemd-tmpfiles-setup-dev.service, systemd-resolved.service and systemd-networkd.service. The errors include "systemd-resolved.service: Failed to set up credentials: Protocol error" and "systemd-networkd.service: Failed to set up mount namespacing: Permission denied".

This may be related to https://github.com/lxc/lxc/issues/4402 and seems to be related to AppArmor.

Steps to reproduce

  1. lxc init ubuntu-daily:noble noble-test
  2. lxc config set noble-test security.privileged true
  3. lxc start noble-test
  4. lxc exec noble-test -- journalctl --boot (the journal shows a lot of failures)
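
To condense the journal from step 4 into a list of affected units, a small filter can help. This is a minimal sketch and not part of the original report: the function name failed_units is made up for illustration, and it assumes the failures show up as "<unit>.service: Failed ..." lines like the two errors quoted in the issue description.

```shell
# failed_units: read journal text on stdin and print every unit that logged a
# "<unit>.service: Failed ..." line, once each, sorted.
# (Hypothetical helper; the line format is assumed from the errors quoted above.)
failed_units() {
    grep -oE '[A-Za-z0-9@._-]+\.service: Failed' \
        | sed 's/: Failed$//' \
        | sort -u
}
```

It could be used as: lxc exec noble-test -- journalctl --boot | failed_units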

Information to attach

lxc noble-test 20240226191344.130 ERROR conf - ../src/src/lxc/conf.c:turn_into_dependent_mounts:3948 - No such file or directory - Failed to recursively turn old root mount tree into dependent mount. Continuing...

 - [x] Container configuration (`lxc config show NAME --expanded`): [container-config.yml.txt](https://github.com/canonical/lxd/files/14409841/container-config.yml.txt)

 - [x] Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)

time="2024-02-27T01:39:13+07:00" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"


 - [x] Output of the client with --debug: [lxd-lxc-start.log](https://github.com/canonical/lxd/files/14409861/lxd-lxc-start.log)
 - [x] Output of the daemon with --debug (alternatively output of `lxc monitor` while reproducing the issue): [lxd-lxc-monitor.log](https://github.com/canonical/lxd/files/14409868/lxd-lxc-monitor.log)

tomponline commented 8 months ago

@mihalicyn is this related to the known issue with apparmor parser bug + LXD's workaround apparmor profile and recent versions of systemd?

mihalicyn commented 8 months ago

Hi @peat-psuwit!

Thanks a lot for your report.

Yes, we are aware of some AppArmor issues that occur when a privileged container is used. We strongly recommend always using unprivileged containers, as they are safer and also much more stable.

As a workaround, I suggest running: lxc config set noble-test security.nesting=true

This should let Ubuntu Noble start. Be aware, though, that this is not fully safe: in theory, a user can mount any file system inside the container when security.nesting is enabled together with security.privileged (in the unprivileged case, nesting is safe!). I'm not aware of any practical exploit for it, but I want to warn you. We plan to do something about these AppArmor issues, but unfortunately not everything depends on us, because AppArmor is an external tool.

Let's keep in touch on this. And thanks again for reporting!

Kind regards, Alex

tomponline commented 8 months ago

@mihalicyn do you have a link to the reported apparmor issue from 2016?

tomponline commented 8 months ago

@mihalicyn my understanding is that we can address this issue once LXD's snap has a newer apparmor rule parser, which will occur when we switch the base snap to core24. I'll mark this as "later" for now.

peat-psuwit commented 8 months ago

Hi @mihalicyn

The security.nesting workaround works. I don't worry about security too much, as our use case is for development and requires mounting the user's directory into the container anyway.

Our script [1], which was written some time ago, specifically asks for a privileged container if LXD is detected to be a snap, presumably because it won't read the host's /etc/sub{uid,gid}. But a bit of research shows that one can set a custom idmap for the container [2], so I might go that route instead.

[1] If it rings anyone's bell, this is Crossbuilder.
[2] https://documentation.ubuntu.com/lxd/en/latest/userns-idmap/#custom-idmaps
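
The custom idmap route from [2] could look roughly like this. This is a minimal sketch, assuming the raw.idmap entry syntax from that documentation page ("uid <host> <container>" and "gid <host> <container>"). The helper names and the container-side ID 1000 are illustrative assumptions, and the host may need additional configuration that is not covered here.

```shell
# idmap_value HOST_UID HOST_GID CT_UID CT_GID
# Print a raw.idmap value that maps one host UID/GID to one container UID/GID.
# (Hypothetical helper name.)
idmap_value() {
    printf 'uid %s %s\ngid %s %s\n' "$1" "$3" "$2" "$4"
}

# map_current_user INSTANCE
# Map the calling user to UID/GID 1000 inside the instance.
# 1000 is an assumption about the default user in the image.
map_current_user() {
    lxc config set "$1" raw.idmap "$(idmap_value "$(id -u)" "$(id -g)" 1000 1000)"
}
```

For example, map_current_user noble-test, followed by a restart of the container on the assumption that the new map is only applied at the next start.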

mihalicyn commented 8 months ago

I did some additional investigation and found that systemd these days wants even more than just changing mount propagation flags. It also wants to rbind /, do pivot_root and so on. So, effectively, nesting must be enabled to make systemd happy anyway, which completely defeats the security model of privileged containers. :-(

Patch example (only for experimental use!):

diff --git a/lxd/apparmor/instance_lxc.go b/lxd/apparmor/instance_lxc.go
index d5c9470ad..2eefa63ac 100644
--- a/lxd/apparmor/instance_lxc.go
+++ b/lxd/apparmor/instance_lxc.go
@@ -85,14 +85,14 @@ profile "{{ .name }}" flags=(attach_disconnected,mediate_deleted) {
   mount fstype=tmpfs,

   # Allow limited modification of mount propagation
-  mount options=(rw,slave) -> /,
-  mount options=(rw,rslave) -> /,
-  mount options=(rw,shared) -> /,
-  mount options=(rw,rshared) -> /,
-  mount options=(rw,private) -> /,
-  mount options=(rw,rprivate) -> /,
-  mount options=(rw,unbindable) -> /,
-  mount options=(rw,runbindable) -> /,
+  mount options=(rw,slave) -> **,
+  mount options=(rw,rslave) -> **,
+  mount options=(rw,shared) -> **,
+  mount options=(rw,rshared) -> **,
+  mount options=(rw,private) -> **,
+  mount options=(rw,rprivate) -> **,
+  mount options=(rw,unbindable) -> **,
+  mount options=(rw,runbindable) -> **,

   # Allow various ro-bind-*re*-mounts of anything except /proc, /sys and /dev/.lxc
   mount options=(ro,remount,bind) /[^spd]*{,/**},
@@ -296,6 +296,33 @@ profile "{{ .name }}" flags=(attach_disconnected,mediate_deleted) {
   mount options=(rw,rbind) /sy[^s]*{,/**},
   mount options=(rw,rbind) /sys?*{,/**},

+  # workaround modern systemd (unsafe!)
+  mount options=(rw,bind) /,
+  mount options=(rw,bind) /**,
+  mount options=(rw,rbind) /,
+  mount options=(rw,rbind) /**,
+  # Allow common combinations of bind/remount
+  # NOTE: AppArmor bug effectively turns those into wildcards mount allow
+  mount options=(ro,remount,bind),
+  mount options=(ro,remount,bind,nodev),
+  mount options=(ro,remount,bind,nodev,nosuid),
+  mount options=(ro,remount,bind,noexec),
+  mount options=(ro,remount,bind,noexec,nodev),
+  mount options=(ro,remount,bind,nosuid),
+  mount options=(ro,remount,bind,nosuid,nodev),
+  mount options=(ro,remount,bind,nosuid,noexec),
+  mount options=(ro,remount,bind,nosuid,noexec,nodev),
+  mount options=(ro,remount,bind,noatime),
+  mount options=(ro,remount,bind,noatime,nodev),
+  mount options=(ro,remount,bind,noatime,noexec),
+  mount options=(ro,remount,bind,noatime,nosuid),
+  mount options=(ro,remount,bind,noatime,noexec,nodev),
+  mount options=(ro,remount,bind,noatime,nosuid,nodev),
+  mount options=(ro,remount,bind,noatime,nosuid,noexec),
+  mount options=(ro,remount,bind,noatime,nosuid,noexec,nodev),
+  mount options=(ro,remount,bind,nosuid,noexec,strictatime),
+  mount options=(ro,remount,nosuid,noexec,strictatime),
+
   # Allow moving mounts except for /proc, /sys and /dev/.lxc
   mount options=(rw,move) /[^spd]*{,/**},
   mount options=(rw,move) /d[^e]*{,/**},

> The security.nesting workaround works. I don't worry about security too much, as our use case is for development and requires mounting the user's directory into the container anyway.

That's a perfectly valid use case for privileged containers. In any other case, it's strongly recommended to use unprivileged ones.

Just for future reference, when debugging cases like this it usually makes sense to check:

  1. dmesg | grep DENIED
  2. whether lxc config set <CT NAME> security.nesting=true helps. If not, go to step 3.
  3. whether lxc config set <CT NAME> raw.lxc="lxc.apparmor.profile = unconfined" helps.
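
The three checks above can be wrapped in two small helpers. This is a sketch only: the function names are made up for illustration, and the lxc invocations are the ones listed in steps 2 and 3.

```shell
# apparmor_denials: step 1, keep only the DENIED lines from kernel log text on stdin.
apparmor_denials() {
    grep 'DENIED' || true   # finding no denials is not an error
}

# apply_workaround INSTANCE nesting|unconfined: steps 2 and 3.
# (Hypothetical helper; it only runs the commands from the list above.)
apply_workaround() {
    case "$2" in
        nesting)    lxc config set "$1" security.nesting=true ;;
        unconfined) lxc config set "$1" raw.lxc="lxc.apparmor.profile = unconfined" ;;
        *)          echo "unknown workaround: $2" >&2; return 1 ;;
    esac
}
```

A session might run dmesg | apparmor_denials first, then apply_workaround noble-test nesting, and move on to apply_workaround noble-test unconfined only if nesting does not help. Restarting the container between attempts is assumed to be necessary for the setting to take effect.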

@tomponline bug link is https://bugs.launchpad.net/apparmor/+bug/1597017

mihalicyn commented 7 months ago

This issue is a real pain. As mentioned above, systemd now creates mount namespaces by default and performs a recursive bind mount of / (inside the container) to some other path. Our AppArmor profile (without nesting enabled) blocks this, because allowing it completely defeats the path-based AppArmor restrictions.

See, for example:

  # Block dangerous paths under /proc/sys
  deny /proc/sys/[^kn]*{,/**} wklx,
  deny /proc/sys/k[^e]*{,/**} wklx,
  deny /proc/sys/ke[^r]*{,/**} wklx,
  deny /proc/sys/ker[^n]*{,/**} wklx,
  deny /proc/sys/kern[^e]*{,/**} wklx,
  deny /proc/sys/kerne[^l]*{,/**} wklx,

If an attacker wants to bypass this and can do a recursive bind mount, they can:

mkdir /mnt/attacker_playground
mount --rbind / /mnt/attacker_playground
# cool! Now we can access /proc/sys/kernel through /mnt/attacker_playground/proc/sys/kernel/...

That's the reason why security.nesting is not safe when used with privileged containers, while being absolutely safe with unprivileged ones.
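
The bypass can be illustrated with a toy model of path-based matching. This is only a sketch: it uses shell glob matching as a stand-in for AppArmor's matcher and models just the first deny rule shown above. The function names and example paths are chosen for illustration.

```shell
# denied_by_first_rule PATH
# Succeeds if PATH matches a shell-glob approximation of
#   deny /proc/sys/[^kn]*{,/**} wklx,
# ([!kn] is the POSIX shell spelling of [^kn]).
denied_by_first_rule() {
    case "$1" in
        /proc/sys/[!kn]*) return 0 ;;
        *)                return 1 ;;
    esac
}

# report PATH: print whether the toy rule matches PATH.
report() {
    if denied_by_first_rule "$1"; then
        echo "matched: $1"
    else
        echo "not matched: $1"
    fi
}
```

In this model, report /proc/sys/vm/overcommit_memory prints "matched: ...", while report /mnt/attacker_playground/proc/sys/vm/overcommit_memory prints "not matched: ...". After the recursive bind mount both paths lead to the same file, but the rule only sees the path.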

tomponline commented 3 weeks ago

@peat-psuwit as using privileged containers with Ubuntu Noble is becoming unviable due to AppArmor restrictions and the demands of more recent systemd versions, could you please tell us what you are doing with containers that requires privileged containers?

We are interested in knowing the use cases so we can look at additional functionality for unprivileged containers.

Thanks

peat-psuwit commented 3 weeks ago

We're using an LXD container as a development container, similar to the devcontainer.json spec, where we have to mount a directory from the host into the container.

https://github.com/ubports/crossbuilder

We used to modify /etc/subuid and /etc/subgid so that the user's UID/GID is mapped into the container. Then, when the deb -> snap migration happened in Ubuntu, we had to move to a privileged container because (if I understand correctly) those files are not effective with the snap, or something like that.

Reading the LXD documentation, I guess we'd be better served by the disk device's shift=true option or the raw.idmap container option. It's just that we've never gotten around to implementing that in our script, because our use case is developmental in nature anyway.

tomponline commented 3 weeks ago

Yes, using a disk device here is most certainly the preferable and supported way to achieve this.

See https://documentation.ubuntu.com/lxd/en/latest/reference/devices_disk/
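
A minimal sketch of the disk device approach, assuming the shift=true option mentioned above is supported on the host (per the linked reference, it depends on kernel and storage support). The helper name share_dir, the device name hostdir and the example paths are hypothetical.

```shell
# share_dir INSTANCE HOST_DIR CONTAINER_DIR
# Attach HOST_DIR to an unprivileged instance as a disk device with ID shifting,
# so no privileged container is needed for the mount.
# "hostdir" is a hypothetical device name.
share_dir() {
    lxc config device add "$1" hostdir disk source="$2" path="$3" shift=true
}
```

For example: share_dir noble-test /home/user/project /src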