debops / ansible-lxc

Configure and manage LXC environment on a host
GNU General Public License v3.0

Jessie container with systemd fails to start with DebOps default configuration #15

Open ganto opened 9 years ago

ganto commented 9 years ago

Although this was already discussed on IRC, I am opening an issue to track the problem and its progress.

Starting position: Create a Debian Jessie container on a Jessie LXC host with DebOps:

lxc_containers:
  - name: 'jessie01'
    template_options: '--release jessie'

This will install systemd by default.

Error: When trying to start the container, the following error appears:

# lxc-start -n jessie01
Failed to mount tmpfs at /dev/shm: Operation not permitted

Reason: 'cap_sys_admin' is dropped in /var/lib/lxc/jessie01/config, as defined in defaults/main.yml, which prevents systemd from mounting some required file systems:

# List of default POSIX capabilities which should be dropped in all LXC containers
lxc_capabilities_drop: [ 'mknod', 'sys_admin', 'sys_rawio', 'syslog', 'wake_alarm' ]
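To confirm that a given container really has the capability dropped, the generated config can be inspected. The following is a minimal sketch, assuming the drop list ends up in `lxc.cap.drop` lines of the container config; `drops_cap` is a hypothetical helper name, not part of LXC or DebOps.

```shell
# Hypothetical helper: succeed if CONFIG drops capability CAP via lxc.cap.drop.
# Assumes capabilities are written without the "cap_" prefix (e.g. sys_admin).
drops_cap() {
    config=$1
    cap=$2
    awk -v cap="$cap" '
        # An lxc.cap.drop line may list several space-separated capabilities.
        /^[[:space:]]*lxc\.cap\.drop[[:space:]]*=/ {
            sub(/^[^=]*=/, "")              # keep only the value part
            n = split($0, caps, " ")
            for (i = 1; i <= n; i++)
                if (caps[i] == cap) found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$config"
}
```

For example, `drops_cap /var/lib/lxc/jessie01/config sys_admin && echo "sys_admin is dropped"`.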

Known Work-Arounds

Unsuccessful Work-Around: I also tried to drop 'cap_sys_admin' and have LXC mount the required file systems without systemd involvement. For this I added:

lxc.mount.entry = tmpfs dev/shm tmpfs nosuid,nodev 0 0
lxc.mount.entry = tmpfs run tmpfs nosuid,relatime 0 0
lxc.mount.entry = tmpfs run/lock tmpfs nosuid,nodev,noexec,relatime 0 0

Unfortunately this fails with the message that /run/lock doesn't exist:

lxc-start: No such file or directory - failed to mount 'tmpfs' on '/usr/lib/x86_64-linux-gnu/lxc/rootfs/run/lock'

Bugs

Although I could live with the mentioned systemd bug, I'm still trying to find a way to run it without 'cap_sys_admin'. The challenges then are:

If there are other possible work-arounds or any hints regarding my open questions, please let me know. I'll post an update once I find out more.

ganto commented 9 years ago

Hi everyone

I found the answer to the LXC mounting error in "Re: [systemd-devel] logind vs CAP_SYS_ADMIN-lessness": there is a mount option, create=dir.

With the following additional entries in /var/lib/lxc/jessie01/config, it's possible to boot a Jessie systemd container without 'cap_sys_admin':

# Custom container options
lxc.mount.auto = cgroup:mixed
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
lxc.mount.entry = debugfs sys/kernel/debug debugfs rw,relatime 0 0
lxc.mount.entry = mqueue dev/mqueue mqueue rw,relatime,create=dir 0 0
lxc.mount.entry = hugetlbfs dev/hugepages hugetlbfs rw,relatime,create=dir 0 0
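When adding such entries by hand to several containers, it helps to append each line only if it is not already present. This is a minimal sketch using plain `grep`; `ensure_line` is a hypothetical helper name, and it assumes the config file already ends with a newline.

```shell
# Hypothetical helper: append LINE to FILE unless an identical line exists.
ensure_line() {
    file=$1
    line=$2
    # -x: match the whole line, -F: fixed string (no regex), -q: quiet.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

For example, `ensure_line /var/lib/lxc/jessie01/config 'lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0'` can be run repeatedly without duplicating the entry.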

Also make sure that you have the following line in your /etc/lxc/lxc.conf:

lxc.cgroup.use = @all

Otherwise the container start will fail with the following error:

# lxc-start -n jessie01
Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted

ganto commented 9 years ago

After merging #16, setting up a Jessie container on a Jessie LXC host should now work out of the box.

lyrixx commented 8 years ago

Looks like this issue can be closed?

drybjed commented 8 years ago

Well, nobody else raised any issues, so I guess this can be closed. :-)

kartoffelheinz commented 8 years ago

Sorry to re-open this, but the issue came back with Linux kernel 4.6. None of the workarounds except "Don't drop 'cap_sys_admin' in your container" works. After reverting to kernel 4.5, everything works as expected. This might be related to the addition of cgroup namespace support in the kernel; see this (and subsequent) pull request: http://lkml.iu.edu/hypermail/linux/kernel/1603.2/02432.html
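To tell quickly whether a host runs a kernel in the affected range, the release string can be compared against 4.6. This is only a sketch: `kernel_at_least` is a hypothetical helper, and it relies on `sort -V` (version sort) being available, as in GNU coreutils.

```shell
# Hypothetical helper: succeed if kernel release $1 is at least version $2.
kernel_at_least() {
    current=${1%%-*}    # "4.9.0-3-amd64" -> "4.9.0"
    wanted=$2
    # Version-sort both strings; if the wanted version sorts first
    # (or they are equal), the current kernel is new enough.
    lowest=$(printf '%s\n%s\n' "$wanted" "$current" | sort -V | head -n 1)
    [ "$lowest" = "$wanted" ]
}
```

For example, `kernel_at_least "$(uname -r)" 4.6 && echo "kernel 4.6 or newer"`.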

Do you guys know any workaround here?

drybjed commented 8 years ago

@kartoffelheinz Unfortunately I haven't heard anything about this issue yet. If you find a solution, it would be great to hear it. Thanks for the heads up, I reopened the issue in case anybody else is interested.

geaaru commented 7 years ago

Hi, in case it helps: from kernel >= 4.6 on, the cgroup API/features have been rewritten. As described on the Gentoo wiki (https://wiki.gentoo.org/wiki/LXC#Configuring_unprivileged_LXC), starting an unprivileged container requires mounting a cgroup filesystem with the name systemd:

# mkdir -p /sys/fs/cgroup/systemd
# mount -t cgroup -o none,name=systemd systemd /sys/fs/cgroup/systemd

I tested this with kernels 4.8 and 4.9. This solution uses the cgroup v1 API; I currently don't know how to use the cgroup v2 API correctly with unprivileged containers.
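Whether the named hierarchy is actually mounted can be checked from the mount table. This is a minimal sketch that scans a file in /proc/self/mounts format for a cgroup mount carrying the name=systemd option; `has_systemd_cgroup` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if a cgroup (v1) mount with name=systemd exists.
# Reads /proc/self/mounts by default; another file can be passed for testing.
has_systemd_cgroup() {
    mounts=${1:-/proc/self/mounts}
    # Field 3 is the filesystem type, field 4 the comma-separated options.
    awk '$3 == "cgroup" && $4 ~ /(^|,)name=systemd(,|$)/ { found = 1 }
         END { exit found ? 0 : 1 }' "$mounts"
}
```

For example, `has_systemd_cgroup || echo "name=systemd hierarchy is not mounted"`.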

sherpya commented 7 years ago

@geaaru's method worked for me. If you don't use systemd on the host, you can add these lines to fstab:

cgroup  /sys/fs/cgroup  cgroup  defaults    0   0
systemd /sys/fs/cgroup/systemd  cgroup  name=systemd,x-mount.mkdir=0555 0   0

perhaps I'm still unable to mount with the name=systemd option

kartoffelheinz commented 7 years ago

This issue is still a major PITA.

As of now, it is impossible to run privileged containers without the sys_admin capability on the latest Debian stable using the 4.9 kernel with systemd present in both host and guest. The system will not boot, and you can see the following errors in the console / logfile:

Freezing execution.
Failed to mount tmpfs at /sys/fs/cgroup: Operation not permitted
Failed to mount cgroup at /sys/fs/cgroup/systemd: No such file or directory
[!!!!!!] Failed to mount API filesystems, freezing.

None of the workarounds change that (adding cap_sys is not a workaround anybody should consider); the only way to make it work is to use the old Debian Jessie 3.16 kernel.

luken commented 2 years ago

Note in case someone else runs into this.

Just updated my LXC host to Debian 11 (bullseye) and had some issues with old containers (config not managed by DebOps). I only had to add the following lines to each node's /var/lib/lxc//config file to get them to start.

# needed for drop_cap sys_admin
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0

The symptom was:

$ lxc-start --foreground --logpriority debug --name container1
Failed to mount tmpfs at /dev/shm: Operation not permitted
Failed to mount tmpfs at /run: Operation not permitted
Failed to mount tmpfs at /run/lock: Operation not permitted
[!!!!!!] Failed to mount API filesystems.
Exiting PID 1...
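To see at a glance which mount points need an `lxc.mount.entry`, the failing paths can be pulled out of the start-up output. This is a minimal sketch based only on the message format shown above; `failed_mounts` is a hypothetical helper name.

```shell
# Hypothetical helper: read start-up output on stdin and print each mount
# point that failed with "Operation not permitted", one per line.
failed_mounts() {
    sed -n 's/^Failed to mount [^ ]* at \(.*\): Operation not permitted$/\1/p'
}
```

Assuming the messages reach the terminal as in the output above, it could be used as `lxc-start --foreground --name container1 2>&1 | failed_mounts`.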