canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0
Error reading host's cpuset.cpus #10441

Closed (bet4it closed this issue 2 years ago)

bet4it commented 2 years ago

I get the error "Error reading host's cpuset.cpus", which has been discussed in https://discuss.linuxcontainers.org/t/getting-below-erro-while-starting-lxd/10005. Booting with systemd.unified_cgroup_hierarchy=0 works around it for me, but I want to know why it happens, so I tried to find the cause.

This is my environment:

$ cat /proc/1/cgroup
1:net_cls:/
0::/init.scope
$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
net_cls on /sys/fs/cgroup/net_cls type cgroup (rw,relatime,net_cls)

The code that causes this error is https://github.com/lxc/lxd/blob/4de2c3611e54c2209409b1e66ec1254b23c4f412/lxd/cgroup/file.go#L69-L81. There, controller is cpuset and rw.paths[controller] is empty. cgLayout is CgroupsHybrid, so the final path we get is just cpuset.cpus.effective with no directory in front of it, which can't be opened.
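The lookup described above can be modeled with a small shell sketch. This is a simplified illustration, not LXD's actual code: resolve_cgroup_file and its arguments are hypothetical names, and the only behavior modeled is the one reported here (an empty per-controller directory on a non-unified layout leaves a bare file name).

```shell
#!/bin/sh
# Hypothetical model of the lookup described above; NOT LXD's actual code.
# Usage: resolve_cgroup_file LAYOUT CONTROLLER_DIR UNIFIED_DIR KEY
resolve_cgroup_file() {
    layout=$1
    controller_dir=$2
    unified_dir=$3
    key=$4

    # Only a fully unified layout uses the cgroup2 root directory.
    if [ "$layout" = "unified" ]; then
        dir=$unified_dir
    else
        dir=$controller_dir
    fi

    # An empty directory leaves a bare, relative file name.
    if [ -n "$dir" ]; then
        printf '%s/%s\n' "$dir" "$key"
    else
        printf '%s\n' "$key"
    fi
}

# Hybrid layout with no cpuset v1 mount: prints "cpuset.cpus.effective".
resolve_cgroup_file hybrid "" /sys/fs/cgroup cpuset.cpus.effective
# Unified layout: prints "/sys/fs/cgroup/cpuset.cpus.effective".
resolve_cgroup_file unified "" /sys/fs/cgroup cpuset.cpus.effective
```

In the first call the result is relative to the daemon's working directory rather than to /sys/fs/cgroup, which is consistent with the file not being found.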

stgraber commented 2 years ago

What distribution is this on? And what kind of LXD package is this?

You're running some very odd hybrid of cgroup2 and cgroup1 here so it could be that LXD isn't considering your system to be fully cgroup2 and that's causing some issues.

bet4it commented 2 years ago

I'm using the latest Manjaro with LXD 5.1 installed. I have another Manjaro machine that doesn't show this behavior, so net_cls was probably enabled by some package I installed.

jthompson333 commented 2 years ago

This might be happening to me as well; I'm on the same Manjaro setup. Do you know what enabled net_cls?

https://discuss.linuxcontainers.org/t/manjaro-unprivileged-container-stopped-starting-on-lxd/14214

jthompson333 commented 2 years ago

May 27 20:58:31 patton systemd[1]: Starting LXD Container Hypervisor...
May 27 20:58:31 patton lxd[4854]: time="2022-05-27T20:58:31-04:00" level=warning msg=" - Couldn't find the CGroup devices controller, device access control won't work"
May 27 20:58:31 patton lxd[4854]: time="2022-05-27T20:58:31-04:00" level=warning msg=" - Couldn't find the CGroup freezer controller, pausing/resuming containers won't work"
May 27 20:58:31 patton lxd[4854]: time="2022-05-27T20:58:31-04:00" level=warning msg=" - Couldn't find the CGroup hugetlb controller, hugepage limits will be ignored"
May 27 20:58:31 patton lxd[4854]: time="2022-05-27T20:58:31-04:00" level=warning msg=" - Couldn't find the CGroup network priority controller, network priority will be ignored"
May 27 20:58:31 patton lxd[4854]: time="2022-05-27T20:58:31-04:00" level=warning msg="Instance type not operational" driver=qemu err="KVM support is missing (no /dev/kvm)" type=vi>
May 27 20:58:32 patton dnsmasq[4926]: started, version 2.86 cachesize 150
May 27 20:58:32 patton dnsmasq[4926]: compile time options: IPv6 GNU-getopt DBus no-UBus i18n IDN2 DHCP DHCPv6 no-Lua TFTP conntrack ipset auth cryptohash DNSSEC loop-detect inoti>
May 27 20:58:32 patton dnsmasq-dhcp[4926]: DHCP, IP range , lease time 1h
May 27 20:58:32 patton dnsmasq-dhcp[4926]: DHCPv6 stateless on lxdbr0
May 27 20:58:32 patton dnsmasq-dhcp[4926]: DHCPv4-derived IPv6 names on lxdbr0
May 27 20:58:32 patton dnsmasq-dhcp[4926]: router advertisement on lxdbr0
May 27 20:58:32 patton dnsmasq-dhcp[4926]: DHCPv6 stateless on ::, constructed for lxdbr0
May 27 20:58:32 patton dnsmasq-dhcp[4926]: DHCPv4-derived IPv6 names on ::, constructed for lxdbr0
May 27 20:58:32 patton dnsmasq-dhcp[4926]: router advertisement on ::, constructed for lxdbr0
May 27 20:58:32 patton dnsmasq-dhcp[4926]: IPv6 router advertisement enabled
May 27 20:58:32 patton dnsmasq-dhcp[4926]: DHCP, sockets bound exclusively to interface lxdbr0
May 27 20:58:32 patton dnsmasq[4926]: using only locally-known addresses for lxd
May 27 20:58:32 patton dnsmasq[4926]: reading /etc/resolv.conf
May 27 20:58:32 patton dnsmasq[4926]: using nameserver 
May 27 20:58:32 patton dnsmasq[4926]: using only locally-known addresses for lxd
May 27 20:58:32 patton dnsmasq[4926]: read /etc/hosts - 6 addresses
May 27 20:58:32 patton dnsmasq-dhcp[4926]: read /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts/
May 27 20:58:32 patton dnsmasq-dhcp[4926]: read /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts/debian11-openqm.eth0
May 27 20:58:32 patton dnsmasq-dhcp[4926]: read /var/lib/lxd/networks/lxdbr0/dnsmasq.hosts/test-debian.eth0
May 27 20:58:32 patton dnsmasq-dhcp[4926]: not giving name patton to the DHCP lease of  because the name exists in /etc/hosts with address 127.0.1.1
May 27 20:58:33 patton lxd[4854]: time="2022-05-27T20:58:33-04:00" level=error msg="Error reading host's cpuset.cpus"
May 27 20:58:33 patton systemd[1]: Started LXD Container Hypervisor.
May 27 20:58:49 patton lxd[4854]: time="2022-05-27T20:58:49-04:00" level=error msg="Error reading host's cpuset.cpus"
May 27 20:58:49 patton lxd[4854]: time="2022-05-27T20:58:49-04:00" level=error msg="Error reading host's cpuset.cpus"
May 27 20:58:49 patton lxd[4854]: time="2022-05-27T20:58:49-04:00" level=error msg="Error reading host's cpuset.cpus"
May 27 20:58:51 patton lxd[4854]: time="2022-05-27T20:58:51-04:00" level=error msg="Error reading host's cpuset.cpus"
May 27 20:58:51 patton lxd[4854]: time="2022-05-27T20:58:51-04:00" level=error msg="Error reading host's cpuset.cpus"
May 27 20:58:51 patton lxd[4854]: time="2022-05-27T20:58:51-04:00" level=error msg="Error reading host's cpuset.cpus"

ja-softdevel commented 2 years ago

I'm having similar issues. I have Ubuntu 22.04 installed.

$ cat /proc/1/cgroup
10:net_prio:/
9:perf_event:/
8:net_cls:/
7:freezer:/
6:devices:/
3:cpuacct:/
0::/init.scope

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,relatime,net_cls)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_prio type cgroup (rw,relatime,net_prio)

$ snap logs lxd
2022-06-02T09:28:01-05:00 lxd.daemon[42309]: time="2022-06-02T09:28:01-05:00" level=error msg="Error reading host's cpuset.cpus"
2022-06-02T09:28:01-05:00 lxd.daemon[42120]: => First LXD execution on this system
2022-06-02T09:28:01-05:00 lxd.daemon[42120]: => LXD is ready
2022-06-02T09:29:53-05:00 lxd.daemon[42309]: time="2022-06-02T09:29:53-05:00" level=error msg="Error reading host's cpuset.cpus"
2022-06-02T09:29:53-05:00 lxd.daemon[42309]: time="2022-06-02T09:29:53-05:00" level=error msg="Failed closing listener connection" err="close unix /var/snap/lxd/common/lxd/unix.socket->@: use of closed network connection" listener=587f5bb8-3ac0-43f0-86bb-876e52dd6322
2022-06-02T09:29:54-05:00 lxd.daemon[42309]: time="2022-06-02T09:29:54-05:00" level=error msg="Error reading host's cpuset.cpus"
2022-06-02T09:30:22-05:00 lxd.daemon[42309]: time="2022-06-02T09:30:22-05:00" level=error msg="Failed closing listener connection" err="close unix /var/snap/lxd/common/lxd/unix.socket->@: use of closed network connection" listener=e13c10ea-45a7-4dd3-afc5-10e1ad764965
2022-06-02T09:31:00-05:00 lxd.daemon[42309]: time="2022-06-02T09:31:00-05:00" level=error msg="Error reading host's cpuset.cpus"
2022-06-02T09:31:00-05:00 lxd.daemon[42309]: time="2022-06-02T09:31:00-05:00" level=error msg="Failed closing listener connection" err="close unix /var/snap/lxd/common/lxd/unix.socket->@: use of closed network connection" listener=ba4e2b70-3416-4539-9d69-a813f2812716
2022-06-02T09:31:01-05:00 lxd.daemon[42309]: time="2022-06-02T09:31:01-05:00" level=error msg="Error reading host's cpuset.cpus"

Also, as a note, I have some Docker images that run systemd inside them (not my choice, I didn't make them). But Docker will not work, even with all the workarounds for mounting/passing through the dbus/systemd/cgroup stuff.

I'm starting to think the host OS is jammed up.

blauskaerm commented 2 years ago

I have the same issue. I'm running Manjaro 21.2.6 and LXD 5.0.0 on kernel 5.10.109-1-MANJARO.

stgraber commented 2 years ago

cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (rw,relatime,cpuacct)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,relatime,net_cls)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_prio type cgroup (rw,relatime,net_prio)

You must have something on your system which is seriously messing up your cgroups... This kind of setup is very much unsupported so you'll need to look at what's causing it.

A normal 22.04 system looks like:

stgraber@dakara:~$ grep cgroup /proc/mounts 
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0

stgraber commented 2 years ago

I'm still looking at the Manjaro case as that's a bit less extreme, though arguably still pretty wrong.

stgraber commented 2 years ago

I've sent a branch which now detects this situation and fixes LXD's own detection of this behavior, effectively ignoring the invalid CGroupV1 mounts.

Though this does quiet the LXD error and now shows a clear warning that this isn't a supported setup, some of those combinations have still been causing container startup errors for me.

This may be a kernel restriction to prevent such an invalid setup from being combined with a cgroup namespace: the cgroup1 mounts over-mount the cgroup2 root, so mounting a new clean cgroup2 tree would un-mask this mount, potentially exposing data. That's a pattern we've seen in other places.

In any case, the answer is pretty much the same in all cases here. Mounting a cgroup1 tree on top of a cgroup2 tree is not a valid setup. This is an abuse of cgroup2 (as a cgroup is technically created just to be over-mounted) and most likely is the result of some random script not detecting that /sys/fs/cgroup is a cgroup2 tree rather than a tmpfs as it is on cgroup1.

I'd recommend looking for scripts on your system which would be performing that cgroup1 mount and getting rid of it or if it came from your distro, reporting it to your distro.

In @ja-softdevel's case, this looks a lot like the result of having cgroupfs-mount or cgroup-lite installed on your system. If you have either of those, remove that package and reboot your system; that should take care of it.
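One way to check for those two packages is sketched below, assuming a dpkg-based system such as Ubuntu. find_cgroup_mount_pkgs is a hypothetical helper name; it only filters a package list for the two names mentioned above and keeps entries whose dpkg state is "ii" (installed).

```shell
#!/bin/sh
# Reads "<status> <package>" lines on stdin and prints the installed
# packages among the two named above (cgroupfs-mount, cgroup-lite).
find_cgroup_mount_pkgs() {
    awk '$1 == "ii" && ($2 == "cgroupfs-mount" || $2 == "cgroup-lite") { print $2 }'
}

# Assumption: dpkg-query is available. Skip quietly on other systems.
if command -v dpkg-query >/dev/null 2>&1; then
    dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' | find_cgroup_mount_pkgs
fi
```

No output means neither package is installed, in which case the extra cgroup1 mounts are coming from somewhere else.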

bet4it commented 2 years ago

It still doesn't work with the git version of LXD. I now get this error: https://discuss.linuxcontainers.org/t/lxd-5-0-on-ubuntu-22-04-lts-fails-to-start-22-04-20-04-containers/13805

stgraber commented 2 years ago

Yeah, that's what I referred to above. As mentioned, the main "fix" here is that LXD now detects this situation and logs a clear warning that it's an unsupported configuration. The code change makes it more likely for LXD to handle it which is why the cgroup error is gone, but things can/will still fail further down the road.

The real fix is for you to unmount anything that's over-mounting /sys/fs/cgroup.

In your case, umount /sys/fs/cgroup/net_cls should fix it.
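For systems with more than one such mount, the following sketch lists every cgroup1 hierarchy mounted below a cgroup2 /sys/fs/cgroup and prints the matching umount command without running it. suggest_cgroup1_umounts is a hypothetical helper name, and the input is assumed to be in /proc/mounts format (device, mount point, filesystem type, options).

```shell
#!/bin/sh
# Reads /proc/mounts-style lines on stdin and prints one "umount" command per
# cgroup1 hierarchy mounted below /sys/fs/cgroup, but only when
# /sys/fs/cgroup itself is a cgroup2 mount.
suggest_cgroup1_umounts() {
    awk '
        $2 == "/sys/fs/cgroup" && $3 == "cgroup2" { unified = 1 }
        $3 == "cgroup" && index($2, "/sys/fs/cgroup/") == 1 { v1[++n] = $2 }
        END {
            if (unified)
                for (i = 1; i <= n; i++) print "umount " v1[i]
        }
    '
}

# Dry run against the live system; nothing is unmounted here.
if [ -r /proc/mounts ]; then
    suggest_cgroup1_umounts < /proc/mounts
fi
```

The commands are only printed so they can be reviewed first. Unmounting does not remove whatever created the mounts, so they may come back on the next boot.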

Miosame commented 2 years ago

I'm getting the same issue on regular Arch Linux, and unmounting net_cls doesn't fix it. Is there a way to determine what "over-mounts" /sys/fs/cgroup? Note that I changed from regular LXC to LXD and I am running libvirt/Docker on this host. Is any of this a clue as to what might be wrong?

Edit: running from snap works, so the Arch package is possibly behind.

ja-softdevel commented 2 years ago

This ended up resolving my issues.

$ echo 'GRUB_CMDLINE_LINUX=systemd.unified_cgroup_hierarchy=false' > /etc/default/grub.d/cgroup.cfg
$ update-grub

I found it by reading this issue: https://github.com/systemd/systemd/issues/13477#issuecomment-528113009
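After rebooting, the active layout can be confirmed from the filesystem type of /sys/fs/cgroup. This is a minimal sketch assuming GNU coreutils stat; describe_cgroup_root is a hypothetical helper name.

```shell
#!/bin/sh
# Maps the filesystem type of /sys/fs/cgroup to a cgroup layout name.
describe_cgroup_root() {
    case "$1" in
        cgroup2fs) echo "unified (cgroup2)" ;;
        tmpfs)     echo "legacy or hybrid (cgroup1)" ;;
        *)         echo "unknown: $1" ;;
    esac
}

# Assumption: GNU stat, where -f reports on the filesystem and %T is its type.
if [ -d /sys/fs/cgroup ]; then
    describe_cgroup_root "$(stat -fc %T /sys/fs/cgroup)"
fi
```

With systemd.unified_cgroup_hierarchy=false in effect, /sys/fs/cgroup is expected to be a tmpfs rather than a cgroup2 mount.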