canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

"unable to create quota group: File exists" when creating new container #11210

Closed broizter closed 1 year ago

broizter commented 1 year ago

Issue description

Trying to create new containers fails with this message: Error: Failed instance creation: Failed creating instance from image: Failed to run: btrfs qgroup create 0/104960 /var/lib/lxd/storage-pools/default/images/f161c60ebcfe5806986bcfef748df7cf23bf7eb39eb5b7c130d0e5aa5371522f: exit status 1 (ERROR: unable to create quota group: File exists). My guess is that it started after upgrading btrfs-progs to 6.0.1. Upgrading to the latest version (6.0.2) did not fix the issue.

Steps to reproduce

  1. Use BTRFS as the storage backend
  2. Create a container

broizter commented 1 year ago

My lxd.log is also completely spammed with "Failed to get disk stats". Not sure if related.

mihalicyn commented 1 year ago

Most probably this error comes from this place in the kernel: https://github.com/torvalds/linux/blob/f10b439638e2482a89a1a402941207f6d8791ff8/fs/btrfs/qgroup.c#L1602 while the ioctl is called from this place in btrfs-progs: https://github.com/kdave/btrfs-progs/blob/441d01556873385d55fd4940f50ee7ae1fcfb13d/cmds/qgroup.c#L1762

Please show the output of btrfs qgroup show -e -f --raw /var/lib/lxd/storage-pools/default/images/f161c60ebcfe5806986bcfef748df7cf23bf7eb39eb5b7c130d0e5aa5371522f and btrfs subvolume show /var/lib/lxd/storage-pools/default/images/f161c60ebcfe5806986bcfef748df7cf23bf7eb39eb5b7c130d0e5aa5371522f.

broizter commented 1 year ago

It seems like they don't exist. Very strange.

niklas@tank ~
❯ sudo btrfs qgroup show -e -f --raw /var/lib/lxd/storage-pools/default/images/f161c60ebcfe5806986bcfef748df7cf23bf7eb39eb5b7c130d0e5aa5371522f
ERROR: cannot access '/var/lib/lxd/storage-pools/default/images/f161c60ebcfe5806986bcfef748df7cf23bf7eb39eb5b7c130d0e5aa5371522f': No such file or directory

niklas@tank ~
❯ sudo btrfs subvolume show /var/lib/lxd/storage-pools/default/images/f161c60ebcfe5806986bcfef748df7cf23bf7eb39eb5b7c130d0e5aa5371522f
ERROR: cannot find real path for '/var/lib/lxd/storage-pools/default/images/f161c60ebcfe5806986bcfef748df7cf23bf7eb39eb5b7c130d0e5aa5371522f': No such file or directory

# I tried to make a new container at this point just to double check. Same issue though.

niklas@tank ~
❯ lxc launch images:archlinux testcontainer
Creating testcontainer
Error: Failed instance creation: Failed creating instance from image: Failed to run: btrfs qgroup create 0/105380 /var/lib/lxd/storage-pools/default/images/85e1313aac5ec2fe801377b72e06fb1843421a67e58921f9239bca8c56711537: exit status 1 (ERROR: unable to create quota group: File exists)

niklas@tank ~
❯ sudo btrfs qgroup show -e -f --raw /var/lib/lxd/storage-pools/default/images/85e1313aac5ec2fe801377b72e06fb1843421a67e58921f9239bca8c56711537
ERROR: cannot access '/var/lib/lxd/storage-pools/default/images/85e1313aac5ec2fe801377b72e06fb1843421a67e58921f9239bca8c56711537': No such file or directory

niklas@tank ~
❯ sudo btrfs subvolume show /var/lib/lxd/storage-pools/default/images/85e1313aac5ec2fe801377b72e06fb1843421a67e58921f9239bca8c56711537
ERROR: cannot find real path for '/var/lib/lxd/storage-pools/default/images/85e1313aac5ec2fe801377b72e06fb1843421a67e58921f9239bca8c56711537': No such file or directory

mihalicyn commented 1 year ago

Can you see anything in ls -la /var/lib/lxd/storage-pools/default/images?

broizter commented 1 year ago

It's empty.

niklas@tank ~
❯ sudo ls -la /var/lib/lxd/storage-pools/default/images
total 0
drwx--x--x 1 root root   0  9 dec 12.52 .
drwxr-xr-x 1 root root 214 26 sep 11.25 ..

mihalicyn commented 1 year ago

Ah, so that means the subvolume was deleted on the error path after the error occurred.

broizter commented 1 year ago

I guess so. I should mention that I have around 10 containers running without an issue though. The difference is that they were created before this error started occurring, which I think happened after upgrading to btrfs-progs 6.0.1.

mihalicyn commented 1 year ago

Hmm, are these containers on the same node? Which storage backend are you using? It's strange that you have an empty /var/lib/lxd/storage-pools/default/images in this case.

broizter commented 1 year ago

Same node. I'm using the btrfs storage backend. The duplicate names in the output below are snapshots.

niklas@tank ~
❯ lxc storage info default
info:
  description: ""
  driver: btrfs
  name: default
  space used: 129.58GiB
  total space: 464.69GiB
used by:
  instances:
  - apps
  - apps
  - apps
  - apps
  - apps
  - apps
  - apps
  - apps
  - apps
  - apps
  - apps
  - apps
  - apps
  - apps
  - apps
  - compiler
  - compiler
  - compiler
  - compiler
  - compiler
  - compiler
  - compiler
  - compiler
  - compiler
  - compiler
  - compiler
  - compiler
  - compiler
  - compiler
  - compiler
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - minecraft
  - mqtt
  - mqtt
  - mqtt
  - mqtt
  - mqtt
  - mqtt
  - mqtt
  - mqtt
  - mqtt
  - mqtt
  - mqtt
  - mqtt
  - mqtt
  - mqtt
  - mqtt
  - plex
  - plex
  - plex
  - plex
  - plex
  - plex
  - plex
  - plex
  - plex
  - plex
  - plex
  - plex
  - plex
  - plex
  - plex
  - samba
  - samba
  - samba
  - samba
  - samba
  - samba
  - samba
  - samba
  - samba
  - samba
  - samba
  - samba
  - samba
  - samba
  - samba
  - webserver
  - webserver
  - webserver
  - webserver
  - webserver
  - webserver
  - webserver
  - webserver
  - webserver
  - webserver
  - webserver
  - webserver
  - webserver
  - webserver
  - webserver
  - youtube
  - youtube
  - youtube
  - youtube
  - youtube
  - youtube
  - youtube
  - youtube
  - youtube
  - youtube
  - youtube
  - youtube
  - youtube
  - youtube
  - youtube
  profiles:
  - default

mihalicyn commented 1 year ago

Please show the output of btrfs subvolume list, cat /proc/1/mountinfo, and lsblk.

broizter commented 1 year ago

btrfs subvolume list (I apologize in advance for the amount; each Docker container is one subvolume, and it gets multiplied in every snapshot)

cat /proc/1/mountinfo

22 29 0:20 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
23 29 0:21 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sys rw
24 29 0:5 / /dev rw,nosuid,relatime shared:2 - devtmpfs dev rw,size=8098836k,nr_inodes=2024709,mode=755,inode64
25 29 0:22 / /run rw,nosuid,nodev,relatime shared:12 - tmpfs run rw,mode=755,inode64
26 23 0:23 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime shared:7 - efivarfs efivarfs rw
29 1 0:25 /ROOT / rw,noatime shared:1 - btrfs /dev/mapper/root rw,compress=zstd:3,ssd,space_cache,user_subvol_rm_allowed,subvolid=256,subvol=/ROOT
27 23 0:6 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:8 - securityfs securityfs rw
28 24 0:24 / /dev/shm rw,nosuid,nodev shared:3 - tmpfs tmpfs rw,inode64
30 24 0:28 / /dev/pts rw,nosuid,noexec,relatime shared:4 - devpts devpts rw,gid=5,mode=620,ptmxmode=000
31 23 0:29 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:9 - cgroup2 cgroup2 rw
32 23 0:30 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:10 - pstore pstore rw
33 23 0:31 / /sys/fs/bpf rw,nosuid,nodev,noexec,relatime shared:11 - bpf bpf rw,mode=700
34 22 0:32 / /proc/sys/fs/binfmt_misc rw,relatime shared:13 - autofs systemd-1 rw,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=15926
35 24 0:19 / /dev/mqueue rw,nosuid,nodev,noexec,relatime shared:14 - mqueue mqueue rw
36 24 0:33 / /dev/hugepages rw,relatime shared:15 - hugetlbfs hugetlbfs rw,pagesize=2M
37 23 0:7 / /sys/kernel/debug rw,nosuid,nodev,noexec,relatime shared:16 - debugfs debugfs rw
38 23 0:12 / /sys/kernel/tracing rw,nosuid,nodev,noexec,relatime shared:17 - tracefs tracefs rw
58 25 0:34 / /run/credentials/systemd-sysctl.service ro,nosuid,nodev,noexec,relatime shared:18 - ramfs ramfs rw,mode=700
39 23 0:35 / /sys/kernel/config rw,nosuid,nodev,noexec,relatime shared:19 - configfs configfs rw
40 23 0:36 / /sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime shared:20 - fusectl fusectl rw
64 25 0:76 / /run/credentials/systemd-sysusers.service ro,nosuid,nodev,noexec,relatime shared:21 - ramfs ramfs rw,mode=700
66 25 0:77 / /run/credentials/systemd-tmpfiles-setup-dev.service ro,nosuid,nodev,noexec,relatime shared:22 - ramfs ramfs rw,mode=700
91 29 0:25 /CACHE /mnt/cache rw,noatime shared:45 - btrfs /dev/mapper/root rw,compress=zstd:3,ssd,space_cache,user_subvol_rm_allowed,subvolid=3479,subvol=/CACHE
88 29 0:79 / /tmp rw,nosuid,nodev shared:47 - tmpfs tmpfs rw,nr_inodes=1048576,inode64
96 29 259:1 / /boot rw,relatime shared:49 - vfat /dev/nvme0n1p1 rw,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro
107 29 0:84 / /mnt/cache1 rw,relatime shared:55 - btrfs /dev/mapper/nvmecache rw,compress=zstd:3,ssd,space_cache=v2,subvolid=5,subvol=/
155 25 0:93 / /run/credentials/systemd-tmpfiles-setup.service ro,nosuid,nodev,noexec,relatime shared:65 - ramfs ramfs rw,mode=700
343 29 0:101 / /var/lib/lxcfs rw,nosuid,nodev,relatime shared:85 - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
361 37 0:12 / /sys/kernel/debug/tracing rw,nosuid,nodev,noexec,relatime shared:101 - tracefs tracefs rw
452 29 0:103 / /var/lib/lxd/shmounts rw,relatime shared:222 - tmpfs tmpfs rw,size=100k,mode=711,inode64
463 29 0:104 / /var/lib/lxd/devlxd rw,relatime shared:246 - tmpfs tmpfs rw,size=100k,mode=755,inode64
474 29 0:25 /ROOT/var/lib/lxd/storage-pools/default /var/lib/lxd/storage-pools/default rw,noatime shared:1 - btrfs /dev/mapper/root rw,compress=zstd:3,ssd,space_cache,user_subvol_rm_allowed,subvolid=295,subvol=/ROOT/var/lib/lxd/storage-pools/default
531 34 0:125 / /proc/sys/fs/binfmt_misc rw,nosuid,nodev,noexec,relatime shared:257 - binfmt_misc binfmt_misc rw
1645 25 0:958 / /run/user/1000 rw,nosuid,nodev,relatime shared:223 - tmpfs tmpfs rw,size=1622528k,nr_inodes=405632,mode=700,uid=1000,gid=1000,inode64

lsblk

NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda             8:0    0 465.8G  0 disk
└─ssdcache    254:2    0 465.7G  0 crypt
zram0         253:0    0     4G  0 disk  [SWAP]
nvme0n1       259:0    0 931.5G  0 disk
├─nvme0n1p1   259:1    0   512M  0 part  /boot
├─nvme0n1p2   259:2    0 464.7G  0 part
│ └─root      254:0    0 464.7G  0 crypt /var/lib/lxd/storage-pools/default
│                                        /mnt/cache
│                                        /
└─nvme0n1p3   259:3    0 466.3G  0 part
  └─nvmecache 254:3    0 466.3G  0 crypt /mnt/cache1

mihalicyn commented 1 year ago

Likely the error comes from this place: https://github.com/lxc/lxd/blob/0e129cfcdf2b04c5f6143de9c79e82b8d65648a2/lxd/storage/backend_lxd.go#L3106

because the volume path contains "image" as a prefix, so it has drivers.VolumeTypeImage.

We can try to perform btrfs quota rescan /var/lib/lxd/storage-pools/default/images or even more globally btrfs quota rescan /var/lib/lxd/storage-pools/default.
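
For reference, a sketch of that rescan sequence (pool path as in this report; -s shows the status of a running rescan and -w waits for completion, both standard btrfs-progs flags):

sudo btrfs quota rescan /var/lib/lxd/storage-pools/default      # kick off a rescan
sudo btrfs quota rescan -s /var/lib/lxd/storage-pools/default   # show status of a running rescan
sudo btrfs quota rescan -w /var/lib/lxd/storage-pools/default   # start one and wait for it to finish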

mihalicyn commented 1 year ago

@broizter have you tried to rescan quotas?

broizter commented 1 year ago

> @broizter have you tried to rescan quotas?

Sorry for late reply! I ran the commands that you mentioned above and they both finished. Unfortunately it didn't change anything.

mihalicyn commented 1 year ago

Then I think we should try downgrading btrfs-progs to 6.0 according to the user experience from https://discuss.linuxcontainers.org/t/lxd-btrfs-archlinux-failed-instance-creation/15633

Then, if the issue disappears, we should report this regression to the btrfs folks.

mihalicyn commented 1 year ago

Suspicious commits in btrfs-progs: https://github.com/kdave/btrfs-progs/commit/dac73d6e2c68c7fb6955fb1e2121e35289e0ab61

... and this: https://github.com/kdave/btrfs-progs/commit/f486f0f01eb2afcca17e5acb1200e54347e948c8 https://github.com/kdave/btrfs-progs/commit/69b0d7756dd76c0a4a7304165a3d76de0e5170ad

This changes the output format of btrfs qgroup show. I think it may break our code in func (d *btrfs) getQGroup(path string) (string, int64, error) https://github.com/lxc/lxd/blob/master/lxd/storage/drivers/driver_btrfs_utils.go#L250

For instance, qgroupid becomes Qgroupid.

These commits are from v6.0.1

cc @tomponline @monstermunchkin
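
To illustrate the change, the header of btrfs qgroup show -e --raw shifted roughly like this between 6.0 and 6.0.1 (a sketch based on those commits and the outputs later in this thread; exact column widths may differ):

# btrfs-progs 6.0 and earlier
qgroupid         rfer         excl     max_excl
--------         ----         ----     --------
0/258           16384        16384         none

# btrfs-progs 6.0.1 and later
Qgroupid    Referenced    Exclusive  Max exclusive   Path
--------    ----------    ---------  -------------   ----
0/258            16384        16384           none   ROOT/var/lib/lxd/storage-pools/newbtrfs

A parser that matches the header text literally, as getQGroup does, can trip over both the capitalization and the new Path column.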

broizter commented 1 year ago

Downgrading to 6.0 made it possible to create containers again. /var/lib/lxd/storage-pools/default/images was still empty but after creating the new container there is now one directory there. It seems like the directories for containers created earlier are missing though.

mihalicyn commented 1 year ago

@broizter thank you for your report and your help with the experiments. I think we have found a root cause for this problem.

BigB84 commented 1 year ago

Facing the same issue on openSUSE Tumbleweed (LXD 5.9, btrfs-progs 6.0.2).

Is there any update on this?

I'm a little scared of downgrading btrfs-progs, as someone above mentioned data loss... (it also seems impossible to downgrade on openSUSE TW), but at the same time I need to create new containers.

This is a serious bug.

tomponline commented 1 year ago

Thanks @mihalicyn, will take a look. It looks like upstream changed something in their tooling.

13werwolf13 commented 1 year ago

I have the same problem:

[werwolf@work] ~  
❯ cat /etc/os-release 
NAME="openSUSE Tumbleweed"
# VERSION="20221217"
ID="opensuse-tumbleweed"
ID_LIKE="opensuse suse"
VERSION_ID="20221217"
PRETTY_NAME="openSUSE Tumbleweed"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:tumbleweed:20221217"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Tumbleweed"
LOGO="distributor-logo-Tumbleweed"

[werwolf@work] ~  
❯ rpm -qa | grep -i btrfspro
btrfsprogs-6.0.2-370.5.x86_64
btrfsprogs-udev-rules-6.0.2-370.5.noarch

[werwolf@work] ~  
❯ rpm -qa | grep -i lxd               
lxd-bash-completion-5.9-1.1.noarch
lxd-5.9-1.1.x86_64

simondeziel commented 1 year ago

FYI, @tomponline it was discussed to use a case insensitive comparison for Qgroupid but https://github.com/kdave/btrfs-progs/commit/69b0d7756dd76c0a4a7304165a3d76de0e5170ad also changed the --- string to a single -.
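
One way to check which tooling version, and hence which output format, a given host has (a sketch; point the second command at any btrfs path with quotas enabled):

btrfs --version                                  # e.g. btrfs-progs v6.0.2
sudo btrfs qgroup show -e --raw / | head -n 2    # header row: "qgroupid ..." (old) vs "Qgroupid ..." (new)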

broizter commented 1 year ago

Issue still exists on LXD 5.10 and btrfs-progs 6.1.2

lxc launch images:archlinux testcontainer
Creating testcontainer
Error: Failed instance creation: Failed creating instance from image: Failed to run: btrfs qgroup create 0/118103 /var/lib/lxd/storage-pools/default/images/b8a25949c295d0da6950277ebc240f867baa162e1c47ad007208a148e88e6489: exit status 1 (ERROR: unable to create quota group: File exists)

tomponline commented 1 year ago

How strange, I tested 5.9 as broken on Alpine, and the fix worked there. Perhaps it's changed again!

tomponline commented 1 year ago

Can you provide full reproducer steps from a fresh Arch install (including installing and setting up LXD)? That part didn't work for me when I tried it before following the Arch docs; it complained about missing btrfs sources while installing dependencies, which is why I switched to Alpine to test instead.

broizter commented 1 year ago

I just did a fresh Arch install in a VM and was unable to reproduce the issue, although I did some more testing and noted something interesting.

On my machine with the old Arch install, it's possible to "fix" the issue by using an .img file instead of an existing path with the BTRFS backend, but on the fresh Arch install both methods work.

Output from my machine with the old Arch install:

root@tank ~
❯ lxc storage create newbtrfs btrfs
Storage pool newbtrfs created

root@tank ~
❯ lxc storage ls
+----------+--------+------------------------------------+-------------+---------+---------+
|   NAME   | DRIVER |               SOURCE               | DESCRIPTION | USED BY |  STATE  |
+----------+--------+------------------------------------+-------------+---------+---------+
| default  | btrfs  | /var/lib/lxd/storage-pools/default |             | 128     | CREATED |
+----------+--------+------------------------------------+-------------+---------+---------+
| newbtrfs | btrfs  | /var/lib/lxd/disks/newbtrfs.img    |             | 0       | CREATED |
+----------+--------+------------------------------------+-------------+---------+---------+

root@tank ~
❯ lxc launch images:archlinux testcontainer --storage newbtrfs
Creating testcontainer
Starting testcontainer

(container started without issue)

root@tank ~
❯ lxc stop testcontainer

root@tank ~
❯ lxc rm testcontainer

root@tank ~
❯ lxc storage rm newbtrfs
Storage pool newbtrfs deleted

root@tank ~
❯ lxc storage create newbtrfs btrfs source=/var/lib/lxd/storage-pools/newbtrfs
Storage pool newbtrfs created

root@tank ~
❯ lxc storage ls
+----------+--------+-------------------------------------+-------------+---------+---------+
|   NAME   | DRIVER |               SOURCE                | DESCRIPTION | USED BY |  STATE  |
+----------+--------+-------------------------------------+-------------+---------+---------+
| default  | btrfs  | /var/lib/lxd/storage-pools/default  |             | 128     | CREATED |
+----------+--------+-------------------------------------+-------------+---------+---------+
| newbtrfs | btrfs  | /var/lib/lxd/storage-pools/newbtrfs |             | 0       | CREATED |
+----------+--------+-------------------------------------+-------------+---------+---------+

root@tank ~
❯ lxc launch images:archlinux testcontainer --storage newbtrfs
Creating testcontainer
Error: Failed instance creation: Failed creating instance from image: Failed to run: btrfs qgroup create 0/118390 /var/lib/lxd/storage-pools/newbtrfs/images/e02e5aaa2325820ef2d88607de0af1d2aaf41d767fbc6b61ab1e52876e5b97b8: exit status 1 (ERROR: unable to create quota group: File exists)

Also, "/var/lib/lxd/storage-pools/default/images/" is empty even though I have at least one image. Not sure if that's of any interest.

root@tank ~
❯ lxc image ls
+-------+--------------+--------+------------------------------------------+--------------+-----------+----------+-------------------------------+
| ALIAS | FINGERPRINT  | PUBLIC |               DESCRIPTION                | ARCHITECTURE |   TYPE    |   SIZE   |          UPLOAD DATE          |
+-------+--------------+--------+------------------------------------------+--------------+-----------+----------+-------------------------------+
|       | 5f46674de4b6 | no     | Archlinux current amd64 (20230115_04:19) | x86_64       | CONTAINER | 177.56MB | Jan 15, 2023 at 11:23pm (UTC) |
+-------+--------------+--------+------------------------------------------+--------------+-----------+----------+-------------------------------+
root@tank ~
❯ ls -la /var/lib/lxd/storage-pools/default/images
total 0
drwx--x--x 1 root root   0 16 jan 00.19 ./
drwxr-xr-x 1 root root 214 26 sep 11.25 ../

root@tank ~
❯ lxc storage ls
+---------+--------+------------------------------------+-------------+---------+---------+
|  NAME   | DRIVER |               SOURCE               | DESCRIPTION | USED BY |  STATE  |
+---------+--------+------------------------------------+-------------+---------+---------+
| default | btrfs  | /var/lib/lxd/storage-pools/default |             | 128     | CREATED |
+---------+--------+------------------------------------+-------------+---------+---------+

tomponline commented 1 year ago

OK good, so it works on new storage pools then, at least I'm not going mad :)

On the affected system can you show output of:

sudo btrfs qgroup show -e -f --raw /var/lib/lxd/storage-pools/newbtrfs

tomponline commented 1 year ago

Please can you show lxc storage show newbtrfs?

Also, are you saying that /var/lib/lxd/storage-pools is a single BTRFS device shared with the host OS?

broizter commented 1 year ago

/var/lib/lxd/storage-pools is a BTRFS device shared with the host OS, yes. If you use BTRFS on your root partition you get this option during lxd init: "Would you like to create a new btrfs subvolume under /var/lib/lxd? (yes/no) [default=yes]", so that's what I'm using.

root@tank ~
❯ btrfs qgroup show -e -f --raw /var/lib/lxd/storage-pools/newbtrfs
Qgroupid    Referenced    Exclusive  Max exclusive   Path
--------    ----------    ---------  -------------   ----
0/119063         16384        16384           none   ROOT/var/lib/lxd/storage-pools/newbtrfs

root@tank ~
❯ lxc storage show newbtrfs
config:
  source: /var/lib/lxd/storage-pools/newbtrfs
  volatile.initial_source: /var/lib/lxd/storage-pools/newbtrfs
description: ""
name: newbtrfs
driver: btrfs
used_by: []
status: Created
locations:
- none

root@tank ~
❯ btrfs subvolume list /
ID 256 gen 1432480 top level 5 path ROOT
ID 292 gen 18 top level 256 path var/lib/portables
ID 293 gen 19 top level 256 path var/lib/machines
ID 295 gen 1431489 top level 256 path var/lib/lxd/storage-pools/default
ID 119063 gen 1432480 top level 256 path var/lib/lxd/storage-pools/newbtrfs
(etc etc)

tomponline commented 1 year ago

Can you include in the btrfs subvolume list / output the offending image volume /var/lib/lxd/storage-pools/newbtrfs/images/e02e5aaa2325820ef2d88607de0af1d2aaf41d767fbc6b61ab1e52876e5b97b8, if it still exists?

broizter commented 1 year ago

It does not exist.

root@tank ~
❯ btrfs subvolume list / | grep newbtrfs
ID 119063 gen 1432496 top level 256 path var/lib/lxd/storage-pools/newbtrfs

root@tank ~
❯ ls -la /var/lib/lxd/storage-pools/newbtrfs/images
total 0
drwx--x--x 1 root root   0 16 jan 11.19 ./
drwxr-xr-x 1 root root 214 16 jan 11.10 ../

Here's how it looks on the fresh Arch install where everything is functional.

root@archlinux ~# btrfs qgroup show -e -f --raw /var/lib/lxd/storage-pools/default
ERROR: can't list qgroups: quotas not enabled

root@archlinux ~# btrfs subvolume list / | grep default
ID 267 gen 110 top level 256 path var/lib/lxd/storage-pools/default
ID 269 gen 115 top level 267 path var/lib/lxd/storage-pools/default/containers/testcontainer
ID 271 gen 110 top level 267 path var/lib/lxd/storage-pools/default/images/a160d9f01130bfbd2e29eae8596c52fc8dc75b3219178884eee41fb86c301804

root@archlinux ~# ls -la /var/lib/lxd/storage-pools/default/images/
total 0
drwx--x--x 1 root root 128 Jan 16 11:14 ./
drwxr-xr-x 1 root root 214 Jan 14 12:27 ../
drwx--x--x 1 root root  56 Jan 16 11:13 a160d9f01130bfbd2e29eae8596c52fc8dc75b3219178884eee41fb86c301804/

Interesting that it gives me "ERROR: can't list qgroups: quotas not enabled" on the machine where everything works correctly.

13werwolf13 commented 1 year ago

Any news about this issue? I don't mean to push, but this is a pretty serious problem that has made lxd unusable without rebuilding the entire cluster from scratch.

tomponline commented 1 year ago

Can you clarify what is working and what isn't, with reproducer steps for a fresh system for the not working scenario?

broizter commented 1 year ago

Oh, has the cause not been found yet? I guess I should rebuild my LXD storage then, since that seems to be a workaround. Not being able to create new containers for many months now has been a bit annoying.

tomponline commented 1 year ago

The cause was considered fixed in https://github.com/lxc/lxd/pull/11252

But it's not clear whether it's fully resolved, or in what situations it is still broken.

tomponline commented 1 year ago

It has been caused by an upstream change to BTRFS tooling, and LXD has had to update its parsing of the command output to accommodate both old and new versions of the BTRFS tooling.

broizter commented 1 year ago

I understand! It's a very strange issue. I was unable to reproduce it with a fresh OS install. For now I will migrate over to using an ".img" based BTRFS backend instead of "existing path" so that I can create containers again.

13werwolf13 commented 1 year ago

> Can you clarify what is working and what isn't, with reproducer steps for a fresh system for the not working scenario?

I can't reproduce the problem on new installations, but it affected all existing ones, and it's completely unclear to me how it can be fixed. Below I will give everything that I think could be useful; if you need something else, ask me and I will add it.

my home lab & test server

[werwolf@power] ~  
❯ cat /etc/os-release 
NAME="openSUSE Tumbleweed"
# VERSION="20230128"
ID="opensuse-tumbleweed"
ID_LIKE="opensuse suse"
VERSION_ID="20230128"
PRETTY_NAME="openSUSE Tumbleweed"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:tumbleweed:20230128"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Tumbleweed"
LOGO="distributor-logo-Tumbleweed"

[werwolf@power] ~  
❯ inxi -Fxxx            
System:
  Host: power Kernel: 6.1.8-1-default arch: x86_64 bits: 64 compiler: gcc
    v: 12.2.1 Desktop: N/A wm: KWin dm: SDDM Distro: openSUSE Tumbleweed
    20230128
Machine:
  Type: Server System: FUJITSU product: PRIMERGY TX150 S7 v: GS01
    serial: <superuser required> Chassis: type: 17 v: TX150S7FS
    serial: <superuser required>
  Mobo: FUJITSU model: D2759 v: S26361-D2759-A13 WGS04 GS02
    serial: <superuser required> BIOS: FUJITSU // Phoenix
    v: 6.00 Rev. 1.21.2759.A1 date: 07/11/2018
CPU:
  Info: quad core model: Intel Xeon X3430 bits: 64 type: MCP
    smt: <unsupported> arch: Nehalem rev: 5 cache: L1: 256 KiB L2: 1024 KiB
    L3: 8 MiB
  Speed (MHz): avg: 2527 min/max: 1197/2395 boost: enabled cores: 1: 2527
    2: 2527 3: 2527 4: 2527 bogomips: 19150
  Flags: ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: Matrox Systems MGA G200e [Pilot] ServerEngines
    vendor: Fujitsu Solutions driver: mgag200 v: kernel pcie: speed: 2.5 GT/s
    lanes: 1 ports: active: VGA-1 empty: none bus-ID: 14:00.0
    chip-ID: 102b:0522 class-ID: 0300
  Display: unspecified server: X.Org v: 21.1.6 with: Xwayland v: 22.1.7
    driver: X: loaded: N/A unloaded: nvidia gpu: mgag200 note: X driver n/a
    display-ID: localhost:10.0 screens: 1
  Screen-1: 0 s-res: 3840x1080 s-dpi: 96 s-size: 1016x285mm (40.00x11.22")
    s-diag: 1055mm (41.54")
  Monitor-1: DisplayPort-0 pos: right res: 1920x1080 hz: 60 dpi: 93
    size: 527x296mm (20.75x11.65") diag: 604mm (23.8") modes: N/A
  Monitor-2: HDMI-A-0 pos: primary,left res: 1920x1080 hz: 60 dpi: 93
    size: 527x296mm (20.75x11.65") diag: 604mm (23.8") modes: N/A
  API: OpenGL v: 4.5 Mesa 22.3.4 renderer: llvmpipe (LLVM 15.0.7 128 bits)
    direct render: Yes
Audio:
  Message: No device data found.
  Sound Server-1: PulseAudio v: 16.1 running: no
  Sound Server-2: PipeWire v: 0.3.65 running: no
Network:
  Device-1: Intel 82571EB/82571GB Gigabit Ethernet driver: e1000e v: kernel
    pcie: speed: 2.5 GT/s lanes: 4 port: 3000 bus-ID: 11:00.0 chip-ID: 8086:10bc
    class-ID: 0200
  IF: eth0 state: up speed: 100 Mbps duplex: full mac: 00:15:17:e8:e2:39
  Device-2: Intel 82571EB/82571GB Gigabit Ethernet driver: e1000e v: kernel
    pcie: speed: 2.5 GT/s lanes: 4 port: 3020 bus-ID: 11:00.1 chip-ID: 8086:10bc
    class-ID: 0200
  IF: eth1 state: down mac: 00:15:17:e8:e2:38
  Device-3: Intel 82571EB/82571GB Gigabit Ethernet driver: e1000e v: kernel
    pcie: speed: 2.5 GT/s lanes: 4 port: 4000 bus-ID: 12:00.0 chip-ID: 8086:10bc
    class-ID: 0200
  IF: eth2 state: down mac: 00:15:17:e8:e2:3b
  Device-4: Intel 82571EB/82571GB Gigabit Ethernet driver: e1000e v: kernel
    pcie: speed: 2.5 GT/s lanes: 4 port: 4020 bus-ID: 12:00.1 chip-ID: 8086:10bc
    class-ID: 0200
  IF: eth3 state: up speed: 1000 Mbps duplex: full mac: 00:15:17:e8:e2:3a
  Device-5: Intel 82574L Gigabit Network vendor: Fujitsu Solutions
    driver: e1000e v: kernel pcie: speed: 2.5 GT/s lanes: 1 port: 5000
    bus-ID: 13:00.0 chip-ID: 8086:10d3 class-ID: 0200
  IF: ens0 state: up speed: 1000 Mbps duplex: full mac: 00:19:99:b8:5b:f4
  IF-ID-1: br0 state: up speed: 10000 Mbps duplex: unknown
    mac: 82:38:4c:7f:80:db
  IF-ID-2: veth120540aa state: up speed: 10000 Mbps duplex: full
    mac: b6:49:74:6c:24:e6
  IF-ID-3: wg0 state: unknown speed: N/A duplex: N/A mac: N/A
  IF-ID-4: ygg0 state: unknown speed: 10 Mbps duplex: full mac: N/A
  IF-ID-5: zt0 state: unknown speed: 10 Mbps duplex: full
    mac: ce:37:b4:f9:bb:93
Drives:
  Local Storage: total: 13.2 TiB used: 12.54 TiB (95.0%)
  ID-1: /dev/sda vendor: Western Digital model: WUH721414ALE604
    size: 12.73 TiB speed: 3.0 Gb/s type: HDD rpm: 7200 serial: QGKDSLUT
    rev: W110 scheme: GPT
  ID-2: /dev/sdb vendor: Micron model: 1100 MTFDDAK512TBN size: 476.94 GiB
    speed: 3.0 Gb/s type: SSD serial: 163814327E3A rev: U001 scheme: GPT
Partition:
  ID-1: / size: 440 GiB used: 280.28 GiB (63.7%) fs: btrfs dev: /dev/sdb2
  ID-2: /home size: 12.73 TiB used: 12.27 TiB (96.3%) fs: btrfs
    dev: /dev/sda1
Swap:
  ID-1: swap-1 type: partition size: 35.9 GiB used: 900.5 MiB (2.4%)
    priority: -2 dev: /dev/sdb3
Sensors:
  System Temperatures: cpu: 60.0 C mobo: N/A
  Fan Speeds (RPM): N/A
Info:
  Processes: 275 Uptime: 2h 1m wakeups: 3 Memory: 31.34 GiB
  used: 9.46 GiB (30.2%) Init: systemd v: 252 target: multi-user (3)
  default: multi-user Compilers: gcc: 12.2.1 alt: 11/12/13 Packages: pm: rpm
  pkgs: N/A note: see --rpm Shell: Zsh v: 5.9 running-in: sshd (SSH)
  inxi: 3.3.23

[werwolf@power] ~  
❯ rpm -qa | grep -E 'btrfs|lxd'
lxd-5.9-2.1.x86_64
btrfsprogs-udev-rules-6.1.3-375.1.noarch
libbtrfs0-6.1.3-375.1.x86_64
lxd-bash-completion-5.9-2.1.noarch
libbd_btrfs2-2.28-1.1.x86_64
libudisks2-0_btrfs-2.9.4-6.1.x86_64
btrfsprogs-6.1.3-375.1.x86_64
btrfsmaintenance-0.5-67.64.noarch

[werwolf@power] ~  
❯ lxc launch images:almalinux/8 oo     
Creating oo
Error: Failed instance creation: Failed creating instance from image: Failed to run: btrfs qgroup create 0/3619 /var/lib/lxd/storage-pools/local/images/bb49e749b7a31656a0c1e4c6ac4e407a2d600bbb0f0beb79ad02fe4eb5fd0253: exit status 1 (ERROR: unable to create quota group: File exists)

[werwolf@power] ~  
❯ sudo ls -lah /var/lib/lxd/storage-pools/local/images/
[sudo] password for root:
total 0
drwx--x--x 1 root root   0 Feb  3 16:20 .
drwxr-xr-x 1 root root 214 Jan 19 04:02 ..

[werwolf@power] ~  
❯ sudo btrfs qgroup show -e -f --raw /var/lib/lxd/storage-pools/     
Qgroupid    Referenced    Exclusive  Max exclusive   Path 
--------    ----------    ---------  -------------   ---- 
0/258     260327317504   3302572032           none   @/.snapshots/1/snapshot

broizter commented 1 year ago

Same experience here. It affects existing installations, but I can't reproduce it on fresh installs. The "fix" is to create a new BTRFS storage backend and migrate to that one instead. You have to omit the "source=" argument when creating the new backend, otherwise you will face the same issue; it needs to create an ".img" file.
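
A sketch of that migration path (assuming a recent LXD where lxc move accepts --storage to relocate an instance between pools; the instance name is illustrative):

lxc storage create newbtrfs btrfs          # no source= given, so LXD creates a loop-backed .img
lxc stop mycontainer
lxc move mycontainer --storage newbtrfs    # relocate the instance onto the new pool
lxc start mycontainer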

tomponline commented 1 year ago

> You have to omit the "source=" argument when creating the new backend

Do you still see the same issue without using source inside a fresh machine (a VM for instance)?

broizter commented 1 year ago

No, I'm unable to reproduce the issue on a fresh machine. Using "source" works without issue on fresh machines, but on existing installs it causes errors when launching containers, for example.

tomponline commented 1 year ago

Right, makes sense, thanks. So it looks like the BTRFS quotas have gotten into a mess on the existing systems. I'll see if we can fix that somehow.

broizter commented 1 year ago

I mentioned it in an earlier post, but the difference between my existing install and a fresh one is that btrfs qgroup show -e -f --raw /var/lib/lxd/storage-pools/default lists the quota groups on the broken existing install, but if you run the same command on the fresh install where everything works it will instead give this error ERROR: can't list qgroups: quotas not enabled.

So basically existing broken install = quotas are enabled. Fresh working install = quotas are not enabled.

tomponline commented 1 year ago

Yeah, I saw that, but it doesn't really make any sense to me. Do quotas actually work on the "fixed" systems?

broizter commented 1 year ago

Not sure, how do I test that?

tomponline commented 1 year ago

Set a low quota and try to fill it up.
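
For example, a minimal sketch of such a test (instance name and sizes are illustrative; the override assumes the root disk is inherited from a profile):

lxc config device override testcontainer root size=100MiB
lxc exec testcontainer -- dd if=/dev/zero of=/root/fill bs=1M count=200
# with working quotas, dd should fail with "No space left on device" well before writing 200MiB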

broizter commented 1 year ago

I will test this and come back when I have some more time.

An additional note, and I'm sorry if this is confusing: on my broken existing install I created a new storage backend without using "source", so it creates an .img loop device instead, and when I run btrfs qgroup show -e -f --raw /var/lib/lxd/storage-pools/newbtrfs/ it also gives me ERROR: can't list qgroups: quotas not enabled, even though everything now works perfectly. At least as far as I can tell.

So the common thing between the fresh working install (where both "source" and ".img" work) and the broken install (where only ".img" works) is that quotas are reported as not enabled on the working storage backends.

TL;DR: if BTRFS says quotas are enabled, you get errors; if BTRFS says quotas are not enabled, everything works fine. On fresh installs quotas are not enabled, but on existing installs they are.

broizter commented 1 year ago

Are BTRFS quotas supposed to be enabled by LXD? It almost seems like there is a bug that prevents them from being enabled on newly created storage pools.

michpolicht commented 1 year ago

Same problem on openSUSE Tumbleweed. I disabled quotas on the subvolume holding the existing storage pool using btrfs quota disable /var/lib/lxd/storage-pools/default and I am now able to create containers.
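
For anyone applying the same workaround, the disable step plus a quick verification (pool path as used throughout this thread):

sudo btrfs quota disable /var/lib/lxd/storage-pools/default
sudo btrfs qgroup show -e -f --raw /var/lib/lxd/storage-pools/default
# expected after disabling: ERROR: can't list qgroups: quotas not enabled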

broizter commented 1 year ago

Thanks for the tip, that does indeed "fix" the issue. I will use that as a workaround for now instead of migrating all my containers to a new storage pool.