canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

idmap accesses wrong path #10191

Closed by debug-richard 2 years ago

debug-richard commented 2 years ago

Issue description

A storage pool got full, so I created a new one on another disk formatted as btrfs. So there are two storage pools on two different BTRFS file systems.

lxc storage create data-nvme btrfs source=/media/data-nvme/lxd/
lxc stop old_container
lxc move old_container tmp_container -s data-nvme
lxc move tmp_container old_container
lxc start old_container

That worked fine, but then I noticed that one of the containers that shares a directory with the host no longer has access permissions.

I compared the configuration with the other containers and found that the container had the "raw.idmap" configuration set.

After the migration, the setting was "raw.idmap: both 100000 100000" (and I'm not sure if lxd changed this during the migration). The shared directory is actually another disk that was mounted as read-only, so this has not changed.

So I tried changing "raw.idmap" to "both 1000 1000" to match the host/container permissions.
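
For readers unfamiliar with the setting: a raw.idmap entry has the form "<type> <host ID> <container ID>", where the type is uid, gid, or both. The following is a minimal sketch, not part of LXD, that spells out what a single-ID entry means. The function name describe_idmap is hypothetical, and ID ranges are deliberately not handled.

```shell
#!/bin/sh
# describe_idmap is a hypothetical helper, not an LXD command.
# It explains one raw.idmap entry of the form "<type> <host id> <container id>".
describe_idmap() {
    # Split the entry into its whitespace-separated fields.
    set -- $1
    if [ "$#" -ne 3 ]; then
        echo "expected: <uid|gid|both> <host id> <container id>" >&2
        return 1
    fi
    case "$1" in
        uid)  kind="uid" ;;
        gid)  kind="gid" ;;
        both) kind="uid and gid" ;;
        *)    echo "unknown type: $1" >&2; return 1 ;;
    esac
    echo "host $kind $2 -> container $kind $3"
}

describe_idmap "both 1000 1000"
```

Read this way, "both 1000 1000" makes host user and group 1000 appear as user and group 1000 inside the container, which is what the shared directory needed.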

Now the container refuses to start with the error message: Error: Failed to handle idmapped storage: invalid argument - Failed to change ACLs on /var/snap/lxd/common/lxd/storage-pools/data-nvme/containers/old_container/rootfs/var/log/journal

The problem is that the storage pool is located at /media/data-nvme/lxd/, but lxd tries to access /var/snap/lxd/common/lxd/storage-pools/data-nvme/, which exists but is empty.

So I removed the empty directory and replaced it with a symlink to /media/data-nvme/lxd/, which fixed the problem.

I suspect the problem is one of the following:

  1. The idmap function accesses the wrong path
  2. LXD has not created the symlink correctly

Steps to reproduce

See above.

tomponline commented 2 years ago

Please can you remove the symlink you created and then show the output of:

sudo nsenter --mount=/run/snapd/ns/lxd.mnt -- ls -la /var/snap/lxd/common/lxd/storage-pools/data-nvme/containers/old_container/rootfs/var/log/journal

tomponline commented 2 years ago

This sounds similar to https://discuss.linuxcontainers.org/t/cannot-start-a-copied-container-failed-to-change-acls/7982

debug-richard commented 2 years ago

The path shows nothing after deleting the symlink.

If I create a new pool for testing, this is the result:

lxc storage create test btrfs source=/media/data-nvme/lxdtest/

sudo nsenter --mount=/run/snapd/ns/lxd.mnt -- ls -la /var/snap/lxd/common/lxd/storage-pools/test
total 0
drwx--x--x 1 root root  0 Apr  6 12:48 .
drwx--x--x 1 root root 50 Apr  6 12:48 ..

ls -la /media/data-nvme/lxdtest/
total 16
drwxr-xr-x 1 root  root  200 Apr  6 12:48 .
drwxr-xr-x 1 build build 122 Jun 28  2021 ..
drwx--x--x 1 root  root    0 Apr  6 12:48 containers
drwx--x--x 1 root  root    0 Apr  6 12:48 containers-snapshots
drwx--x--x 1 root  root    0 Apr  6 12:48 custom
drwx--x--x 1 root  root    0 Apr  6 12:48 custom-snapshots
drwx--x--x 1 root  root    0 Apr  6 12:48 images
drwx--x--x 1 root  root    0 Apr  6 12:48 virtual-machines
drwx--x--x 1 root  root    0 Apr  6 12:48 virtual-machines-snapshots

It seems that lxd does not mount/link/see the storage pool.

Regarding the mentioned discussion, I checked my history and also deleted the journal as a last step. So this could be related to that.

tomponline commented 2 years ago

Do you get the same error if you delete the journal?

debug-richard commented 2 years ago

I have now deleted the symlink and recreated the (empty) /var/snap/lxd/common/lxd/storage-pools/data-nvme directory. If I start the container I get this error:

lxc start mycontainer 
Error: Failed to create mount directory "/var/snap/lxd/common/lxd/storage-pools/data-nvme/containers/mycontainer": mkdir /var/snap/lxd/common/lxd/storage-pools/data-nvme/containers/mycontainer: no such file or directory

When I try to move the container to reproduce the journaling problem, I get:

lxc move mycontainer mycontainer_tmp -s test
Error: Create instance from copy: Create instance volume from copy failed: [lstat /var/snap/lxd/common/lxd/storage-pools/data-nvme/containers/mycontainer/: no such file or directory Failed reading migration header: context canceled]

So lxd still tries to access the wrong path.

tomponline commented 2 years ago

So I think we need to get back to the original error, before you started to manually change the storage pool path contents, as it gets quite complicated when taking into account the snap mount namespace.

What I'd like to see is you getting back to the original error Failed to change ACLs on /var/snap/lxd/common/lxd/storage-pools/data-nvme/containers/old_container/rootfs/var/log/journal and then removing var/log/journal from inside the instance and trying again.

tomponline commented 2 years ago

Just to be clear, the path /var/snap/lxd/common/lxd/storage-pools/data-nvme/ is not incorrect, the storage pool driver will mount the source of the pool into the standard pool location. So that path being used isn't the issue here.
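
One way to check this without touching the pool directory is to look at the mount table as seen from inside the snap's mount namespace. The sketch below is an illustration, not an LXD tool: the helper name pool_is_mounted and the temporary file path are assumptions, while the nsenter invocation is the one used earlier in this thread.

```shell
#!/bin/sh
# pool_is_mounted MOUNTINFO_FILE MOUNTPOINT
# Hypothetical helper: succeeds if MOUNTPOINT appears as a mount point
# (field 5) in a file that uses the /proc/<pid>/mountinfo format.
pool_is_mounted() {
    awk -v target="$2" '$5 == target { found = 1 } END { exit !found }' "$1"
}

# Usage (needs root and the LXD snap; /tmp/lxd.mountinfo is a placeholder):
#   sudo nsenter --mount=/run/snapd/ns/lxd.mnt -- cat /proc/self/mountinfo > /tmp/lxd.mountinfo
#   pool_is_mounted /tmp/lxd.mountinfo /var/snap/lxd/common/lxd/storage-pools/data-nvme \
#       && echo "mounted in snap namespace" || echo "not mounted"
```

Whether an entry is expected at a given moment may depend on the pool driver and the instance state, so treat the result as a hint rather than proof of a bug.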

debug-richard commented 2 years ago

Looking at the log I saved, that's what happened:

  1. Created a new pool and migrated the container
  2. The container was started and running, but access to the mount was not possible because the access rights had changed (nobody:nogroup from inside the container)
  3. I compared the configuration and found that this container has the setting "raw.idmap: both 100000 100000".
  4. Changed it to "raw.idmap: both 1000 1000" to match the host user with the container user
  5. Started the container and got the error: Error: Failed to handle idmapped storage: invalid argument - Failed to change ACLs on /var/snap/lxd/common/lxd/storage-pools/data-nvme/containers/mycontainer/rootfs/var/log/journal
  6. Changed it back to "100000 100000" and it started again
  7. Changed it several times but only "100000 100000" worked
  8. I found that the /var/snap/lxd/common/lxd/storage-pools/data-nvme/ directory is empty and created the symlink
  9. Started the container and got: Error: Failed to handle idmapped storage: invalid argument - Failed to change ACLs on /media/data-nvme/lxdstorage/containers/mycontainer/rootfs/var/log/journal
  10. Removed the journal as final step
  11. The container is now running and can access the shared mount
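
The part of this sequence that ended with a working container can be sketched as a dry run. This is not a confirmed fix, since the thread never established the root cause. The container name, the run wrapper, and the choice to remove the journal with "lxc exec" while the container still starts under its old idmap are all assumptions made for illustration.

```shell
#!/bin/sh
# Dry-run sketch of the steps above; prints commands unless DRY_RUN=0.
# Assumption: the journal is removed from inside the running container.
# Note: rm -rf /var/log/journal discards the persistent journald logs.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command (without original quoting) instead of executing it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

apply_idmap_workaround() {
    container="$1"
    idmap="$2"
    run lxc exec "$container" -- rm -rf /var/log/journal
    run lxc stop "$container"
    run lxc config set "$container" raw.idmap "$idmap"
    run lxc start "$container"
}

apply_idmap_workaround mycontainer "both 1000 1000"
```

Running it with the default DRY_RUN=1 only lists the four lxc commands, so the order can be reviewed before anything is changed.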

debug-richard commented 2 years ago

> Just to be clear, the path /var/snap/lxd/common/lxd/storage-pools/data-nvme/ is not incorrect, the storage pool driver will mount the source of the pool into the standard pool location. So that path being used isn't the issue here.

Ok, but if I delete the symlink and create an empty directory instead, lxd should mount the pool/container, right?

tomponline commented 2 years ago

Yes, but you need to do that inside the snap's mount namespace.

debug-richard commented 2 years ago

I restored the folder but launching it did not work. So I created the symlink again and moved the containers to a new storage pool (under /media/data-nvme/). This worked without problems and all containers are running again.

If the paths were not the problem and the permissions now seem to work (for whatever reason), the only question left is why the ACL problem occurred.

The post https://discuss.linuxcontainers.org/t/cannot-start-a-copied-container-failed-to-change-acls/7982/2 explains that this is caused by an "old bug". LXD 4.0.0 was released in April 2020, the post is from May 2020, and I have been using LXD 4.0.x since the end of 2020, so this could be a regression. Anyway, I can't reproduce the bug and I didn't save the permissions of the journal files, so I can't do anything else.

So let's close this until the next person stumbles over it. Thanks anyway.