canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0
4.35k stars 931 forks source link

Modifying disk quotas causes unnecessary file attribute updates #13115

Closed mariusklocke closed 1 month ago

mariusklocke commented 7 months ago

Required information

Issue description

Updating the root disk size in LXD takes longer for containers with many files, because every file in the root disk is touched. I observed cases where the instance update took more than 5 minutes. LXD updates the fsx_projid and fsx_xflags attributes of all "regular" files. These updates are no-op from my point of view and should be skipped. I cannot image a scenario where there would be a change for those attributes during the lifecycle of a container. Touching every file slows down instance updates although there are no actual changes to the file attributes. I think the code responsible is in here. I think it should be the same issue on EXT4 (not tested).

Steps to reproduce

  1. Create a storage backend with dir driver on a XFS volume with project quotas enabled
  2. Create a container with root disk size of 10GiB
  3. Check extended file attributes using xfs_quota -x -c 'stat' /mnt/containers/c-1/rootfs/etc/hosts
  4. Check ctime using stat /mnt/containers/u1/rootfs/etc/hosts
  5. Update container root disk size to 15GiB: lxc config device set c-1 root size="15GiB"
  6. Re-check extended file attributes and ctime: ctime has been updated, but no file attributes have changed
mariusklocke commented 7 months ago

If you agree to my conclusions: I already spotted an easy fix for this issue. The setProject function is only called from here. Some lines above there already is a check if the quota project ID has changed. I think the issue would be resolved, if the quota.setProject() call would be moved to into the if statement body above.

mariusklocke commented 6 months ago

Is there missing anything in my issue report? I would appreciate any kind of feedback.

I recently discovered, that altering every file also causes errors, when the container executes a process where files are deleted while the instance update in LXD is still executing:

Error: Failed to update device "root": lstat /var/snap/lxd/common/lxd/storage-pools/default/containers/my-awesome-container/rootfs/tmp/YmCbeF: no such file or directory
tomponline commented 6 months ago

Thanks for your report. We will endeavour to investigate this soon.

I'm not sure why the files are all modified, although I suspect its to do with ensuring all files take part in the quota. Perhaps there is a better way.

mariusklocke commented 2 months ago

I just made a PR for this. We encounter the "no such file or directory" error now regularly and have no other option than to retry, which is a bummer for an operation which can take more than 5 minutes. I think a bugfix should be made for LXD 5.21

tomponline commented 2 months ago

We encounter the "no such file or directory" error now regularly and have no other option than to retry, which is a bummer for an operation which can take more than 5 minutes.

A quick fix would be to have LXD detect this sort of error and continue rather than failing.

As for not updating project on all files, this will need more consideration I think.

mariusklocke commented 2 months ago

I guess that how "quick" that fix would depend on who does the coding. I am not a Go developer at all and re-arranging code to fix a logical error is something I can contribute quickly. Getting into how error handling in Go works, is something very different.

tomponline commented 2 months ago

See if https://github.com/canonical/lxd/pull/13815 helps once it lands in latest/edge.

To be clear im not rejecting your PR, but in isolation I don't believe it will prevent the "no such file or directory" errors you're getting because the error appears to be happening in quota.SetProject.

Admittedly the likely hood of it happening is reduced by your PR as LXD won't do two walks of the container's filesystem, first to remove the project and then to re-add the new one.

I think your PR could provide a speed up indeed, but I will need to consider it more carefully because I'd like to understand why it was done that way originally before changing it.

mariusklocke commented 2 months ago

I'm not entirely happy, but thanks for providing an alternative solution to the issue. My primary use case is all containers with all limited root disks. If I understand the handling of quota project IDs in LXD correctly, they should never actually change except there is a new volume ID for the root disk (don't know if that ever happens). Only then the "file walk" should be required. My PR would fix performance and also "not found" errors for my use case, because the "file walk" would never happen in my use case (maybe once on container creation?). But maybe there are other use cases.

Any chance to get your fix in the currently supported LTS versions? Still preparing the LXD 5.21 update.

tomponline commented 2 months ago

Only then the "file walk" should be required. My PR would fix performance and also "not found" errors for my use case, because the "file walk" would never happen in my use case (maybe once on container creation?). But maybe there are other use cases.

Understood, once my PR lands, if you can rebase your PR so its mergeable and then I'll review in more detail. It makes sense to have the file missing check there in any case.

Any chance to get your fix in the currently supported LTS versions? Still preparing the LXD 5.21 update.

Yeah ill backport this and it'll be in LXD 5.21.3.

mariusklocke commented 3 weeks ago

@tomponline I see that there have been backports to stable-5.21 in the last weeks, but the fix for this issue wasn't included?

tomponline commented 2 weeks ago

I've included it in https://github.com/canonical/lxd/pull/14185