canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

/proc mountpoint issues after lxd snap automatic refresh #7016

Closed · AceSlash closed this issue 4 years ago

AceSlash commented 4 years ago

Required information

Issue description

I already noticed during the previous snap update of LXD, from 3.20 to 3.21, that /proc/stat was no longer available in some containers, and I rebooted them. But this time, for the 3.21 to 3.22 update, the issue affects a large majority of my containers (hard to evaluate with so many alerts, but I'd say about 70% of containers are affected this time):

# ll /proc |grep "?????"
ls: cannot access '/proc/stat': Transport endpoint is not connected
ls: cannot access '/proc/swaps': Transport endpoint is not connected
ls: cannot access '/proc/uptime': Transport endpoint is not connected
ls: cannot access '/proc/cpuinfo': Transport endpoint is not connected
ls: cannot access '/proc/loadavg': Transport endpoint is not connected
ls: cannot access '/proc/meminfo': Transport endpoint is not connected
ls: cannot access '/proc/diskstats': Transport endpoint is not connected
-?????????  ? ?               ?                  ?            ? cpuinfo
-?????????  ? ?               ?                  ?            ? diskstats
-?????????  ? ?               ?                  ?            ? loadavg
-?????????  ? ?               ?                  ?            ? meminfo
-?????????  ? ?               ?                  ?            ? stat
-?????????  ? ?               ?                  ?            ? swaps
-?????????  ? ?               ?                  ?            ? uptime
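
For reference, a loop like this (a rough sketch; it assumes /proc/uptime is a representative probe and that all running containers are reachable via lxc exec) lists the affected containers from the host:

# probe one LXCFS-backed file in every running container
for c in $(lxc list status=running -c n --format csv); do
    lxc exec "$c" -- cat /proc/uptime >/dev/null 2>&1 || echo "$c is affected"
done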

Restarting the containers does fix the broken mounts, but on some hosts the restarted containers then get their stats from the host. That means they see the host's uptime and the host's /proc/stat (with all of the host's CPUs and memory), even though they have limits of their own.

This container was just stopped, then started:

# uptime 
 08:58:00 up 44 days, 22:23,  1 user,  load average: 2.47, 1.92, 1.70

When I see this last issue, dmesg shows the following (the container is unprivileged):

[Fri Mar 13 08:37:26 2020] audit: type=1400 audit(1584085095.597:677): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-container-name_</var/snap/lxd/common/lxd>" pid=70044 comm="apparmor_parser"

Not sure if that's relevant.

Steps to reproduce

  1. Install LXD from snap
  2. Wait for a LXD snap automatic update
  3. Your containers lose access to /proc/stat and the other LXCFS-provided files

Information to attach

I don't see anything unusual in the logs except the confirmation that this issue happened just after an LXD snap refresh.

On a side note, this time I will definitely disable automatic snap refreshes for LXD. With an issue like this, I have lost all monitoring of the containers, since the monitoring relies on /proc/stat inside them.
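
For anyone else in the same position, snapd can defer automatic refreshes; the exact options depend on the snapd version (a sketch, not an endorsement of any particular policy):

# defer all automatic refreshes until a given date (older snapd caps this at about 60 days)
sudo snap set system refresh.hold="2020-05-01T00:00:00Z"

# newer snapd (2.58+) can hold refreshes for a single snap indefinitely
sudo snap refresh --hold lxd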

I don't want to appear ungrateful toward the LXD project, which I like very much, but this is a very serious issue. We can't use automatic updates if they randomly break parts of the system.

Stability is paramount for something as critical as LXD, and I'm not sure the "stable" channel is stable enough to be used with automatic updates. I would gladly use an LTS snap channel that only receives critical updates.

Having to reboot hundreds of containers to fix this should not even be a thing.
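
The same probe as above can at least automate the recovery (again a sketch, with the caveat already noted that some restarted containers then expose host stats):

# restart only the containers whose LXCFS mounts are broken
for c in $(lxc list status=running -c n --format csv); do
    lxc exec "$c" -- cat /proc/uptime >/dev/null 2>&1 || lxc restart "$c"
done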

stgraber commented 4 years ago

This actually isn't an LXD or even a snapd issue, but an LXCFS reload bug which we fixed just a few minutes ago and are actively rolling out now.
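
For context, the affected /proc files are FUSE mounts served by the LXCFS daemon, and "Transport endpoint is not connected" is the usual symptom of a FUSE mount whose backing process went away. From inside a container the mounts can be inspected with:

# list the LXCFS FUSE mounts over /proc; entries look roughly like:
# lxcfs /proc/cpuinfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
grep lxcfs /proc/self/mounts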

Note that to stay on a specific version of the snap, we've had versioned tracks for a while now. It wouldn't have helped in this case, but if your worry is about automatic jumps in LXD version, tracks do allow preventing that:

snap refresh lxd --channel=3.22
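
The tracks and channels currently published for the snap can be listed with:

snap info lxd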

stgraber commented 4 years ago

See https://discuss.linuxcontainers.org/t/mount-fails-when-starting-container-zfs-filesystem-already-mounted-lxd-3-22/7080/2 for some more details on this.

AceSlash commented 4 years ago

OK, thank you for the details.