canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

`msg="Failed to get disk stats" err="unexpected EOF"` when collecting metrics for a certain container #10746

Closed · simondeziel closed this 2 years ago

simondeziel commented 2 years ago

Required information

root@xeon:~# snap list lxd
Name  Version        Rev    Tracking      Publisher   Notes
lxd   5.0.0-b0287c1  22923  5.0/stable/…  canonical✓  -


Issue description

I recently reinstalled my host (named `xeon`) with Ubuntu 22.04.1. After some time, it started throwing this error on every prometheus scrape (every 15s):

Aug 3 07:00:07 xeon lxd.daemon[1313]: time="2022-08-03T07:00:07Z" level=warning msg="Failed to get disk stats" err="unexpected EOF" instance=metrics instanceType=container project=default
Aug 3 07:00:22 xeon lxd.daemon[1313]: time="2022-08-03T07:00:22Z" level=warning msg="Failed to get disk stats" err="unexpected EOF" instance=metrics instanceType=container project=default
Aug 3 07:00:37 xeon lxd.daemon[1313]: time="2022-08-03T07:00:37Z" level=warning msg="Failed to get disk stats" err="unexpected EOF" instance=metrics instanceType=container project=default
...


A `snap restart lxd` made it go away until it came back the day after. The warning is always about the container named `metrics`.

The container's config:

root@xeon:~# lxc config show --expanded metrics
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu focal amd64 (20220113_07:42)
  image.os: Ubuntu
  image.release: focal
  image.serial: "20220113_07:42"
  image.type: squashfs
  image.variant: default
  limits.cpu.allowance: 100%
  limits.memory: 512MiB
  limits.processes: "500"
  security.devlxd: "false"
  security.idmap.isolated: "true"
  security.nesting: "true"
  security.privileged: "false"
  security.protection.delete: "true"
  security.syscalls.deny_compat: "true"
  snapshots.expiry: 3d
  snapshots.schedule: '@daily, @startup'
  volatile.base_image: cd37dfe79d6edd4ab36943f5ca4226d47280285772bb457b622bbcec92fe350f
  volatile.cloud-init.instance-id: 9220ad48-38c7-42de-93e0-a9fd21046d1c
  volatile.eth0.host_name: vethba646770
  volatile.eth0.hwaddr: 00:16:3e:bb:f5:f6
  volatile.eth0.name: eth0
  volatile.idmap.base: "1131072"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1131072,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1131072,"Nsid":0,"Maprange":65536}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1131072,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1131072,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1131072,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":1131072,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 7760acd1-e480-483e-949d-95b4d43cdd2d
devices:
  eth0:
    network: int
    type: nic
  prometheus:
    path: /var/snap/prometheus/common/
    pool: default
    source: prometheus
    type: disk
  root:
    path: /
    pool: default
    size: 4GiB
    type: disk
ephemeral: false
profiles:

That container is one of many on that server, and other containers also have volumes attached to them:

root@xeon:~# lxc ls
+---------+---------+---------------------+-------------------------------+-----------+-----------+
|  NAME   |  STATE  |        IPV4         |             IPV6              |   TYPE    | SNAPSHOTS |
+---------+---------+---------------------+-------------------------------+-----------+-----------+
| log     | RUNNING | 172.24.21.51 (eth0) | 2001:470:b1c3:7941::51 (eth0) | CONTAINER | 4         |
+---------+---------+---------------------+-------------------------------+-----------+-----------+
| metrics | RUNNING | 172.24.21.66 (eth0) | 2001:470:b1c3:7941::66 (eth0) | CONTAINER | 0         |
+---------+---------+---------------------+-------------------------------+-----------+-----------+
| puppet  | RUNNING | 172.24.21.40 (eth0) | 2001:470:b1c3:7941::40 (eth0) | CONTAINER | 4         |
+---------+---------+---------------------+-------------------------------+-----------+-----------+
| redmine | STOPPED |                     |                               | CONTAINER | 0         |
+---------+---------+---------------------+-------------------------------+-----------+-----------+
| smb     | RUNNING | 172.24.28.45 (eth0) | 2001:470:b1c3:7948::45 (eth0) | CONTAINER | 4         |
+---------+---------+---------------------+-------------------------------+-----------+-----------+
| squid   | RUNNING | 172.24.21.28 (eth0) | 2001:470:b1c3:7941::28 (eth0) | CONTAINER | 4         |
+---------+---------+---------------------+-------------------------------+-----------+-----------+

The only unusual thing about `metrics` is that it runs snapd and has some snaps installed inside it.

root@metrics:~# snap list
Name        Version   Rev    Tracking       Publisher   Notes
core20      20220719  1587   latest/stable  canonical✓  base
prometheus  2.32.1    73     20.04/edge     canonical✓  -
snapd       2.56.2    16292  latest/stable  canonical✓  snapd
root@metrics:~# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0    7:0    0    62M  1 loop 
loop1    7:1    0    62M  1 loop 
loop2    7:2    0    80M  1 loop 
loop3    7:3    0    47M  1 loop 
sda      8:0    0 232.9G  0 disk 
├─sda1   8:1    0     1M  0 part 
├─sda2   8:2    0    24G  0 part 
├─sda3   8:3    0     2G  0 part 
└─sda4   8:4    0   128G  0 part 
sdb      8:16   0 232.9G  0 disk 
├─sdb1   8:17   0     1M  0 part 
├─sdb2   8:18   0    24G  0 part 
├─sdb3   8:19   0     2G  0 part 
└─sdb4   8:20   0   128G  0 part 
sdc      8:32   0   2.7T  0 disk 
├─sdc1   8:33   0   2.7T  0 part 
└─sdc9   8:41   0     8M  0 part

In the above, the `sda`, `sdb` and `sdc` devices are leaked from the host :/

Comparing the cgroup files for `metrics` with those of another container (`squid`), we see that `io.stat` is populated only for `metrics`:

root@xeon:~# grep . /sys/fs/cgroup/lxc.payload.metrics/io.*
/sys/fs/cgroup/lxc.payload.metrics/io.pressure:some avg10=0.00 avg60=0.00 avg300=0.00 total=10851611
/sys/fs/cgroup/lxc.payload.metrics/io.pressure:full avg10=0.00 avg60=0.00 avg300=0.00 total=3709916
/sys/fs/cgroup/lxc.payload.metrics/io.prio.class:no-change
/sys/fs/cgroup/lxc.payload.metrics/io.stat:8:16 
/sys/fs/cgroup/lxc.payload.metrics/io.weight:default 100
root@xeon:~# grep . /sys/fs/cgroup/lxc.payload.squid/io.*
/sys/fs/cgroup/lxc.payload.squid/io.pressure:some avg10=0.00 avg60=0.00 avg300=0.00 total=690648
/sys/fs/cgroup/lxc.payload.squid/io.pressure:full avg10=0.00 avg60=0.00 avg300=0.00 total=616654
/sys/fs/cgroup/lxc.payload.squid/io.prio.class:no-change
/sys/fs/cgroup/lxc.payload.squid/io.weight:default 100

Across all containers, only `metrics` has content in its `io.stat` file:

root@xeon:~# grep . /sys/fs/cgroup/lxc.payload.*/io.stat
/sys/fs/cgroup/lxc.payload.metrics/io.stat:8:16
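
Worth noting: "unexpected EOF" is exactly the error Go's fmt scanning functions return when the input ends before the whole format string has been matched, so a counter-less entry like this bare `8:16` would trip any scanner that expects the full set of counters. A minimal sketch reproducing the message (the format string here is illustrative; I haven't checked it against the actual LXD parsing code):

package main

import "fmt"

func main() {
	var major, minor, rbytes uint64

	// A complete entry satisfies the whole format string.
	_, err := fmt.Sscanf("8:16 rbytes=352256", "%d:%d rbytes=%d", &major, &minor, &rbytes)
	fmt.Println(err) // <nil>

	// A counter-less entry, like the bare "8:16 " above, runs out of
	// input before "rbytes=" can be matched.
	_, err = fmt.Sscanf("8:16 ", "%d:%d rbytes=%d", &major, &minor, &rbytes)
	fmt.Println(err) // unexpected EOF
}
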
tomponline commented 2 years ago

At the very least we need to improve the quality of the errors here.
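
One cheap improvement (a sketch of the idea only, not a patch against the real code path) would be to wrap the scan error with the offending input, so the log points straight at the malformed line:

package main

import "fmt"

func main() {
	line := "8:16 "
	var major, minor, rbytes uint64
	_, err := fmt.Sscanf(line, "%d:%d rbytes=%d", &major, &minor, &rbytes)
	if err != nil {
		// %q quotes the raw input, so trailing whitespace and
		// truncation are visible in the log message.
		fmt.Println(fmt.Errorf("failed to parse io.stat line %q: %w", line, err))
		// Output: failed to parse io.stat line "8:16 ": unexpected EOF
	}
}
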

simondeziel commented 2 years ago

@tomp, it happened again (no surprise), but I thought this was worth capturing:

# Captured while the problematic msg="Failed to get disk stats" err="unexpected EOF" is being logged:

root@xeon:~# grep -n . /sys/fs/cgroup/lxc.payload.*/io.stat
/sys/fs/cgroup/lxc.payload.metrics/io.stat:1:8:0 8:16 rbytes=352256 wbytes=0 rios=10 wios=0 dbytes=0 dios=0
/sys/fs/cgroup/lxc.payload.metrics/io.stat:2:7:1 rbytes=206848 wbytes=0 rios=10 wios=0 dbytes=0 dios=0
/sys/fs/cgroup/lxc.payload.metrics/io.stat:3:7:2 rbytes=2048 wbytes=0 rios=1 wios=0 dbytes=0 dios=0
root@xeon:~# lxc restart metrics
root@xeon:~# sleep 30
root@xeon:~# grep -n . /sys/fs/cgroup/lxc.payload.*/io.stat
root@xeon:~#
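
The first captured line is telling: `8:0 8:16 rbytes=...` packs a counter-less `8:0` entry and a normal `8:16` entry onto a single line, so a parser that assumes one complete entry per line will either error out or misattribute counters. A tolerant parse could treat every token without an `=` as a `major:minor` device id that starts a new entry; the `parseIOStat` helper below is a hypothetical sketch of that direction, not necessarily how this was fixed upstream:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseIOStat is a hypothetical, tolerant io.stat parser: any token
// without "=" is taken as a "major:minor" device id starting a new
// entry, and the key=value tokens that follow attach to that device.
func parseIOStat(content string) map[string]map[string]uint64 {
	stats := map[string]map[string]uint64{}
	cur := ""
	for _, tok := range strings.Fields(content) {
		key, val, isCounter := strings.Cut(tok, "=")
		if !isCounter {
			// New device entry, possibly with no counters at all.
			cur = tok
			stats[cur] = map[string]uint64{}
			continue
		}
		if cur == "" {
			continue // counter with no preceding device; ignore
		}
		if n, err := strconv.ParseUint(val, 10, 64); err == nil {
			stats[cur][key] = n
		}
	}
	return stats
}

func main() {
	// The joined "8:0 8:16 ..." line from the capture above no longer
	// aborts the scrape: 8:0 simply ends up with no counters.
	out := parseIOStat("8:0 8:16 rbytes=352256 wbytes=0\n7:1 rbytes=206848")
	fmt.Println(out)
	// map[7:1:map[rbytes:206848] 8:0:map[] 8:16:map[rbytes:352256 wbytes:0]]
}
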