canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

lxc restore snapshot stuck #8150

Closed: florian-bellencontre closed this issue 3 years ago

florian-bellencontre commented 3 years ago

Issue description

When I created a snapshot and tried to restore it, the restore was stuck for more than 5 minutes, so I stopped the process with Ctrl+C.

I then tried to start my container, but it was also stuck for several minutes.

Steps to reproduce

  1. Create a snapshot with the `lxc snapshot` command.
  2. Restore the snapshot with the `lxc restore` command.
  3. The command hangs.
  4. Press Ctrl+C to stop the process.
  5. Attempt to start the container with `lxc start`, which also hangs.
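
For reference, here are the steps above as commands; the container name "elearning" and the snapshot name "snap0" are illustrative, not taken from the report:

```
lxc snapshot elearning snap0   # step 1: create the snapshot
lxc restore elearning snap0    # steps 2-3: the restore hangs
# step 4: interrupt with Ctrl+C, then:
lxc start elearning            # step 5: starting the container also hangs
```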

Information to attach

- [ ] Container configuration (`lxc config show NAME --expanded`)

```
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu bionic amd64 (20190328_07:42)
  image.os: Ubuntu
  image.release: bionic
  image.serial: "20190328_07:42"
  limits.cpu: "4"
  limits.memory: 4GB
  volatile.base_image: cb87b829e4d70c14fc5fbdca371341e823e312a1e6b129378fd89ec7cce7945c
  volatile.eth120.hwaddr: 00:16:3e:3d:aa:cc
  volatile.eth136.hwaddr: 00:16:3e:0d:17:e1
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
devices:
  eth120:
    name: eth120
    nictype: bridged
    parent: br_vlan120
    type: nic
  eth136:
    name: eth136
    nictype: bridged
    parent: br_vlan136
    type: nic
  root:
    path: /
    pool: default
    size: 100GB
    type: disk
ephemeral: false
profiles:
- net-1
- net-2
- mem-4GB
- cpu-4-cores
stateful: false
description: ""
```

- [ ] Other:

root@srv-235:~ # grep default/containers/elearning /proc/*/mountinfo
root@srv-235:~ #

stgraber commented 3 years ago

Can you show the output of dmesg?

This very much sounds like a kernel issue.

florian-bellencontre commented 3 years ago

dmesg log:

[Thu Nov 12 14:38:47 2020] audit: type=1400 audit(1605188347.873:4559): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-elearning_</var/snap/lxd/common/lxd>" name="/bin/" pid=177441 comm="(ionclean)" flags="ro, remount, noatime, bind"
[Thu Nov 12 15:08:47 2020] audit: type=1400 audit(1605190148.042:4560): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-elearning_</var/snap/lxd/common/lxd>" name="/bin/" pid=183931 comm="(ionclean)" flags="ro, remount, noatime, bind"
[Thu Nov 12 15:38:47 2020] audit: type=1400 audit(1605191948.244:4561): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-elearning_</var/snap/lxd/common/lxd>" name="/bin/" pid=192543 comm="(ionclean)" flags="ro, remount, noatime, bind"
[Thu Nov 12 16:08:48 2020] audit: type=1400 audit(1605193748.425:4562): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-elearning_</var/snap/lxd/common/lxd>" name="/bin/" pid=58105 comm="(ionclean)" flags="ro, remount, noatime, bind"
[Thu Nov 12 16:38:48 2020] audit: type=1400 audit(1605195548.610:4563): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxd-elearning_</var/snap/lxd/common/lxd>" name="/bin/" pid=71249 comm="(ionclean)" flags="ro, remount, noatime, bind"
[Thu Nov 12 16:51:47 2020] veth800cf365: renamed from eth120
[Thu Nov 12 16:51:47 2020] br_vlan120: port 2(veth2724a662) entered disabled state
[Thu Nov 12 16:51:47 2020] veth5980fba7: renamed from eth136
[Thu Nov 12 16:51:47 2020] br_vlan136: port 2(vethf9de70c8) entered disabled state
[Thu Nov 12 16:51:48 2020] device vethf9de70c8 left promiscuous mode
[Thu Nov 12 16:51:48 2020] br_vlan136: port 2(vethf9de70c8) entered disabled state
[Thu Nov 12 16:51:48 2020] device veth2724a662 left promiscuous mode
[Thu Nov 12 16:51:48 2020] br_vlan120: port 2(veth2724a662) entered disabled state
stgraber commented 3 years ago

OK, that's all fine. Can you show `ps fauxww`?

florian-bellencontre commented 3 years ago

Unfortunately, the ps output is very long and contains critical production data, so I can't share it in full; I might miss something while anonymizing it. If you can narrow down what you need, I can give you a more succinct extract.

stgraber commented 3 years ago

I'm looking for any "zfs" process running in there.
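
For example, one quick way to filter for those (a sketch; the `[z]` trick keeps grep from matching its own entry in the process list):

```
ps fauxww | grep '[z]fs'
```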

florian-bellencontre commented 3 years ago

The only zfs processes I can find are these:

root     148954  0.0  0.0   4636  1844 ?        Ss   Nov03   0:00 /bin/sh /snap/lxd/18150/commands/daemon.start
root     149530  2.4  0.1 6231828 197544 ?      Sl   Nov03 311:32  \_ lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
lxd      150514  0.0  0.0  49964  3628 ?        Ss   Nov03   0:05      \_ dnsmasq --keep-in-foreground --strict-order --bind-interfaces --except-interface=lo --pid-file= --no-ping --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.154.77.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.154.77.2,10.154.77.254,1h --listen-address=fd42:3af8:9c1a:2bf8::1 --enable-ra --dhcp-range ::,constructor:lxdbr0,ra-stateless,ra-names -s lxd -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd -g lxd
root      32798  0.0  0.0  37764  4520 ?        S    16:52   0:00      \_ zfs mount nvme1/lxd/containers/elearning
root      32799  0.0  0.0  22412  1388 ?        D    16:52   0:00      |   \_ /bin/mount --no-canonicalize -t zfs -o defaults,noatime,dev,exec,rw,suid,nomand,zfsutil nvme1/lxd/containers/elearning /var/snap/lxd/common/lxd/storage-pools/default/containers/elearning
root       3866  0.0  0.0  37764  4596 ?        S    17:18   0:00      \_ zfs mount nvme1/lxd/containers/elearning
root       3867  0.0  0.0  22412  1328 ?        D    17:18   0:00      |   \_ /bin/mount --no-canonicalize -t zfs -o defaults,noatime,dev,exec,rw,suid,nomand,zfsutil nvme1/lxd/containers/elearning /var/snap/lxd/common/lxd/storage-pools/default/containers/elearning
root      36437  0.0  0.0  37764  4452 ?        S    17:23   0:00      \_ zfs mount nvme1/lxd/containers/elearning
root      36438  0.0  0.0  22412  1308 ?        D    17:23   0:00      |   \_ /bin/mount --no-canonicalize -t zfs -o defaults,noatime,dev,exec,rw,suid,nomand,zfsutil nvme1/lxd/containers/elearning /var/snap/lxd/common/lxd/storage-pools/default/containers/elearning
root       2345  0.0  0.0  37764  4596 ?        S    18:17   0:00      \_ zfs mount nvme1/lxd/containers/elearning
root       2346  0.0  0.0  22412  1336 ?        D    18:17   0:00          \_ /bin/mount --no-canonicalize -t zfs -o defaults,noatime,dev,exec,rw,suid,nomand,zfsutil nvme1/lxd/containers/elearning /var/snap/lxd/common/lxd/storage-pools/default/containers/elearning
stgraber commented 3 years ago

Thanks, this is a kernel bug then.

stgraber commented 3 years ago

In the above output, you can see LXD running `zfs mount ...`, which in turn calls `/bin/mount`; that mount process is stuck in I/O wait (the `D` state), which usually means it is blocked on a kernel operation.

You can look at `/proc/2346/stack`, for example, to get a better idea of exactly what it's stuck on.
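
A minimal sketch of that inspection, using the PID from the ps output above (reading `/proc/<pid>/stack` requires root):

```
# Show the kernel stack of the stuck mount process.
cat /proc/2346/stack

# Confirm it is in uninterruptible sleep: STAT should show "D",
# and WCHAN names the kernel function it is sleeping in.
ps -o pid,stat,wchan:32,cmd -p 2346
```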

Unfortunately, there's nothing LXD can do in such cases. Those processes are unkillable and may only ever return if the kernel recovers; if it doesn't, the only option is a reboot.
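
As an extra diagnostic (an assumption on my part, not something requested in the thread: this requires the magic SysRq key to be enabled), the kernel can be asked to dump all tasks stuck in the D state into its log:

```
# Dump all tasks in uninterruptible (D) state to the kernel ring buffer,
# then read the resulting traces with dmesg.
echo w > /proc/sysrq-trigger
dmesg | tail -n 80
```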