canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Help on server getting stuck #5197

Closed laralar closed 5 years ago

laralar commented 5 years ago

Required information

I run around 40 LXD hosts.

I have been applying the latest updates regularly, but since I upgraded the hosts to Ubuntu 18.04, a server gets stuck from time to time. It does not even respond to keystrokes on the console, so I cannot log in there and have to reboot it through the remote console (iLO/iDRAC).

There are no messages on the console.

/var/log/syslog contains no relevant messages either: the last entry before the hang and the first entry after the reboot are unremarkable.

For example:


Oct 22 17:17:01 node33 CRON[21497]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 22 18:17:01 node33 CRON[5452]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 22 18:33:39 node33 systemd[1]: Starting Daily apt download activities...
Oct 22 18:33:40 node33 systemd[1]: Started Daily apt download activities.
Oct 22 19:17:01 node33 CRON[48541]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Oct 22 19:28:34 node33 snapd[42014]: storehelpers.go:398: cannot refresh: snap has no updates available: "core", "lxd"
Oct 22 19:28:34 node33 snapd[42014]: autorefresh.go:387: auto-refresh: all snaps are up-to-date
Oct 23 09:58:22 node33 systemd-modules-load[793]: Inserted module 'iscsi_tcp'
Oct 23 09:58:22 node33 systemd-modules-load[793]: Inserted module 'ib_iser'
Oct 23 09:58:22 node33 systemd[1]: Started Create list of required static device nodes for the current kernel.
Oct 23 09:58:22 node33 systemd[1]: Started Load Kernel Modules.
Oct 23 09:58:22 node33 systemd[1]: Started LVM2 metadata daemon.
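
The jump between the last entry before a hang and the first entry after the reboot can be located mechanically, which helps when comparing many hosts. A minimal sketch, assuming the traditional "Mon DD HH:MM:SS" syslog prefix and GNU date; the function name find_log_gaps and the 7200-second default threshold are illustrative choices, not part of LXD or Ubuntu:

```shell
# Print pairs of consecutive syslog timestamps that are more than
# THRESHOLD seconds apart (default 7200). Reads log lines on stdin.
# Assumptions: "Mon DD HH:MM:SS" prefix, GNU date, no year rollover.
find_log_gaps() (
    threshold="${1:-7200}"
    prev_ts=""
    prev_epoch=""
    while IFS= read -r line; do
        # Skip lines too short to carry a timestamp.
        [ "${#line}" -ge 15 ] || continue
        ts=$(printf '%s' "$line" | cut -c1-15)
        # Skip lines whose prefix is not a parseable date.
        epoch=$(date -u -d "$ts" +%s 2>/dev/null) || continue
        if [ -n "$prev_epoch" ] && [ $((epoch - prev_epoch)) -gt "$threshold" ]; then
            printf 'gap of %s s between "%s" and "%s"\n' \
                "$((epoch - prev_epoch))" "$prev_ts" "$ts"
        fi
        prev_ts="$ts"
        prev_epoch="$epoch"
    done
)
```

Fed the excerpt above with a 7200-second threshold (for example `find_log_gaps 7200 < excerpt.log`, where excerpt.log is a hypothetical file holding those lines), it would flag only the jump from Oct 22 19:28:34 to Oct 23 09:58:22, since the hourly cron entries are closer together than the threshold.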

Nothing else is installed or running on the hosts except LXD and a few containers on each host. The hangs happen randomly (so far I have not been able to identify the cause) and, if I am not mistaken, they started when I upgraded from 16.04 to 18.04.

At first I suspected a kernel issue, because some messages on the console hinted at one. Now the console shows nothing at all.

Which other logs could I check to get a clue about what is happening? I am not very used to exploring system/kernel logs, which is why I am asking for help.

The LXD log doesn't show anything relevant either. These are the last messages of the previous log file, lxd.log.1:

lvl=info msg="Done updating instance types" t=2018-10-22T01:54:31+0530
lvl=info msg="Updating images" t=2018-10-22T07:54:22+0530
lvl=info msg="Done updating images" t=2018-10-22T07:54:22+0530
lvl=info msg="Updating images" t=2018-10-22T13:54:22+0530
lvl=info msg="Done updating images" t=2018-10-22T13:54:22+0530

dmesg, if I am not mistaken, only shows information since the last boot.
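
That is correct: dmesg only covers the current boot. Kernel messages from the boot that hung can be read through journalctl, but only if journald keeps its journal on disk. A minimal sketch, assuming systemd-journald and its standard /var/log/journal location; both function names are hypothetical helpers:

```shell
# Succeeds if the journal directory exists, which is what makes
# journald (with its default Storage=auto) keep logs across reboots.
# The optional argument overrides the directory to check.
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}

# Show the tail of the kernel log from the previous boot (-b -1).
# The optional argument is the number of lines (default 200).
previous_boot_kernel_log() {
    if ! journal_is_persistent; then
        echo "no persistent journal: create /var/log/journal and restart systemd-journald" >&2
        return 1
    fi
    journalctl -k -b -1 -n "${1:-200}" --no-pager
}
```

After a hang and reboot, `previous_boot_kernel_log` would show whatever the kernel managed to log before the freeze. If the machine froze hard, the final messages may never have reached the disk, so an empty result does not rule out a kernel problem.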

The issue is happening on different hardware: HP ProLiant DL380p servers with 24GB to 48GB of RAM, Dell servers with 16GB to 128GB of RAM, and blade servers with 4GB of RAM.

It seems to happen more often on servers that may be overcommitted on memory for the running LXD containers (once all their containers are up, there is hardly any memory left on the host). Could it be that swap is exhausted?
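
Whether memory or swap is close to running out can be checked directly on the host. A minimal sketch that reads /proc/meminfo; the function name mem_summary is a hypothetical helper, and the optional file argument exists only so it can be pointed at a saved copy:

```shell
# Print available memory and swap figures in MiB.
# /proc/meminfo reports these values in kB.
mem_summary() {
    awk '/^(MemAvailable|SwapTotal|SwapFree):/ {
        printf "%s %d MiB\n", $1, $2 / 1024
    }' "${1:-/proc/meminfo}"
}
```

Run periodically (for instance from cron) and appended to a file, this would leave a record of how close a host was to exhaustion before a hang. If the kernel's OOM killer had fired, the kernel log would normally contain lines mentioning "Out of memory" or "oom-killer".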

But last night it happened on a server that on average has a few GB free and runs no CPU-heavy containers.

As I mentioned, the hosts run only LXD containers and nothing else. A few of them also run some KVM guests, but the hangs mostly happen on servers that have none.

Any help would be appreciated

Thanks

Steps to reproduce

I have not been able to reproduce it so far.

Information to attach

This is the only software that I have installed on the hosts:

pdsh -R ssh -w ubuntu@10.3.4.[33] sudo apt install -y zfsutils-linux iperf speedtest-cli \
     arp-scan cpu-checker qemu qemu-kvm libvirt-bin bridge-utils zram-config tree pdsh snapd ncdu ntp \
     ntpdate nfs-common criu
pdsh -R ssh -w ubuntu@10.3.4.[48] sudo apt purge -y lxd lxd-client
nohup pdsh -R ssh -w ubuntu@10.3.4.[48] sudo snap install lxd &

stgraber commented 5 years ago

Hi,

I'm going to close this as it's not an actual LXD issue.

I do however have one of my own servers running into this about once a week, exact same symptoms as you described and exact same frustration at not having anything printed on the console.

Can you file a bug at https://launchpad.net/ubuntu/+source/linux/+filebug and post the link to it here?

It's going to be a very hard one to track down but I will provide any help I can since I'm also affected.

Note that in my case, I'm nowhere near running out of memory on that system, the CPU isn't busy, and storage is an NVMe SSD without any reported issue. I've actually replaced the entire hardware (moved to another server) and still run into this.

stgraber commented 5 years ago

@laralar https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1799497