Closed laralar closed 6 years ago
Hi,
I'm going to close this as it's not an actual LXD issue.
I do however have one of my own servers running into this about once a week, exact same symptoms as you described and exact same frustration at not having anything printed on the console.
Can you file a bug at https://launchpad.net/ubuntu/+source/linux/+filebug and post the link to it here?
It's going to be a very hard one to track down but I will provide any help I can since I'm also affected.
Note that in my case, I'm nowhere near running out of memory on that system, the CPU isn't busy, storage is an NVMe SSD without any reported issues, and I've actually replaced the entire hardware (moved to another server) and still run into this.
Required information
Issue description
I run around 40 LXD hosts.
I have been applying the latest updates regularly, but since I upgraded them to Ubuntu 18.04, a server hangs from time to time. I have to log in to the remote console using iLO/iDRAC and reboot the server, since it doesn't even respond to keystrokes for a console login.
There are no messages on the console.
In /var/log/syslog there is no relevant message, only the last one before the hang and the first one after the reboot.
For example:
There is nothing else installed or running on the hosts except LXD and a few containers on each host. It happens randomly (so far I haven't been able to identify the cause) and, if I am not mistaken, it started happening when I upgraded from 16.04 to 18.04.
At first I thought it was a kernel issue, and there were some messages on the console that gave some clues about it. But now there is nothing.
What else could I try, and which logs should I look at to get a clue about what is happening? I am not very used to exploring system/kernel logs, which is why I am asking for help.
The LXD log doesn't show anything relevant either. These are the last messages of the previous log file, lxd.log.1:
dmesg, if I am not mistaken, only shows information since the last boot.
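Since dmesg only covers the current boot, one way to look at kernel messages from before the hang is the systemd journal. This is only a sketch: it assumes journald keeps a persistent journal under /var/log/journal on the host, which may not be the case everywhere. The function names `has_persistent_journal` and `show_previous_boot_kernel_log` are made up for this example. A hard hang can also stop the last messages from reaching disk, so an empty result does not rule out a kernel problem.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the journal directory exists and is not
# empty, i.e. journald appears to keep logs across reboots.
# Assumption: the default location /var/log/journal is in use.
has_persistent_journal() {
    dir="${1:-/var/log/journal}"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Hypothetical helper: print the tail of the kernel log from the previous
# boot. "-k" selects kernel messages, "-b -1" selects the boot before the
# current one. The optional argument is the number of lines (default 200).
show_previous_boot_kernel_log() {
    if has_persistent_journal; then
        journalctl -k -b -1 --no-pager | tail -n "${1:-200}"
    else
        echo "no persistent journal found" >&2
        return 1
    fi
}
```

After sourcing the snippet, running `show_previous_boot_kernel_log 100` as root right after a forced reboot would print the last 100 kernel lines recorded before the hang, if any were written.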
The issue is happening on different hardware: HP ProLiant DL380p servers with 24GB to 48GB of RAM, Dell servers with 16GB to 128GB of RAM, and blade servers with 4GB of RAM.
It seems to happen more often on servers that may be overloaded in terms of the memory assigned to the running LXD containers (after all the respective containers are up, there is hardly any memory left on the host). Could it be that swap is exhausted?
But last night it happened on a server that on average has a few GB free and where no major CPU-consuming container is running.
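To test the swap theory, the memory and swap state could be recorded periodically so that the last reading before a hang survives the reboot. The sketch below reads the standard MemAvailable, SwapTotal and SwapFree fields of /proc/meminfo. The function name `mem_summary` and the log path in the comment are placeholders chosen for this example, not anything LXD provides.

```shell
#!/bin/sh
# Hypothetical helper: summarise available memory and swap usage from a
# meminfo-format file (defaults to /proc/meminfo; values there are in kB).
mem_summary() {
    awk '
        $1 == "MemAvailable:" { avail = $2 }
        $1 == "SwapTotal:"    { total = $2 }
        $1 == "SwapFree:"     { free  = $2 }
        END {
            used = total - free
            # Avoid dividing by zero on hosts without swap.
            pct = (total > 0) ? int(100 * used / total) : 0
            printf "mem_available_kb=%d swap_used_kb=%d swap_used_pct=%d\n", avail, used, pct
        }
    ' "${1:-/proc/meminfo}"
}

# Example use (the log path is an assumption): append one reading per minute.
#   while sleep 60; do
#       echo "$(date -Iseconds) $(mem_summary)"
#   done >> /var/log/mem_summary.log
```

If the last lines logged before a hang show swap nearly full and little memory available, that would support the memory pressure theory. Readings with plenty of headroom would point elsewhere.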
As I mentioned, I only have LXD containers on the hosts, nothing else. On some I have a few KVM guests, but the hangs happen mostly on servers where there are none.
Any help would be appreciated
Thanks
Steps to reproduce
I have not been able to reproduce it so far.
Information to attach
dmesg
lxc info NAME --show-log
lxc config show NAME --expanded
lxc monitor (while reproducing the issue)
This is the only software that I have installed on the hosts: