Closed Freeaqingme closed 3 years ago
I figured I might be able to attach a debugger to see what's going on. However, unfortunately, gdb doesn't know what to do with Go. Also, Delve complains that it "could not open debug info" (presumably because of missing DWARF headers).
Having said that, for one reason or another* everything just seems to have resolved.
I realize this issue reads a little like a support issue, but I think that the fact that the socket can become unresponsive and make anything that connects to it unresponsive is a bit of a bug. But it may be quite difficult to find out what was causing it.
There was one thing that I changed. I have this script running every 1 second to transfer some files to my VM:
#!/bin/bash
vmName=puppet1
status=$(lxc info $vmName)
if [ $? != 0 ]; then
# vm not running
exit
fi
codePath=/etc/puppetlabs/code/environments/dev
lastModifiedTimestamp=$(find $codePath -type f -printf '%T@ %p\n' | \
sort -n | tail -1 | cut -d'.' -f1)
lastFewSecondsTimestamp=$(echo $(date +%s)-5 | bc)
if (( $lastModifiedTimestamp > $lastFewSecondsTimestamp )); then
echo "Files modified in last 5 seconds, syncing..."
tar zcf - -C $codePath . | lxc exec $vmName -- tar zxvf - -C $codePath >/dev/null
echo "Syncing complete!"
fi
The issue seemed to resolve itself right when I put an 'exit 0' statement on line 2. Perhaps that may or may not give a hint as to where the issue originates?
Can you show the output of dmesg
and ps fauxww
?
Per the output above, LXD itself is responding, but then gets locked up when trying to access the instances, so this smells like a kernel or hardware issue.
@stgraber dmesg is - for that time period - empty. Just to give an idea:
# dmesg -H -T | tail
[Tue Aug 10 08:12:45 2021] IPv4: martian source 10.0.2.15 from 91.189.92.40, on dev eth0
[Tue Aug 10 08:12:45 2021] ll header: 00000000: 08 00 27 5e 84 f7 52 54 00 12 35 02 08 00
[Tue Aug 10 08:12:58 2021] IPv4: martian source 10.0.2.15 from 91.189.92.40, on dev eth0
[Tue Aug 10 08:12:58 2021] ll header: 00000000: 08 00 27 5e 84 f7 52 54 00 12 35 02 08 00
[Tue Aug 10 08:13:11 2021] IPv4: martian source 10.0.2.15 from 91.189.92.40, on dev eth0
[Tue Aug 10 08:13:11 2021] ll header: 00000000: 08 00 27 5e 84 f7 52 54 00 12 35 02 08 00
[Tue Aug 10 08:56:25 2021] IPv4: martian source 10.0.2.15 from 192.168.1.1, on dev eth0
[Tue Aug 10 08:56:25 2021] ll header: 00000000: 08 00 27 5e 84 f7 52 54 00 12 35 02 08 00
[Tue Aug 10 11:35:02 2021] device brVM entered promiscuous mode
[Tue Aug 10 11:35:04 2021] device brVM left promiscuous mode
# dmesg -H -T | grep -i 'lxc\|lxd'
#
I didn't do ps fauxww while the problem was going on. I did however do this:
# systemctl status
● hpv1
State: degraded
Jobs: 1 queued
Failed: 1 units
Since: Tue 2021-08-10 08:03:14 UTC; 3h 23min ago
CGroup: /
├─ 1717 lxcfs /var/snap/lxd/common/var/lib/lxcfs -p /var/snap/lxd/common/lxcfs.pid
├─ 2094 /snap/lxd/20987/bin/virtiofsd --socket-path=/var/snap/lxd/common/lxd/logs/puppet1-dev-formicidae>
├─ 2201 /snap/lxd/20987/bin/qemu-system-x86_64 -S -name puppet1-dev-formicidae-holdings -uuid a4b84c16-6>
├─ 2206 /snap/lxd/20987/bin/virtiofsd --socket-path=/var/snap/lxd/common/lxd/logs/puppet1-dev-formicidae>
├─54259 /bin/sh /snap/lxd/21260/commands/daemon.start
├─54818 lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
├─init.scope
│ └─1 /sbin/init
└─system.slice
├─irqbalance.service
│ └─1061 /usr/sbin/irqbalance --foreground
├─haveged.service
│ └─955 /usr/sbin/haveged --Foreground --verbose=1 -w 1024
├─systemd-networkd.service
│ └─581 /lib/systemd/systemd-networkd
├─systemd-udevd.service
│ ├─ 381 /lib/systemd/systemd-udevd
│ ├─255232 /lib/systemd/systemd-udevd
│ ├─255233 /lib/systemd/systemd-udevd
│ ├─255240 /lib/systemd/systemd-udevd
│ ├─255241 /lib/systemd/systemd-udevd
│ ├─255242 /lib/systemd/systemd-udevd
│ ├─255243 /lib/systemd/systemd-udevd
│ ├─255244 /lib/systemd/systemd-udevd
│ ├─255245 /lib/systemd/systemd-udevd
│ ├─255246 /lib/systemd/systemd-udevd
│ ├─255247 /lib/systemd/systemd-udevd
│ ├─255248 /lib/systemd/systemd-udevd
│ ├─255249 /lib/systemd/systemd-udevd
│ ├─255250 /lib/systemd/systemd-udevd
│ ├─255251 /lib/systemd/systemd-udevd
│ ├─255252 /lib/systemd/systemd-udevd
│ ├─255253 /lib/systemd/systemd-udevd
│ ├─255254 /lib/systemd/systemd-udevd
│ ├─255255 /lib/systemd/systemd-udevd
│ ├─255262 /lib/systemd/systemd-udevd
│ └─255300 /lib/systemd/systemd-udevd
├─cron.service
│ └─1396 /usr/sbin/cron -f
├─unbound.service
│ └─1557 /usr/sbin/unbound -d
├─polkit.service
│ └─64551 /usr/lib/policykit-1/polkitd --no-debug
├─auditd.service
│ └─916 /sbin/auditd
├─systemd-journald.service
│ └─357 /lib/systemd/systemd-journald
├─ssh.service
│ ├─ 1431 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
│ ├─206609 sshd: vagrant [priv]
│ ├─206798 sshd: vagrant@pts/0
│ ├─206799 -bash
│ ├─227635 doas -s
│ ├─227636 /bin/bash
│ ├─234356 sshd: vagrant [priv]
│ ├─234423 sshd: vagrant@pts/3
│ ├─234424 -bash
│ ├─234472 doas -s
│ ├─234473 /bin/bash
│ ├─281819 systemctl status
│ └─281820 pager
├─snapd.service
│ └─5936 /usr/lib/snapd/snapd
├─sync-puppet-code.service
│ ├─281779 /bin/bash /usr/local/bin/syncPuppetCode.sh
│ ├─281814 /bin/bash /usr/local/bin/syncPuppetCode.sh
│ ├─281815 find /etc/puppetlabs/code/environments/dev -type f -printf %T@ %p
n
│ ├─281816 sort -n
│ ├─281817 tail -1
│ └─281818 cut -d. -f1
├─rsyslog.service
│ └─1063 /usr/sbin/rsyslogd -n -iNONE
├─system-openvpn.slice
│ └─openvpn@ipv6.service
│ └─1374 /usr/sbin/openvpn --daemon ovpn-ipv6 --status /run/openvpn/ipv6.status 10 --cd /etc/openvpn>
├─system-arpwatch.slice
│ ├─arpwatch@lo.service
│ │ └─1433 /usr/sbin/arpwatch -u arpwatch -i lo -f lo.dat -N -p -F
│ ├─arpwatch@eth0.service
│ │ └─1434 /usr/sbin/arpwatch -u arpwatch -i eth0 -f eth0.dat -N -p -F
│ └─arpwatch@brVM.service
│ └─63507 /usr/sbin/arpwatch -u arpwatch -i brVM -f brVM.dat -N -p -F
├─snap.lxd.lxc.9d69383b-ff43-4f2a-9690-a703b3c4d27d.scope
│ └─227644 lxc exec puppet1-dev-formicidae-holdings bash
├─zfs-zed.service
│ └─1071 /usr/sbin/zed -F
├─ntp.service
│ └─1421 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 113:123
├─dbus.service
│ └─1056 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation -->
├─system-getty.slice
│ └─getty@tty1.service
│ └─1443 /sbin/agetty -o -p -- \u --noclear tty1 linux
├─puppet.service
│ └─1375 /opt/puppetlabs/puppet/bin/ruby /opt/puppetlabs/puppet/bin/puppet agent --no-daemonize
├─ifup@eth0.service
│ └─993 /sbin/dhclient -1 -4 -v -i -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases ->
├─virtualbox-guest-utils.service
│ └─1192 /usr/sbin/VBoxService
└─systemd-logind.service
└─1068 /lib/systemd/systemd-logind
Re the label 'incomplete'. I think this issue maybe hardly actionable, other than perhaps some error handling or timeout to the call to the socket. But perhaps we can figure out what caused it :)
Unfortunately the systemctl
output doesn't provide me with the information I'd have liked to see (process state).
I agree that given the issue went away, there's not a whole lot we can do about it at this point :)
Based on your log output, LXD itself was responding to the API calls, so it wasn't a case of the socket not responding or the client not giving you a connection timeout. Instead the API would get stuck when internally accessing instance state or when attempting to execute a command.
This suggests that one of your instances was very very stuck as internally we have a 10s per instance timeout which would normally get us going. The only case where we can't move on is if we just go ourselves stuck in the kernel. This is usually visible in a ps fauxww
output as there being a bunch of processes in uninterruptible I/O state (D). If LXD somehow gets stuck reading from such a process, this kind of lock up can happen.
None of that is normal though and it's usually an indication of something being quite off with the kernel or the system being very badly bottlenecked by something. Most often it's I/O starvation or running out of memory, though there can also be purely software reasons for it (like dealing with a fork bomb or misbehaving software that acts much like one).
I'm going to close the issue for now. But if this shows up again, please comment with ps fauxww
and dmesg
at the time this hits. For good measure, also capture df -h
, free -m
and uptime
to get a bit of an idea of the state of the system.
I'm not entirely sure if this issue pertains to LXC or LXD. I just wanted to enter a VM using
lxc exec <vmName> bash
. That command hangs consistently without any output shown.lxc ls
also hangs with no output whatsoever (unless in debug mode):I think that for a command like this there could/should be a timeout on the call to the unix.socket, but that obviously wouldn't fix the problem. How should I go about fixing it and making things responsive again? It's a test environment so I could easily just reboot something, but as it is a lab set up for a future production environment I'd like to be able to fix it in a way I would also do on production in the future.
Required information
I
Information to attach
lxd.log: