canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Can't interact with LXC #8503

Closed respadas closed 3 years ago

respadas commented 3 years ago

Required information

Issue description

The system has been up since September 11 and is a production server; everything worked fine until today. We can't use lxc for anything: every command ends with Error: Get http://unix.socket/1.0: EOF. There have been no configuration changes. The journalctl output has basically three messages:

Feb 26 13:08:22 LXD-nodo1 systemd[1]: lxd.service: Found left-over process 25238 (lxd) in control group while starting unit. Ignoring.
Feb 26 13:08:22 LXD-nodo1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Feb 26 13:08:22 LXD-nodo1 lxd[22448]: t=2021-02-26T13:08:22-0600 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
Feb 26 13:09:48 LXD-nodo1 lxd[22448]: t=2021-02-26T13:09:48-0600 lvl=warn msg="Failed connecting to global database (attempt 6): failed to create dqlite connection: no available dqlite leader server found"
Feb 26 13:10:01 LXD-nodo1 lxd[22448]: t=2021-02-26T13:10:01-0600 lvl=warn msg="Failed connecting to global database (attempt 7): failed to create dqlite connection: no available dqlite leader server found"

stgraber commented 3 years ago

Is that a cluster? I'm assuming you already tried systemctl restart lxd?

respadas commented 3 years ago

Hi,

It is not a cluster, and yes, I ran the restart, but the command hangs. If I check the status, I get this output:

root@LXD-nodo1:~# systemctl status lxd
● lxd.service - LXD - main daemon
   Loaded: loaded (/lib/systemd/system/lxd.service; indirect; vendor preset: enabled)
   Active: activating (start-post) since Fri 2021-02-26 14:23:49 CST; 47s ago
     Docs: man:lxd(1)
  Process: 989 ExecStartPre=/usr/lib/x86_64-linux-gnu/lxc/lxc-apparmor-load (code=exited, status=0/SUCCESS)
 Main PID: 999 (lxd); Control PID: 1000 (lxd)
    Tasks: 38
   CGroup: /system.slice/lxd.service
           ├─  785 [lxc monitor] /var/lib/lxd/containers
           ├─  999 /usr/lib/lxd/lxd --group lxd --logfile=/var/log/lxd/lxd.log
           ├─ 1000 /usr/lib/lxd/lxd waitready --timeout=600
           ├─ 1508 [lxc monitor] /var/lib/lxd/containers
           ├─ 2931 [lxc monitor] /var/lib/lxd/containers
           ├─ 3691 [lxc monitor] /var/lib/lxd/containers
           ├─ 7871 [lxc monitor] /var/lib/lxd/containers
           ├─10423 [lxc monitor] /var/lib/lxd/containers
           ├─11942 [lxc monitor] /var/lib/lxd/containers
           ├─15347 [lxc monitor] /var/lib/lxd/containers
           ├─18179 [lxc monitor] /var/lib/lxd/containers
           ├─19380 [lxc monitor] /var/lib/lxd/containers
           ├─19626 [lxc monitor] /var/lib/lxd/containers
           ├─19695 [lxc monitor] /var/lib/lxd
           ├─19732 dnsmasq --strict-order --bind-interfaces --pid-file=/var/lib/lxd/networks/lxdfan0/dnsmasq.pid --except-interface=lo --interface=lxdfan0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address
           ├─25238 [lxc monitor] /var/lib/lxd/containers
           ├─28249 [lxc monitor] /var/lib/lxd/containers
           ├─30026 [lxc monitor] /var/lib/lxd/containers
           ├─32200 [lxc monitor] /var/lib/lxd/containers
           └─32552 [lxc monitor] /var/lib/lxd/containers

Feb 26 14:23:49 LXD-nodo1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Feb 26 14:23:49 LXD-nodo1 systemd[1]: lxd.service: Found left-over process 28249 (lxd) in control group while starting unit. Ignoring.
Feb 26 14:23:49 LXD-nodo1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Feb 26 14:23:49 LXD-nodo1 systemd[1]: lxd.service: Found left-over process 10423 (lxd) in control group while starting unit. Ignoring.
Feb 26 14:23:49 LXD-nodo1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Feb 26 14:23:49 LXD-nodo1 systemd[1]: lxd.service: Found left-over process 19626 (lxd) in control group while starting unit. Ignoring.
Feb 26 14:23:49 LXD-nodo1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Feb 26 14:23:49 LXD-nodo1 systemd[1]: lxd.service: Found left-over process 25238 (lxd) in control group while starting unit. Ignoring.
Feb 26 14:23:49 LXD-nodo1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Feb 26 14:23:49 LXD-nodo1 lxd[999]: t=2021-02-26T14:23:49-0600 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."

stgraber commented 3 years ago

Can you show ps fauxww and sqlite3 /var/lib/lxd/database/local.db "SELECT * FROM raft_nodes;"?

respadas commented 3 years ago

Sure!

ps_output.log

root@LXD-nodo1:~# sqlite3 /var/lib/lxd/database/local.db "SELECT * FROM raft_nodes;"
1|10.0.2.4:8443

stgraber commented 3 years ago

Right, so your system is set up as a cluster: a one-node cluster, but still a cluster. Is your machine actually reachable at 10.0.2.4?
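
For anyone who wants to script the same check, here is a minimal sketch of reading the raft_nodes table with Python's standard sqlite3 module. The function name list_raft_nodes is hypothetical, and the column names id and address are an assumption inferred from the two-field output shown earlier in the thread, not confirmed against the LXD 3.0 schema. The database is opened read-only so the check cannot modify a production file.

```python
import sqlite3


def list_raft_nodes(db_path):
    """Return (id, address) rows from the raft_nodes table of a local LXD database.

    Assumption: the table has columns named "id" and "address", as the
    "1|10.0.2.4:8443" output in this thread suggests.
    """
    # mode=ro opens the file read-only; the path must not contain '?' or '#'
    # because it is embedded in a URI here.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        cursor = conn.execute("SELECT id, address FROM raft_nodes ORDER BY id")
        return cursor.fetchall()
    finally:
        conn.close()
```

Run as root against /var/lib/lxd/database/local.db, this should return the same rows as the sqlite3 command; a row carrying a network address is what pointed to clustering being enabled in this case.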

respadas commented 3 years ago

Hi, yes it's reachable.

stgraber commented 3 years ago

What does nc -v 10.0.2.4 8443 get you?

respadas commented 3 years ago

root@LXD-nodo1:~# nc -v 10.0.2.4 8443
Connection to 10.0.2.4 8443 port [tcp/*] succeeded!
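
A rough Python equivalent of this reachability test, as a sketch only: the helper name tcp_port_open and the 3-second default timeout are illustrative choices, not part of LXD or of the original check. A successful connect only shows that something is listening on the port; it says nothing about whether the database layer behind it is healthy.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For the server in this thread the call would be tcp_port_open("10.0.2.4", 8443).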

stgraber commented 3 years ago

Okay, that's odd.

Can you do:

See what that gets stuck on?

respadas commented 3 years ago

root@LXD-nodo1:~# sudo lxd --debug --group lxd
DBUG[02-26|18:21:29] Connecting to a local LXD over a Unix socket
DBUG[02-26|18:21:29] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
INFO[02-26|18:21:29] LXD 3.0.3 is starting in normal mode path=/var/lib/lxd
INFO[02-26|18:21:29] Kernel uid/gid map:
INFO[02-26|18:21:29] - u 0 0 4294967295
INFO[02-26|18:21:29] - g 0 0 4294967295
INFO[02-26|18:21:29] Configured LXD uid/gid map:
INFO[02-26|18:21:29] - u 0 100000 65536
INFO[02-26|18:21:29] - g 0 100000 65536
WARN[02-26|18:21:29] CGroup memory swap accounting is disabled, swap limits will be ignored.
INFO[02-26|18:21:29] Kernel features:
INFO[02-26|18:21:29] - netnsid-based network retrieval: yes
INFO[02-26|18:21:29] - unprivileged file capabilities: yes
INFO[02-26|18:21:29] Initializing local database
DBUG[02-26|18:21:29] Initializing database gateway
DBUG[02-26|18:21:29] Connecting to a local LXD over a Unix socket
DBUG[02-26|18:21:29] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
DBUG[02-26|18:21:29] Detected stale unix socket, deleting
DBUG[02-26|18:21:29] Detected stale unix socket, deleting
INFO[02-26|18:21:29] Starting /dev/lxd handler:
INFO[02-26|18:21:29] - binding devlxd socket socket=/var/lib/lxd/devlxd/sock
INFO[02-26|18:21:29] REST API daemon:
INFO[02-26|18:21:29] - binding Unix socket socket=/var/lib/lxd/unix.socket
INFO[02-26|18:21:29] - binding TCP socket socket=[::]:8443
INFO[02-26|18:21:29] Initializing global database
DBUG[02-26|18:21:29] Found cert k=0
DBUG[02-26|18:21:29] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=0
DBUG[02-26|18:21:29] Dqlite: connection failed err=no available dqlite leader server found attempt=0
DBUG[02-26|18:21:29] Found cert k=0
DBUG[02-26|18:21:29] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=1
DBUG[02-26|18:21:29] Dqlite: connection failed err=no available dqlite leader server found attempt=1
DBUG[02-26|18:21:30] Found cert k=0
DBUG[02-26|18:21:30] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=2
DBUG[02-26|18:21:30] Dqlite: connection failed err=no available dqlite leader server found attempt=2
DBUG[02-26|18:21:30] Found cert k=0
DBUG[02-26|18:21:30] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=3
DBUG[02-26|18:21:30] Dqlite: connection failed err=no available dqlite leader server found attempt=3
DBUG[02-26|18:21:31] Found cert k=0
DBUG[02-26|18:21:31] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=4
DBUG[02-26|18:21:31] Dqlite: connection failed err=no available dqlite leader server found attempt=4
DBUG[02-26|18:21:32] Found cert k=0
DBUG[02-26|18:21:32] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=5
DBUG[02-26|18:21:32] Dqlite: connection failed err=no available dqlite leader server found attempt=5
DBUG[02-26|18:21:33] Found cert k=0
DBUG[02-26|18:21:33] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=6
DBUG[02-26|18:21:33] Dqlite: connection failed err=no available dqlite leader server found attempt=6
DBUG[02-26|18:21:34] Found cert k=0
DBUG[02-26|18:21:34] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=7
DBUG[02-26|18:21:34] Dqlite: connection failed err=no available dqlite leader server found attempt=7
DBUG[02-26|18:21:35] Found cert k=0
DBUG[02-26|18:21:35] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=8
DBUG[02-26|18:21:35] Dqlite: connection failed err=no available dqlite leader server found attempt=8
DBUG[02-26|18:21:36] Found cert k=0
DBUG[02-26|18:21:36] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=9
DBUG[02-26|18:21:36] Dqlite: connection failed err=no available dqlite leader server found attempt=9
DBUG[02-26|18:21:37] Found cert k=0
DBUG[02-26|18:21:37] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=10
DBUG[02-26|18:21:37] Dqlite: connection failed err=no available dqlite leader server found attempt=10
DBUG[02-26|18:21:38] Found cert k=0
DBUG[02-26|18:21:38] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=11
DBUG[02-26|18:21:38] Dqlite: connection failed err=no available dqlite leader server found attempt=11
DBUG[02-26|18:21:39] Failed connecting to global database (attempt 0): failed to create dqlite connection: no available dqlite leader server found
DBUG[02-26|18:21:41] Found cert k=0
DBUG[02-26|18:21:41] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=0
DBUG[02-26|18:21:41] Dqlite: connection failed err=no available dqlite leader server found attempt=0
DBUG[02-26|18:21:42] Found cert k=0
DBUG[02-26|18:21:42] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=1
DBUG[02-26|18:21:42] Dqlite: connection failed err=no available dqlite leader server found attempt=1
DBUG[02-26|18:21:42] Found cert k=0
DBUG[02-26|18:21:42] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=2
DBUG[02-26|18:21:42] Dqlite: connection failed err=no available dqlite leader server found attempt=2
DBUG[02-26|18:21:42] Found cert k=0
DBUG[02-26|18:21:42] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=3
DBUG[02-26|18:21:42] Dqlite: connection failed err=no available dqlite leader server found attempt=3
DBUG[02-26|18:21:43] Found cert k=0
DBUG[02-26|18:21:43] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=4
DBUG[02-26|18:21:43] Dqlite: connection failed err=no available dqlite leader server found attempt=4
DBUG[02-26|18:21:44] Found cert k=0
DBUG[02-26|18:21:44] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=10.0.2.4:8443 attempt=5
DBUG[02-26|18:21:44] Dqlite: connection failed err=no available dqlite leader server found attempt=5
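
Since the debug output repeats the same failure many times, a small script can condense it. The following is a sketch only: the function name summarize_dqlite_failures is hypothetical, and the regular expression assumes the "err=... address=... attempt=N" field layout seen in the output above rather than any documented LXD log format.

```python
import re
from collections import Counter

# Matches the fields of a "Dqlite: server connection failed" debug entry.
# \s+ also tolerates entries that were wrapped across a line break.
_DQLITE_FAILURE = re.compile(
    r"Dqlite: server connection failed\s+"
    r"err=(?P<err>.*?)\s+address=(?P<address>\S+)\s+attempt=(?P<attempt>\d+)"
)


def summarize_dqlite_failures(log_text):
    """Count how often each (address, error) pair appears in LXD debug output."""
    counts = Counter()
    for match in _DQLITE_FAILURE.finditer(log_text):
        counts[(match.group("address"), match.group("err"))] += 1
    return counts
```

Applied to the output above, every counted entry would carry the same address (10.0.2.4:8443) and the same error text ("some nodes are behind this node's version"), which makes it easier to see that the daemon is failing for one consistent reason rather than intermittently.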

stgraber commented 3 years ago

Just to be sure there's nothing weird going on, what does ip -4 route get 10.0.2.4 get you?

respadas commented 3 years ago

root@LXD-nodo1:~# ip -4 route get 10.0.2.4
local 10.0.2.4 dev lo src 10.0.2.4 uid 0
    cache

stgraber commented 3 years ago

Can you send me a tarball of /var/lib/lxd/database at stgraber at ubuntu dot com? I'll try to reproduce the issue here and get it back online.

It'd have been pretty trivial to resolve this on LXD 4.0 but 3.0 is missing much of the newer clustering tooling.

respadas commented 3 years ago

Files sent, thank you.

stgraber commented 3 years ago

Sent you a manually re-created version of your database with clustering disabled. It's loading fine here and I can see your containers (24 of them, using btrfs storage pool).

respadas commented 3 years ago

Thank you very much, the re-created database is working fine.

Regards,