canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd

LXD instance fails to start, can't open cluster database - but is not a cluster #6036

Closed: rigred closed this issue 5 years ago

rigred commented 5 years ago

Issue description

The LXD instance won't start from the systemd service or from snap; it gives the log info below. Only starting it manually with lxd --group lxd works.

Aug 01 19:14:53 hetz-hel1 lxd.daemon[6734]: => Preparing the system
Aug 01 19:14:53 hetz-hel1 lxd.daemon[6734]: ==> Loading snap configuration
Aug 01 19:14:53 hetz-hel1 lxd.daemon[6734]: ==> Setting up mntns symlink (mnt:[4026532298])
Aug 01 19:14:53 hetz-hel1 lxd.daemon[6734]: ==> Setting up kmod wrapper
Aug 01 19:14:53 hetz-hel1 lxd.daemon[6734]: ==> Preparing /boot
Aug 01 19:14:54 hetz-hel1 lxd.daemon[6734]: ==> Preparing a clean copy of /run
Aug 01 19:14:54 hetz-hel1 lxd.daemon[6734]: ==> Preparing a clean copy of /etc
Aug 01 19:14:54 hetz-hel1 lxd.daemon[6734]: ==> Setting up ceph configuration
Aug 01 19:14:54 hetz-hel1 lxd.daemon[6734]: ==> Setting up LVM configuration
Aug 01 19:14:54 hetz-hel1 lxd.daemon[6734]: ==> Rotating logs
Aug 01 19:14:54 hetz-hel1 lxd.daemon[6734]: ==> Setting up ZFS (0.7)
Aug 01 19:14:54 hetz-hel1 lxd.daemon[6734]: ==> Escaping the systemd cgroups
Aug 01 19:14:54 hetz-hel1 lxd.daemon[6734]: ====> Detected cgroup V1
Aug 01 19:14:54 hetz-hel1 lxd.daemon[6734]: ==> Escaping the systemd process resource limits
Aug 01 19:14:54 hetz-hel1 lxd.daemon[6734]: ==> Enabling CRIU
Aug 01 19:14:54 hetz-hel1 lxd.daemon[6734]: ==> Disabling shiftfs on this kernel (auto)
Aug 01 19:14:54 hetz-hel1 lxd.daemon[6734]: => Re-using existing LXCFS
Aug 01 19:14:54 hetz-hel1 lxd.daemon[6734]: => Starting LXD
Aug 01 19:15:01 hetz-hel1 lxd.daemon[6734]: t=2019-08-01T19:15:01+0200 lvl=eror msg="Failed to start the daemon: failed to open cluster database: failed to ensure schema: failed to update node version info: database is locked"
Aug 01 19:15:01 hetz-hel1 lxd.daemon[6734]: Error: failed to open cluster database: failed to ensure schema: failed to update node version info: database is locked
Aug 01 19:15:01 hetz-hel1 lxd.daemon[6734]: => LXD failed to start

Steps to reproduce

  1. Attempt to start/restart the mysteriously failing LXD snap via the systemd service or via snap restart/start lxd

Notes

  1. Starting the lxd daemon manually via lxd --group lxd works as normal (see the snippet below).
  2. I can send a dump of /var/snap/lxd/common/lxd/database.
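
For comparison, a sketch of the manual invocation that works (assuming the snap install; --debug added per a later comment for more output):

    sudo snap stop lxd             # make sure the service isn't also running
    sudo lxd --group lxd --debug   # run the daemon in the foreground
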
stgraber commented 5 years ago

Hmm, I don't get why starting it directly with lxd --group lxd would work but starting it through systemd wouldn't; normally both should fail, unless it's a random, non-persistent issue.

@freeekanayaka

freeekanayaka commented 5 years ago

@stgraber I presume that lxd --group lxd is using /var/lib/lxd vs /var/snap/lxd/common/lxd ?

freeekanayaka commented 5 years ago

@rigred can you send us the /var/snap/lxd/common/lxd/logs/lxd.log too?

rigred commented 5 years ago

I don't think it's using /var/lib/lxd, as nothing there has been touched recently.

ls -lah /var/lib/lxd
drwxr-xr-x  4 lxd  nogroup    5 Jul 31 12:00 .
drwxr-xr-x 33 root root      33 Jul 31 12:13 ..
drwxr-xr-x  2 root root       2 Aug 27  2018 devlxd
-rw-r--r--  1 root root    1.9K Aug 27  2018 server.crt
drwx--x--x  2 root root       2 Aug 27  2018 shmounts

And based on journalctl data it's definitely running out of /var/snap/lxd/common/lxd/

The log you asked for is quite limited; I've replaced the public IP with IP_ADDR.

t=2019-08-01T20:27:29+0200 lvl=info msg="LXD 3.15 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2019-08-01T20:27:29+0200 lvl=info msg="Kernel uid/gid map:" 
t=2019-08-01T20:27:29+0200 lvl=info msg=" - u 0 0 4294967295" 
t=2019-08-01T20:27:29+0200 lvl=info msg=" - g 0 0 4294967295" 
t=2019-08-01T20:27:29+0200 lvl=info msg="Configured LXD uid/gid map:" 
t=2019-08-01T20:27:29+0200 lvl=info msg=" - u 0 1000000 1000000000" 
t=2019-08-01T20:27:29+0200 lvl=info msg=" - g 0 1000000 1000000000" 
t=2019-08-01T20:27:29+0200 lvl=info msg="Kernel features:" 
t=2019-08-01T20:27:29+0200 lvl=info msg=" - netnsid-based network retrieval: no" 
t=2019-08-01T20:27:29+0200 lvl=info msg=" - uevent injection: no" 
t=2019-08-01T20:27:29+0200 lvl=info msg=" - seccomp listener: no" 
t=2019-08-01T20:27:29+0200 lvl=info msg=" - unprivileged file capabilities: yes" 
t=2019-08-01T20:27:29+0200 lvl=info msg=" - shiftfs support: no" 
t=2019-08-01T20:27:29+0200 lvl=info msg="Initializing local database" 
t=2019-08-01T20:27:29+0200 lvl=info msg="Starting /dev/lxd handler:" 
t=2019-08-01T20:27:29+0200 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2019-08-01T20:27:29+0200 lvl=info msg="REST API daemon:" 
t=2019-08-01T20:27:29+0200 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2019-08-01T20:27:29+0200 lvl=info msg=" - binding TCP socket" socket=IP_ADDR:8443
t=2019-08-01T20:27:29+0200 lvl=info msg="Initializing global database" 
t=2019-08-01T20:27:36+0200 lvl=eror msg="Failed to start the daemon: failed to open cluster database: failed to ensure schema: failed to update node version info: database is locked" 
t=2019-08-01T20:27:36+0200 lvl=info msg="Starting shutdown sequence" 
t=2019-08-01T20:27:36+0200 lvl=info msg="Stopping REST API handler:" 
t=2019-08-01T20:27:36+0200 lvl=info msg=" - closing socket" socket=IP_ADDR:8443
t=2019-08-01T20:27:36+0200 lvl=info msg=" - closing socket" socket=/var/snap/lxd/common/lxd/unix.socket
t=2019-08-01T20:27:36+0200 lvl=info msg="Stopping /dev/lxd handler:" 
t=2019-08-01T20:27:36+0200 lvl=info msg=" - closing socket" socket=/var/snap/lxd/common/lxd/devlxd/sock

Where can I email the database info instead of posting it here? I can send you the full output of lxd --group lxd, which prints A LOT of info.

rigred commented 5 years ago

Just for clarity, this system doesn't have any apt lxd/lxc/lxcfs remnants. lxd points to /snap/bin/lxd
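
A quick way to double-check that (assumed commands, not shown in the thread):

    which lxd                      # should print /snap/bin/lxd
    dpkg -l | grep -E 'lxc|lxd'    # should list no apt packages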

freeekanayaka commented 5 years ago

You can email the database to me at free dot ekanayaka at canonical dot com.

One possibility is that the lxd daemon is racing with something else (perhaps lxd wait) when started via systemd/snapd. Would it be possible for you to make sure that absolutely no LXD-related process is running, and then try starting the systemd unit again? systemctl start snap.lxd.daemon
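
A sketch of that check (assumed commands, adjust as needed):

    ps aux | grep '[l]xd'                    # confirm nothing LXD-related is running
    sudo systemctl start snap.lxd.daemon     # then start the unit again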

rigred commented 5 years ago

So I have figured out part of the problem.

The nodes table in db.bin for some reason had an entry pointing at the public IP of the server instead of 0.0.0.0, as it does on a good local install.

This apparently confused the (newly?) added cluster check.

Similarly, raft_nodes in local.db contained an entry for the system's public IP with port 8443. So what I did was (sketched after the list):

  1. Removed the WAL files
  2. Changed the nodes table entry to use 0.0.0.0 and none for the name.
  3. Removed the raft_nodes entry.
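
A sketch of those edits, reconstructed from the description above (paths per the snap install; run only with LXD fully stopped and with backups taken first):

    cd /var/snap/lxd/common/lxd/database
    rm global/db.bin-wal                         # 1. remove the WAL file
    sqlite3 global/db.bin "UPDATE nodes SET name='none', address='0.0.0.0' WHERE id=1;"   # 2.
    sqlite3 local.db "DELETE FROM raft_nodes;"   # 3. drop the raft_nodes entry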

However, I now have a new issue of the following type:

t=2019-08-06T13:30:51+0200 lvl=info msg="LXD 3.15 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2019-08-06T13:30:51+0200 lvl=info msg="Kernel uid/gid map:" 
t=2019-08-06T13:30:51+0200 lvl=info msg=" - u 0 0 4294967295" 
t=2019-08-06T13:30:51+0200 lvl=info msg=" - g 0 0 4294967295" 
t=2019-08-06T13:30:51+0200 lvl=info msg="Configured LXD uid/gid map:" 
t=2019-08-06T13:30:51+0200 lvl=info msg=" - u 0 1000000 1000000000" 
t=2019-08-06T13:30:51+0200 lvl=info msg=" - g 0 1000000 1000000000" 
t=2019-08-06T13:30:51+0200 lvl=info msg="Kernel features:" 
t=2019-08-06T13:30:51+0200 lvl=info msg=" - netnsid-based network retrieval: yes" 
t=2019-08-06T13:30:51+0200 lvl=info msg=" - uevent injection: yes" 
t=2019-08-06T13:30:51+0200 lvl=info msg=" - seccomp listener: yes" 
t=2019-08-06T13:30:51+0200 lvl=info msg=" - unprivileged file capabilities: yes" 
t=2019-08-06T13:30:51+0200 lvl=info msg=" - shiftfs support: no" 
t=2019-08-06T13:30:51+0200 lvl=info msg="Initializing local database" 
t=2019-08-06T13:30:51+0200 lvl=info msg="Starting /dev/lxd handler:" 
t=2019-08-06T13:30:51+0200 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2019-08-06T13:30:51+0200 lvl=info msg="REST API daemon:" 
t=2019-08-06T13:30:51+0200 lvl=info msg=" - binding Unix socket" socket=/var/snap/lxd/common/lxd/unix.socket
t=2019-08-06T13:30:51+0200 lvl=info msg=" - binding TCP socket" socket=[::]:8443
t=2019-08-06T13:30:51+0200 lvl=info msg="Initializing global database" 
t=2019-08-06T13:32:16+0200 lvl=warn msg="Failed connecting to global database (attempt 6): failed to create dqlite connection: no available dqlite leader server found" 
---- The prior line keeps repeating over and over ----

The LXD daemon starts and listens on [::]:8443 but apparently fails to connect to the dqlite leader server, which I presume is some other cluster remnant?

Running lxd --debug --group lxd prints much the same info without extra clues.

Here's a dump of the local.db file.
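
A dump like this can be produced with something like (assumed invocation):

    sqlite3 /var/snap/lxd/common/lxd/database/local.db .dump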

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE schema (
    id         INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    version    INTEGER NOT NULL,
    updated_at DATETIME NOT NULL,
    UNIQUE (version)
);
INSERT INTO schema VALUES(1,37,1535897343);
INSERT INTO schema VALUES(2,38,1565087616);
CREATE TABLE config (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    key VARCHAR(255) NOT NULL,
    value TEXT,
    UNIQUE (key)
);
INSERT INTO config VALUES(4,'core.https_address','[::]:8443');
CREATE TABLE patches (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    name VARCHAR(255) NOT NULL,
    applied_at DATETIME NOT NULL,
    UNIQUE (name)
);
INSERT INTO patches VALUES(1,'invalid_profile_names',1535897343);
INSERT INTO patches VALUES(2,'leftover_profile_config',1535897343);
INSERT INTO patches VALUES(3,'network_permissions',1535897343);
INSERT INTO patches VALUES(4,'storage_api',1535897343);
INSERT INTO patches VALUES(5,'storage_api_v1',1535897343);
INSERT INTO patches VALUES(6,'storage_api_dir_cleanup',1535897343);
INSERT INTO patches VALUES(7,'storage_api_lvm_keys',1535897343);
INSERT INTO patches VALUES(8,'storage_api_keys',1535897343);
INSERT INTO patches VALUES(9,'storage_api_update_storage_configs',1535897343);
INSERT INTO patches VALUES(10,'storage_api_lxd_on_btrfs',1535897343);
INSERT INTO patches VALUES(11,'storage_api_lvm_detect_lv_size',1535897343);
INSERT INTO patches VALUES(12,'storage_api_insert_zfs_driver',1535897343);
INSERT INTO patches VALUES(13,'storage_zfs_noauto',1535897343);
INSERT INTO patches VALUES(14,'storage_zfs_volume_size',1535897343);
INSERT INTO patches VALUES(15,'network_dnsmasq_hosts',1535897343);
INSERT INTO patches VALUES(16,'storage_api_dir_bind_mount',1535897343);
INSERT INTO patches VALUES(17,'fix_uploaded_at',1535897343);
INSERT INTO patches VALUES(18,'storage_api_ceph_size_remove',1535897343);
INSERT INTO patches VALUES(19,'devices_new_naming_scheme',1535897343);
INSERT INTO patches VALUES(20,'storage_api_permissions',1535897343);
INSERT INTO patches VALUES(21,'container_config_regen',1535897343);
INSERT INTO patches VALUES(22,'lvm_node_specific_config_keys',1535897343);
INSERT INTO patches VALUES(23,'candid_rename_config_key',1535897343);
INSERT INTO patches VALUES(24,'move_backups',1537189625);
INSERT INTO patches VALUES(25,'storage_api_rename_container_snapshots_dir',1539569643);
INSERT INTO patches VALUES(26,'shrink_logs_db_file',1541542179);
INSERT INTO patches VALUES(27,'storage_api_rename_container_snapshots_links',1542063856);
CREATE TABLE raft_nodes (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    address TEXT NOT NULL,
    UNIQUE (address)
);
DELETE FROM sqlite_sequence;
INSERT INTO sqlite_sequence VALUES('schema',2);
INSERT INTO sqlite_sequence VALUES('patches',27);
INSERT INTO sqlite_sequence VALUES('config',4);
COMMIT;

This is from the global/db.bin

SELECT * FROM nodes;
1|none||0.0.0.0|14|133|2019-07-18 05:15:38.498156731+00:00|0
SELECT * FROM networks_nodes;
4|4|1
SELECT * FROM networks_config;
46|4||ipv4.address|10.197.134.1/24
47|4||ipv4.firewall|false
48|4||ipv4.nat|true
49|4||ipv6.address|fd42:52a6:f1cf:6e7a::1/64
50|4||ipv6.nat|true

Let me know if there's something else you need, or if I'm even on the right track.

freeekanayaka commented 5 years ago

A few questions:

  1. What do you mean exactly by "Removed the WAL files"? Which files did you remove?
  2. What do you mean exactly by "changed the nodes table entry to use 0.0.0.0 and none for name"? How did you do that?

Note that any modification to db.bin* is going to be a no-op, since those files are only a dump of the raft state and get rebuilt.

Also, if you had a row in the raft_nodes table, it means that this LXD instance was clustered (although the title of the issue says it was not). Perhaps it was clustered but the cluster had only one member (this instance itself)?
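
Following that logic, one way to check whether an instance considers itself clustered (a sketch; path per the snap install):

    sqlite3 /var/snap/lxd/common/lxd/database/local.db "SELECT * FROM raft_nodes;"
    # any row printed here means this instance was set up as a cluster member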

rigred commented 5 years ago

Just to make that clear, I have a copy of all the original files; I have just emailed them to you.

Regarding your questions:

  1. In database/global/ there presently exists:

    -rw-r--r-- 1 root root 452K Aug  6 12:45 db.bin
    -rw-r--r-- 1 root root 2.2M Aug  6 12:45 db.bin-wal
    -rw-r--r-- 1 root root  28M Aug  6 12:45 logs.db
    drwxr-xr-x 4 root root    4 Aug  6 12:45 snapshots

    The db.bin write-ahead log doesn't exist on my healthy systems. (I know it's not a good idea to remove a WAL, but I tried it anyway as a test.) Restarting LXD normally rebuilt db.bin with the older values; removing the WAL stopped db.bin from being rebuilt with those older values.

  2. Stopped all LXD services and sockets

  3. Opened global/db.bin with sqlite3 and compared it to a local LXD reference

  4. Looked at the nodes table

  5. Ran UPDATE nodes SET address = '0.0.0.0' WHERE id = 1;

  6. Committed the changes and exited

  7. Restarted lxd

    
    This is purely me poking around in the dark, but I cross-referenced it against this check:
    https://github.com/lxc/lxd/blob/6decb1bfad667f23e7a5534b86df0efc096ca0a1/lxd/db/cluster/open.go#L112

But as you say, chances are that somehow this lxd instance was set up as some sort of strange single-member cluster. (I did not set up this cluster originally, and once it broke it's hard to check the configuration without digging through the database.)

rigred commented 5 years ago

So to solve this, what I would like to accomplish is to uncluster this instance, prevent snap auto-refreshes, and do them manually only after I've verified against a local reference system.
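
For the refresh part, snapd can postpone automatic refreshes system-wide (a sketch; refresh.hold takes an RFC 3339 timestamp and was then capped at roughly 60 days out):

    sudo snap set system refresh.hold="$(date --date='30 days' +%Y-%m-%dT%H:%M:%S%:z)"
    sudo snap refresh lxd    # then refresh manually when ready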

I've had several lxd instances break with the 3.15 auto-refresh (those were easily fixed: https://github.com/lxc/lxd/issues/6016). But the one testing instance this issue is about has been giving one issue after another, yet doesn't seem functionally very different from all the others.

stgraber commented 5 years ago

You can't disable clustering the way you've done it. The raft database itself has a record of the nodes, so modifying nodes/raft_nodes in that way effectively just prevents it from connecting.

Mind sending your pre-modification database tarball to stgraber at ubuntu dot com?

I'll take a quick look at how much of the clustering was configured and what may be needed to have it come back online.

rigred commented 5 years ago

I've just sent one to freeekanayaka; now sent to you as well.

stgraber commented 5 years ago

Thanks, I responded to your e-mail. Your database loaded just fine here once I replicated the IP address of your node on my system. I'd recommend you:

If all that looks good, then we can try restarting the whole thing again and see what happens:

This will also enable debug logging just in case things go bad again.

rigred commented 5 years ago

Ok, so snap list:

Name  Version    Rev    Tracking  Publisher   Notes
core  16-2.39.3  7270   stable    canonical*  core
lxd   3.15       11437  stable    canonical*  -

Nothing odd in the process tree:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         2  0.0  0.0      0     0 ?        S    11:47   0:01 [kthreadd]
root         3  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [rcu_gp]
root         4  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [rcu_par_gp]
root         6  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kworker/0:0H-kb]
root         8  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [mm_percpu_wq]
root         9  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [ksoftirqd/0]
root        10  0.0  0.0      0     0 ?        I    11:47   0:21  \_ [rcu_sched]
root        11  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [migration/0]
root        12  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [idle_inject/0]
root        14  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [cpuhp/0]
root        15  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [cpuhp/1]
root        16  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [idle_inject/1]
root        17  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [migration/1]
root        18  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [ksoftirqd/1]
root        20  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kworker/1:0H-kb]
root        21  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [cpuhp/2]
root        22  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [idle_inject/2]
root        23  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [migration/2]
root        24  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [ksoftirqd/2]
root        26  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kworker/2:0H-kb]
root        27  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [cpuhp/3]
root        28  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [idle_inject/3]
root        29  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [migration/3]
root        30  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [ksoftirqd/3]
root        32  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kworker/3:0H-kb]
root        33  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [cpuhp/4]
root        34  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [idle_inject/4]
root        35  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [migration/4]
root        36  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [ksoftirqd/4]
root        38  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kworker/4:0H-ev]
root        39  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [cpuhp/5]
root        40  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [idle_inject/5]
root        41  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [migration/5]
root        42  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [ksoftirqd/5]
root        44  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kworker/5:0H-kb]
root        45  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [cpuhp/6]
root        46  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [idle_inject/6]
root        47  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [migration/6]
root        48  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [ksoftirqd/6]
root        50  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kworker/6:0H-kb]
root        51  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [cpuhp/7]
root        52  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [idle_inject/7]
root        53  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [migration/7]
root        54  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [ksoftirqd/7]
root        56  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kworker/7:0H-kb]
root        57  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [kdevtmpfs]
root        58  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [netns]
root        59  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [rcu_tasks_kthre]
root        60  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [kauditd]
root        62  0.0  0.0      0     0 ?        I    11:47   0:01  \_ [kworker/2:1-mm_]
root        63  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [khungtaskd]
root        64  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [oom_reaper]
root        65  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [writeback]
root        66  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [kcompactd0]
root        67  0.4  0.0      0     0 ?        SN   11:47   1:46  \_ [ksmd]
root        68  0.0  0.0      0     0 ?        SN   11:47   0:00  \_ [khugepaged]
root        69  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [crypto]
root        70  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kintegrityd]
root        71  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kblockd]
root        72  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [tpm_dev_wq]
root        73  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [ata_sff]
root        74  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [md]
root        75  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [edac-poller]
root        76  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [devfreq_wq]
root        77  0.0  0.0      0     0 ?        I    11:47   0:01  \_ [kworker/4:1-eve]
root        78  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [watchdogd]
root        83  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [kswapd0]
root        85  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [ecryptfs-kthrea]
root       176  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kthrotld]
root       178  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [acpi_thermal_pm]
root       179  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [ipv6_addrconf]
root       190  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kstrp]
root       209  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [charger_manager]
root       279  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [scsi_eh_0]
root       280  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [scsi_tmf_0]
root       305  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [scsi_eh_1]
root       306  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [scsi_tmf_1]
root       311  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [scsi_eh_2]
root       313  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [scsi_tmf_2]
root       317  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [scsi_eh_3]
root       320  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [scsi_tmf_3]
root       321  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [scsi_eh_4]
root       322  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [scsi_tmf_4]
root       323  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [scsi_eh_5]
root       325  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [scsi_tmf_5]
root       332  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kworker/2:1H]
root       333  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [i915/signal:0]
root       334  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [i915/signal:1]
root       335  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [i915/signal:2]
root       336  0.0  0.0      0     0 ?        S    11:47   0:00  \_ [i915/signal:6]
root       340  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kworker/4:1H-kb]
root       341  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kworker/7:1H-kb]
root       342  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kworker/6:1H-kb]
root       418  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kworker/3:1H-kb]
root       432  0.0  0.0      0     0 ?        I<   11:47   0:00  \_ [kworker/5:1H-kb]
root       449  0.0  0.0      0     0 ?        I<   11:48   0:00  \_ [kworker/1:1H-kb]
root       464  0.0  0.0      0     0 ?        I<   11:48   0:00  \_ [kdmflush]
root       466  0.0  0.0      0     0 ?        I<   11:48   0:00  \_ [kcryptd_io/253:]
root       467  0.0  0.0      0     0 ?        I<   11:48   0:00  \_ [kcryptd/253:0]
root       468  0.0  0.0      0     0 ?        S    11:48   0:09  \_ [dmcrypt_write/2]
root       486  0.0  0.0      0     0 ?        I<   11:48   0:00  \_ [kdmflush]
root       488  0.0  0.0      0     0 ?        I<   11:48   0:00  \_ [kcryptd_io/253:]
root       489  0.0  0.0      0     0 ?        I<   11:48   0:00  \_ [kcryptd/253:1]
root       490  0.0  0.0      0     0 ?        S    11:48   0:08  \_ [dmcrypt_write/2]
root       563  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [spl_system_task]
root       564  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [spl_delay_taskq]
root       565  0.0  0.0      0     0 ?        S<   11:48   0:01  \_ [spl_dynamic_tas]
root       566  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [spl_kmem_cache]
root       577  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [zvol]
root       583  0.0  0.0      0     0 ?        S    11:48   0:00  \_ [arc_prune]
root       584  0.0  0.0      0     0 ?        S    11:48   0:00  \_ [arc_reclaim]
root       588  0.0  0.0      0     0 ?        S    11:48   0:01  \_ [dbu_evict]
root       589  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dbuf_evict]
root       592  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [z_vdev_file]
root       593  0.0  0.0      0     0 ?        S    11:48   0:01  \_ [l2arc_feed]
root       620  0.0  0.0      0     0 ?        I<   11:48   0:00  \_ [kworker/0:1H-kb]
root       688  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_null_iss]
root       689  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_null_int]
root       690  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_rd_iss]
root       691  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_rd_int_0]
root       692  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_rd_int_1]
root       693  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_rd_int_2]
root       694  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_rd_int_3]
root       695  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_rd_int_4]
root       696  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_rd_int_5]
root       697  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_rd_int_6]
root       698  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_rd_int_7]
root       699  0.0  0.0      0     0 ?        S<   11:48   0:04  \_ [z_wr_iss]
root       700  0.0  0.0      0     0 ?        S<   11:48   0:04  \_ [z_wr_iss]
root       701  0.0  0.0      0     0 ?        S<   11:48   0:04  \_ [z_wr_iss]
root       702  0.0  0.0      0     0 ?        S<   11:48   0:04  \_ [z_wr_iss]
root       703  0.0  0.0      0     0 ?        S<   11:48   0:04  \_ [z_wr_iss]
root       704  0.0  0.0      0     0 ?        S<   11:48   0:04  \_ [z_wr_iss]
root       705  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_wr_iss_h]
root       706  0.0  0.0      0     0 ?        S<   11:48   0:03  \_ [z_wr_int_0]
root       707  0.0  0.0      0     0 ?        S<   11:48   0:03  \_ [z_wr_int_1]
root       708  0.0  0.0      0     0 ?        S<   11:48   0:03  \_ [z_wr_int_2]
root       709  0.0  0.0      0     0 ?        S<   11:48   0:03  \_ [z_wr_int_3]
root       710  0.0  0.0      0     0 ?        S<   11:48   0:03  \_ [z_wr_int_4]
root       711  0.0  0.0      0     0 ?        S<   11:48   0:03  \_ [z_wr_int_5]
root       712  0.0  0.0      0     0 ?        S<   11:48   0:03  \_ [z_wr_int_6]
root       713  0.0  0.0      0     0 ?        S<   11:48   0:03  \_ [z_wr_int_7]
root       714  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_wr_int_h]
root       715  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_fr_iss_0]
root       716  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_fr_iss_1]
root       717  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_fr_iss_2]
root       718  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_fr_iss_3]
root       719  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_fr_iss_4]
root       720  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_fr_iss_5]
root       721  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_fr_iss_6]
root       722  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_fr_iss_7]
root       723  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_fr_int]
root       724  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_cl_iss]
root       725  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_cl_int]
root       726  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [z_ioctl_iss]
root       727  0.0  0.0      0     0 ?        S<   11:48   0:12  \_ [z_ioctl_int]
root       728  0.0  0.0      0     0 ?        S    11:48   0:00  \_ [z_zvol]
root       729  0.0  0.0      0     0 ?        S    11:48   0:00  \_ [z_prefetch]
root       730  0.0  0.0      0     0 ?        S    11:48   0:00  \_ [z_upgrade]
root       731  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [metaslab_group_]
root       737  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dp_sync_taskq]
root       738  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dp_sync_taskq]
root       739  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dp_sync_taskq]
root       740  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dp_sync_taskq]
root       741  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dp_sync_taskq]
root       742  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dp_sync_taskq]
root       743  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dp_zil_clean_ta]
root       744  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dp_zil_clean_ta]
root       745  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dp_zil_clean_ta]
root       746  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dp_zil_clean_ta]
root       747  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dp_zil_clean_ta]
root       748  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dp_zil_clean_ta]
root       749  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dp_zil_clean_ta]
root       750  0.0  0.0      0     0 ?        SN   11:48   0:00  \_ [dp_zil_clean_ta]
root       751  0.0  0.0      0     0 ?        S    11:48   0:00  \_ [z_iput]
root       802  0.0  0.0      0     0 ?        S    11:48   0:00  \_ [txg_quiesce]
root       803  0.0  0.0      0     0 ?        S    11:48   0:03  \_ [txg_sync]
root       804  0.0  0.0      0     0 ?        S    11:48   0:00  \_ [mmp]
root      1353  0.0  0.0      0     0 ?        S    11:48   0:00  \_ [irq/126-mei_me]
root      1504  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [loop1]
root      1507  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [loop2]
root      1509  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [loop3]
root      1511  0.0  0.0      0     0 ?        S<   11:48   0:00  \_ [loop4]
root      1523  0.0  0.0      0     0 ?        S    11:48   0:00  \_ [jbd2/sda1-8]
root      1524  0.0  0.0      0     0 ?        I<   11:48   0:00  \_ [ext4-rsv-conver]
root      1591  0.0  0.0   4504   544 ?        S    11:48   0:00  \_ bpfilter_umh
root      2790  0.0  0.0      0     0 ?        I    11:49   0:01  \_ [kworker/3:2-eve]
root      2966  0.0  0.0      0     0 ?        S<   11:49   0:00  \_ [z_rd_iss]
root      3587  0.0  0.0      0     0 ?        I<   11:49   0:00  \_ [cfg80211]
root     12768  0.0  0.0      0     0 ?        I    13:09   0:00  \_ [kworker/5:1-eve]
root     22733  0.0  0.0      0     0 ?        I    13:45   0:00  \_ [kworker/0:1-eve]
root     18519  0.0  0.0      0     0 ?        I<   17:15   0:01  \_ [kworker/u17:5-k]
root     23681  0.0  0.0      0     0 ?        I    17:20   0:00  \_ [kworker/7:1-eve]
root     25280  0.0  0.0      0     0 ?        I    17:20   0:00  \_ [kworker/1:0-eve]
root     11463  0.0  0.0      0     0 ?        I    17:35   0:00  \_ [kworker/2:2-rcu]
root     15558  0.0  0.0      0     0 ?        I<   17:51   0:00  \_ [kworker/u17:4-k]
root     16864  0.0  0.0      0     0 ?        I    17:56   0:00  \_ [kworker/4:2-cgr]
root     21029  0.0  0.0      0     0 ?        I<   18:06   0:00  \_ [kworker/u17:6-k]
root     23133  0.0  0.0      0     0 ?        I<   18:15   0:00  \_ [kworker/u17:2-k]
root     24948  0.0  0.0      0     0 ?        I    18:15   0:00  \_ [kworker/u16:0-e]
root     25787  0.0  0.0      0     0 ?        I    18:15   0:00  \_ [kworker/6:2-eve]
root     28189  0.0  0.0      0     0 ?        I    18:23   0:00  \_ [kworker/u16:2-e]
root     28192  0.0  0.0      0     0 ?        I    18:23   0:00  \_ [kworker/6:3-eve]
root     28193  0.0  0.0      0     0 ?        I    18:23   0:00  \_ [kworker/5:2-eve]
root     28195  0.0  0.0      0     0 ?        I    18:23   0:00  \_ [kworker/0:3-eve]
root     28212  0.0  0.0      0     0 ?        I    18:23   0:00  \_ [kworker/1:1-mm_]
root     28214  0.0  0.0      0     0 ?        I    18:23   0:00  \_ [kworker/7:2-rcu]
root     28451  0.0  0.0      0     0 ?        I    18:24   0:00  \_ [kworker/3:1-rcu]
root     29478  0.0  0.0      0     0 ?        I<   18:24   0:00  \_ [xfsalloc]
root     29482  0.0  0.0      0     0 ?        I<   18:24   0:00  \_ [xfs_mru_cache]
root     29490  0.0  0.0      0     0 ?        S    18:24   0:00  \_ [jfsIO]
root     29491  0.0  0.0      0     0 ?        S    18:24   0:00  \_ [jfsCommit]
root     29492  0.0  0.0      0     0 ?        S    18:24   0:00  \_ [jfsCommit]
root     29493  0.0  0.0      0     0 ?        S    18:24   0:00  \_ [jfsCommit]
root     29494  0.0  0.0      0     0 ?        S    18:24   0:00  \_ [jfsCommit]
root     29495  0.0  0.0      0     0 ?        S    18:24   0:00  \_ [jfsCommit]
root     29496  0.0  0.0      0     0 ?        S    18:24   0:00  \_ [jfsCommit]
root     29497  0.0  0.0      0     0 ?        S    18:24   0:00  \_ [jfsCommit]
root     29498  0.0  0.0      0     0 ?        S    18:24   0:00  \_ [jfsCommit]
root     29499  0.0  0.0      0     0 ?        S    18:24   0:00  \_ [jfsSync]
root     32356  0.0  0.0      0     0 ?        I    18:33   0:00  \_ [kworker/u16:1-e]
root     32400  0.0  0.0      0     0 ?        I<   18:35   0:00  \_ [kworker/u17:0-k]
root         1  0.0  0.0 225760  7244 ?        Ss   11:47   0:05 /sbin/init nosplash
root      1102  0.0  0.1 169472 54716 ?        S<s  11:48   0:04 /lib/systemd/systemd-journald
root      1119  0.0  0.0 105900  1076 ?        Ss   11:48   0:00 /sbin/lvmetad -f
root      1137  0.0  0.0  46620  4016 ?        Ss   11:48   0:02 /lib/systemd/systemd-udevd
systemd+  1571  0.0  0.0 141920  3744 ?        Ssl  11:48   0:00 /lib/systemd/systemd-timesyncd
root      1617  0.0  0.0  31972  1236 ?        S<sl 11:48   0:00 /sbin/auditd
root      1777  0.0  0.0 167492 14512 ?        Ssl  11:48   0:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
root      1778  0.0  0.0 434324  6436 ?        Ssl  11:48   0:00 /usr/sbin/ModemManager --filter-policy=strict
syslog    1779  0.0  0.0 263032  3240 ?        Ssl  11:48   0:02 /usr/sbin/rsyslogd -n
root      1780  0.0  0.0 2032460 22796 ?       Ssl  11:48   0:12 /usr/lib/snapd/snapd
root      1781  0.0  0.0  70596  4384 ?        Ss   11:48   0:00 /lib/systemd/systemd-logind
root      1782  0.0  0.0  26708  3868 ?        Ss   11:48   0:00 /usr/sbin/smartd -n
message+  1783  0.0  0.0  50372  3284 ?        Ss   11:48   0:01 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root      1792  0.0  0.0  45220  3776 ?        Ss   11:48   0:00 /sbin/wpa_supplicant -u -s -O /run/wpa_supplicant
root      1793  0.0  0.0 182968  2452 ?        Ssl  11:48   0:00 /usr/sbin/zed -F
root      1794  0.0  0.0 176916  6972 ?        Ssl  11:48   0:02 /usr/sbin/thermald --no-daemon --dbus-enable
root      1796  0.0  0.0 197024 28656 ?        Ssl  11:48   0:00 /usr/bin/python3 -Es /usr/sbin/firewalld --nofork --nopid
root      1797  0.0  0.0  28352  1988 ?        Ss   11:48   0:00 /usr/sbin/cron -f
root      1810  0.0  0.0 289416  6052 ?        Ssl  11:48   0:00 /usr/lib/policykit-1/polkitd --no-debug
systemd+  1940  0.0  0.0  80168  5336 ?        Ss   11:48   0:00 /lib/systemd/systemd-networkd
root      1942  0.0  0.0 404608 12604 ?        Ssl  11:48   0:01 /usr/sbin/NetworkManager --no-daemon
root      2109  0.0  0.0  25988  2644 ?        Ss   11:49   0:00 /sbin/dhclient -1 -4 -v -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases eth0
root      2779  0.0  0.0 1801668 24424 ?       Ssl  11:49   0:01 /usr/sbin/libvirtd
root      2782  0.0  0.0 184328 17216 ?        Ssl  11:49   0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
root      2784  0.0  0.0 441388 18156 ?        Ssl  11:49   0:24 /usr/bin/python3 /usr/bin/fail2ban-server -xf start
sciswk    2787  0.0  0.0  76876  7220 ?        Ss   11:49   0:00 /lib/systemd/systemd --user
sciswk    2798  0.0  0.0 112000  2668 ?        S    11:49   0:00  \_ (sd-pam)
root      2791  0.0  0.0  14892  1912 tty1     Ss+  11:49   0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux
root      2799  0.0  0.0  72292  4680 ?        Ss   11:49   0:00 /usr/sbin/sshd -D
root      7517  0.0  0.0 107980  6136 ?        Ss   11:51   0:00  \_ sshd: admin [priv]
admin     7562  0.0  0.0 108264  4164 ?        S    11:51   0:03      \_ sshd: admin@pts/0,pts/1,pts/2
admin     7563  0.0  0.0  20304  5172 pts/0    Ss   11:51   0:00          \_ -bash
root     31797  0.0  0.0  60576  3716 pts/0    S    18:27   0:00          |   \_ sudo su
root     31798  0.0  0.0  60080  3416 pts/0    S    18:27   0:00          |       \_ su
root     31799  0.0  0.0  18636  3452 pts/0    S    18:27   0:00          |           \_ bash
root     32653  0.0  0.0  37096  3552 pts/0    R+   18:36   0:00          |               \_ ps fauxww
admin     1093  0.0  0.0  20316  5228 pts/1    Ss+  12:33   0:00          \_ -bash
admin    18082  0.0  0.0  20412  5380 pts/2    Ss+  13:18   0:00          \_ -bash
root      3099  0.0  0.0  67372  3584 ?        Ss   11:49   0:00 /usr/lib/postfix/sbin/master -w
postfix   3101  0.0  0.0  73852  4504 ?        S    11:49   0:00  \_ qmgr -l -t unix -u
postfix  30300  0.0  0.0  73812  4600 ?        S    18:26   0:00  \_ pickup -l -t unix -u -c
netdata   3103  0.5  0.1 208268 58492 ?        Ssl  11:49   2:25 /usr/sbin/netdata -D
netdata   9530  1.1  0.0  22068  4852 ?        S    15:49   1:55  \_ /usr/lib/x86_64-linux-gnu/netdata/plugins.d/apps.plugin 1
netdata  15182  0.1  0.0  18504  3132 ?        S    17:50   0:03  \_ bash /usr/lib/x86_64-linux-gnu/netdata/plugins.d/tc-qos-helper.sh 1
libvirt+  3922  0.0  0.0  49960  1972 ?        S    11:49   0:00 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/lib/libvirt/libvirt_leaseshelper
root      3923  0.0  0.0  49932   368 ?        S    11:49   0:00  \_ /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/lib/libvirt/libvirt_leaseshelper
admin     7521  0.0  0.0  76768  7452 ?        Ss   11:51   0:00 /lib/systemd/systemd --user
admin     7522  0.0  0.0 259688  2940 ?        S    11:51   0:00  \_ (sd-pam)
admin    22937  0.0  0.0  28900  2876 ?        Ss   17:20   0:00 SCREEN -R
admin    22938  0.0  0.0  19856  4728 pts/3    Ss+  17:20   0:00  \_ /bin/bash
systemd+ 11183  0.0  0.0  70752  4980 ?        Ss   17:35   0:00 /lib/systemd/systemd-resolved
lxd      31579  0.0  0.0  49980   396 ?        S    18:26   0:00 dnsmasq --strict-order --bind-interfaces --pid-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.pid --except-interface=lo --no-ping --interface=lxdbr0 --quiet-dhcp --quiet-dhcp6 --quiet-ra --listen-address=10.197.134.1 --dhcp-no-override --dhcp-authoritative --dhcp-leasefile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.leases --dhcp-hostsfile=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.hosts --dhcp-range 10.197.134.2,10.197.134.254,1h --listen-address=fd42:52a6:f1cf:6e7a::1 --enable-ra --dhcp-range ::,constructor:lxdbr0,ra-stateless,ra-names -s lxd -S /lxd/ --conf-file=/var/snap/lxd/common/lxd/networks/lxdbr0/dnsmasq.raw -u lxd

lxd systemd-related bits:

  sys-devices-virtual-net-lxdbr0.device                                                               loaded    active   plugged   /sys/devices/virtual/net/lxdbr0                                                 
  sys-subsystem-net-devices-lxdbr0.device                                                             loaded    active   plugged   /sys/subsystem/net/devices/lxdbr0                                               
  run-snapd-ns-lxd.mnt.mount                                                                          loaded    active   mounted   /run/snapd/ns/lxd.mnt                                                           
  snap-lxd-11437.mount                                                                                loaded    active   mounted   Mount unit for lxd, revision 11437                                              
  snap-lxd-11494.mount                                                                                loaded    active   mounted   Mount unit for lxd, revision 11494                                              
  snap.lxd.activate.service                                                                           loaded    inactive dead      Service for snap application lxd.activate                                       
  snap.lxd.daemon.service                                                                             loaded    inactive dead      Service for snap application lxd.daemon                                         
  snap.lxd.daemon.unix.socket                                                                         loaded    inactive dead      Socket unix for snap application lxd.daemon  
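
A listing like the above can be obtained with something like (assumed invocation, not shown in the thread):

    systemctl list-units --all | grep -i lxd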

Nothing is using port 8443.
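
Checked with something along these lines (assumed invocation):

    sudo ss -tlnp | grep 8443    # no output means nothing is listening on 8443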

The public IP is definitely up and tied to the eth0 device, else I would not be able to SSH into this system. IP_ADDR everywhere here corresponds to the public IP. A second device, br0, is a public bridge with a /29 network; some containers have interfaces routed through it.

Network-wise everything is fine; starting containers via the lxd command directly somehow works and everything comes up perfectly, but when started via snap, lxd does not start.

/var/snap/lxd/common/lxd/logs/lxd.log

t=2019-08-06T18:26:36+0200 lvl=info msg="LXD 3.15 is starting in normal mode" path=/var/snap/lxd/common/lxd 
t=2019-08-06T18:26:36+0200 lvl=info msg="Kernel uid/gid map:"  
t=2019-08-06T18:26:36+0200 lvl=info msg=" - u 0 0 4294967295"  
t=2019-08-06T18:26:36+0200 lvl=info msg=" - g 0 0 4294967295"  
t=2019-08-06T18:26:36+0200 lvl=info msg="Configured LXD uid/gid map:"  
t=2019-08-06T18:26:36+0200 lvl=info msg=" - u 0 1000000 1000000000"  
t=2019-08-06T18:26:36+0200 lvl=info msg=" - g 0 1000000 1000000000"  
t=2019-08-06T18:26:36+0200 lvl=info msg="Kernel features:"  
t=2019-08-06T18:26:36+0200 lvl=info msg=" - netnsid-based network retrieval: yes"  
t=2019-08-06T18:26:36+0200 lvl=info msg=" - uevent injection: yes"  
t=2019-08-06T18:26:36+0200 lvl=info msg=" - seccomp listener: yes"  
t=2019-08-06T18:26:36+0200 lvl=info msg=" - unprivileged file capabilities: yes"  
t=2019-08-06T18:26:36+0200 lvl=info msg=" - shiftfs support: no"  
t=2019-08-06T18:26:36+0200 lvl=info msg="Initializing local database"  
t=2019-08-06T18:26:36+0200 lvl=info msg="Starting /dev/lxd handler:"  
t=2019-08-06T18:26:36+0200 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock 
t=2019-08-06T18:26:36+0200 lvl=info msg="REST API daemon:"  
t=2019-08-06T18:26:36+0200 lvl=info msg=" - binding Unix socket" socket=/var/snap/lxd/common/lxd/unix.socket 
t=2019-08-06T18:26:36+0200 lvl=info msg=" - binding TCP socket" socket=IP_ADDR:8443 
t=2019-08-06T18:26:36+0200 lvl=info msg="Initializing global database"  
t=2019-08-06T18:26:36+0200 lvl=info msg="Initializing storage pools"  
t=2019-08-06T18:26:36+0200 lvl=info msg="Initializing networks"  
t=2019-08-06T18:26:42+0200 lvl=eror msg="Failed to start image leftover cleanup operation" err="failed to add operation 1bcd8387-6144-4798-9e48-16adacf08edf to database: database is locked" 
t=2019-08-06T18:26:42+0200 lvl=info msg="Pruning leftover image files" 

Specifically: "failed to add operation 1bcd8387-6144-4798-9e48-16adacf08edf to database: database is locked"

Attempting to restart the whole thing results in:

sudo lxc info
Error: Get http://unix.socket/1.0: EOF

Also, there are several log files that were touched seemingly simultaneously / one after another:

ls -lah /var/snap/lxd/common/lxd/logs/

-rw-r--r--  1 root root 2.3K Aug  6 18:51 lxd.log
-rw-r--r--  1 root root 3.8K Aug  6 18:51 lxd.log.1
-rw-r--r--  1 root root  823 Aug  6 18:51 lxd.log.2.gz
-rw-r--r--  1 root root  826 Aug  6 18:51 lxd.log.3.gz
-rw-r--r--  1 root root  820 Aug  6 18:51 lxd.log.4.gz
-rw-r--r--  1 root root  825 Aug  6 18:51 lxd.log.5.gz
-rw-r--r--  1 root root  818 Aug  6 18:50 lxd.log.6.gz
-rw-r--r--  1 root root  822 Aug  6 18:50 lxd.log.7.gz

cat /var/snap/lxd/common/lxd/logs/lxd.log

t=2019-08-06T18:48:44+0200 lvl=info msg="LXD 3.15 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2019-08-06T18:48:44+0200 lvl=info msg="Kernel uid/gid map:" 
t=2019-08-06T18:48:44+0200 lvl=info msg=" - u 0 0 4294967295" 
t=2019-08-06T18:48:44+0200 lvl=info msg=" - g 0 0 4294967295" 
t=2019-08-06T18:48:44+0200 lvl=info msg="Configured LXD uid/gid map:" 
t=2019-08-06T18:48:44+0200 lvl=info msg=" - u 0 1000000 1000000000" 
t=2019-08-06T18:48:44+0200 lvl=info msg=" - g 0 1000000 1000000000" 
t=2019-08-06T18:48:44+0200 lvl=info msg="Kernel features:" 
t=2019-08-06T18:48:44+0200 lvl=info msg=" - netnsid-based network retrieval: yes" 
t=2019-08-06T18:48:44+0200 lvl=info msg=" - uevent injection: yes" 
t=2019-08-06T18:48:44+0200 lvl=info msg=" - seccomp listener: yes" 
t=2019-08-06T18:48:44+0200 lvl=info msg=" - unprivileged file capabilities: yes" 
t=2019-08-06T18:48:44+0200 lvl=info msg=" - shiftfs support: no" 
t=2019-08-06T18:48:44+0200 lvl=info msg="Initializing local database" 
t=2019-08-06T18:48:44+0200 lvl=dbug msg="Initializing database gateway" 
t=2019-08-06T18:48:44+0200 lvl=dbug msg="Start database node" address=IP_ADDR:8443 id=1
t=2019-08-06T18:48:44+0200 lvl=dbug msg="Detected stale unix socket, deleting" 
t=2019-08-06T18:48:44+0200 lvl=info msg="Starting /dev/lxd handler:" 
t=2019-08-06T18:48:44+0200 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2019-08-06T18:48:44+0200 lvl=info msg="REST API daemon:" 
t=2019-08-06T18:48:44+0200 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2019-08-06T18:48:44+0200 lvl=info msg=" - binding TCP socket" socket=IP_ADDR:8443
t=2019-08-06T18:48:44+0200 lvl=info msg="Initializing global database" 
t=2019-08-06T18:48:44+0200 lvl=dbug msg="Found cert" name=0
t=2019-08-06T18:48:44+0200 lvl=dbug msg="Found cert" name=0
t=2019-08-06T18:48:44+0200 lvl=dbug msg="Dqlite: connected address=IP_ADDR:8443 attempt=0" 
t=2019-08-06T18:48:44+0200 lvl=info msg="Initializing storage pools" 
t=2019-08-06T18:48:44+0200 lvl=dbug msg="Initializing and checking storage pool \"local\"" 
t=2019-08-06T18:48:44+0200 lvl=dbug msg="Checking ZFS storage pool \"local\"" 
t=2019-08-06T18:48:44+0200 lvl=info msg="Initializing networks" 
t=2019-08-06T18:48:49+0200 lvl=dbug msg="Dqlite: closing client" 
t=2019-08-06T18:48:49+0200 lvl=dbug msg="Database error: &errors.errorString{s:\"driver: bad connection\"}" 
t=2019-08-06T18:48:49+0200 lvl=dbug msg="Retry failed db interaction (driver: bad connection)" 
t=2019-08-06T18:48:50+0200 lvl=dbug msg="Found cert" name=0

cat /var/snap/lxd/common/lxd/logs/lxd.log.1

t=2019-08-06T18:48:51+0200 lvl=info msg="LXD 3.15 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2019-08-06T18:48:51+0200 lvl=info msg="Kernel uid/gid map:" 
t=2019-08-06T18:48:51+0200 lvl=info msg=" - u 0 0 4294967295" 
t=2019-08-06T18:48:51+0200 lvl=info msg=" - g 0 0 4294967295" 
t=2019-08-06T18:48:51+0200 lvl=info msg="Configured LXD uid/gid map:" 
t=2019-08-06T18:48:51+0200 lvl=info msg=" - u 0 1000000 1000000000" 
t=2019-08-06T18:48:51+0200 lvl=info msg=" - g 0 1000000 1000000000" 
t=2019-08-06T18:48:51+0200 lvl=info msg="Kernel features:" 
t=2019-08-06T18:48:51+0200 lvl=info msg=" - netnsid-based network retrieval: yes" 
t=2019-08-06T18:48:51+0200 lvl=info msg=" - uevent injection: yes" 
t=2019-08-06T18:48:51+0200 lvl=info msg=" - seccomp listener: yes" 
t=2019-08-06T18:48:51+0200 lvl=info msg=" - unprivileged file capabilities: yes" 
t=2019-08-06T18:48:51+0200 lvl=info msg=" - shiftfs support: no" 
t=2019-08-06T18:48:51+0200 lvl=info msg="Initializing local database" 
t=2019-08-06T18:48:51+0200 lvl=dbug msg="Initializing database gateway" 
t=2019-08-06T18:48:51+0200 lvl=dbug msg="Start database node" address=IP_ADDR:8443 id=1
t=2019-08-06T18:48:51+0200 lvl=dbug msg="Detected stale unix socket, deleting" 
t=2019-08-06T18:48:51+0200 lvl=info msg="Starting /dev/lxd handler:" 
t=2019-08-06T18:48:51+0200 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2019-08-06T18:48:51+0200 lvl=info msg="REST API daemon:" 
t=2019-08-06T18:48:51+0200 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2019-08-06T18:48:51+0200 lvl=info msg=" - binding TCP socket" socket=IP_ADDR:8443
t=2019-08-06T18:48:51+0200 lvl=info msg="Initializing global database" 
t=2019-08-06T18:48:51+0200 lvl=dbug msg="Found cert" name=0
t=2019-08-06T18:48:51+0200 lvl=dbug msg="Found cert" name=0
t=2019-08-06T18:48:51+0200 lvl=dbug msg="Dqlite: connected address=IP_ADDR:8443 attempt=0" 
t=2019-08-06T18:48:51+0200 lvl=info msg="Initializing storage pools" 
t=2019-08-06T18:48:51+0200 lvl=dbug msg="Initializing and checking storage pool \"local\"" 
t=2019-08-06T18:48:51+0200 lvl=dbug msg="Checking ZFS storage pool \"local\"" 
t=2019-08-06T18:48:51+0200 lvl=info msg="Initializing networks" 
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Dqlite: closing client" 
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Database error: &errors.errorString{s:\"driver: bad connection\"}" 
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Retry failed db interaction (driver: bad connection)" 
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Found cert" name=0
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Found cert" name=0
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Dqlite: connected address=IP_ADDR:8443 attempt=0" 
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Database error: bindings.Error{Code:5, Message:\"database is locked\"}" 
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Retry failed db interaction (database is locked)" 
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Database error: bindings.Error{Code:5, Message:\"database is locked\"}" 
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Retry failed db interaction (database is locked)" 
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Database error: bindings.Error{Code:5, Message:\"database is locked\"}" 
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Retry failed db interaction (database is locked)" 
t=2019-08-06T18:48:58+0200 lvl=dbug msg="Database error: bindings.Error{Code:5, Message:\"database is locked\"}" 
t=2019-08-06T18:48:58+0200 lvl=dbug msg="Retry failed db interaction (database is locked)" 
t=2019-08-06T18:48:58+0200 lvl=eror msg="Failed to start image leftover cleanup operation" err="failed to add operation 3560bf07-4a8b-43ba-8a16-566b5908e75a to database: database is locked"
t=2019-08-06T18:48:58+0200 lvl=info msg="Pruning leftover image files" 

rigred commented 5 years ago

Following along with sudo journalctl -u snap.lxd.daemon.service -b 0 --follow, I can see the lxd service restarting endlessly with the same output:

Aug 06 18:55:33 hetz-hel1 lxd.daemon[17678]: => Preparing the system
Aug 06 18:55:33 hetz-hel1 lxd.daemon[17678]: ==> Loading snap configuration
Aug 06 18:55:33 hetz-hel1 lxd.daemon[17678]: ==> Setting up mntns symlink (mnt:[4026532304])
Aug 06 18:55:33 hetz-hel1 lxd.daemon[17678]: ==> Setting up kmod wrapper
Aug 06 18:55:33 hetz-hel1 lxd.daemon[17678]: ==> Preparing /boot
Aug 06 18:55:33 hetz-hel1 lxd.daemon[17678]: ==> Preparing a clean copy of /run
Aug 06 18:55:33 hetz-hel1 lxd.daemon[17678]: ==> Preparing a clean copy of /etc
Aug 06 18:55:33 hetz-hel1 lxd.daemon[17678]: ==> Setting up ceph configuration
Aug 06 18:55:33 hetz-hel1 lxd.daemon[17678]: ==> Setting up LVM configuration
Aug 06 18:55:33 hetz-hel1 lxd.daemon[17678]: ==> Rotating logs
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: ==> Setting up ZFS (0.7)
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: ==> Escaping the systemd cgroups
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: ====> Detected cgroup V1
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: ==> Escaping the systemd process resource limits
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: ==> Enabling CRIU
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: ==> Disabling shiftfs on this kernel (auto)
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: => Re-using existing LXCFS
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: => Starting LXD
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg="LXD 3.15 is starting in normal mode" path=/var/snap/lxd/common/lxd
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg="Kernel uid/gid map:"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg=" - u 0 0 4294967295"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg=" - g 0 0 4294967295"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg="Configured LXD uid/gid map:"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg=" - u 0 1000000 1000000000"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg=" - g 0 1000000 1000000000"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg="Kernel features:"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg=" - netnsid-based network retrieval: yes"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg=" - uevent injection: yes"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg=" - seccomp listener: yes"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg=" - unprivileged file capabilities: yes"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg=" - shiftfs support: no"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg="Initializing local database"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=dbug msg="Initializing database gateway"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=dbug msg="Start database node" address=IP_ADDR:8443 id=1
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=dbug msg="Detected stale unix socket, deleting"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg="Starting /dev/lxd handler:"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg="REST API daemon:"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg=" - binding TCP socket" socket=IP_ADDR:8443
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg="Initializing global database"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=dbug msg="Found cert" name=0
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=dbug msg="Found cert" name=0
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=dbug msg="Dqlite: connected address=IP_ADDR:8443 attempt=0"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg="Initializing storage pools"
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=dbug msg="Initializing and checking storage pool \"local\""
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=dbug msg="Checking ZFS storage pool \"local\""
Aug 06 18:55:34 hetz-hel1 lxd.daemon[17678]: t=2019-08-06T18:55:34+0200 lvl=info msg="Initializing networks"
Aug 06 18:55:41 hetz-hel1 systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=1/FAILURE
Aug 06 18:55:41 hetz-hel1 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Aug 06 18:55:41 hetz-hel1 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
Aug 06 18:55:41 hetz-hel1 systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 73.
Aug 06 18:55:41 hetz-hel1 systemd[1]: Stopped Service for snap application lxd.daemon.
Aug 06 18:55:41 hetz-hel1 systemd[1]: Started Service for snap application lxd.daemon.
stgraber commented 5 years ago

That's really odd. I'm failing to think of anything that would explain it working when started manually but failing when started from the systemd unit...

Can you show the output of systemctl cat snap.lxd.daemon.service?

rigred commented 5 years ago
# /etc/systemd/system/snap.lxd.daemon.service
[Unit]
# Auto-generated, DO NOT EDIT
Description=Service for snap application lxd.daemon
Requires=snap-lxd-11437.mount
Wants=network.target
After=snap-lxd-11437.mount network.target
X-Snappy=yes

[Service]
ExecStart=/usr/bin/snap run lxd.daemon
SyslogIdentifier=lxd.daemon
Restart=on-failure
WorkingDirectory=/var/snap/lxd/11437
ExecStop=/usr/bin/snap run --command=stop lxd.daemon
ExecReload=/usr/bin/snap run --command=reload lxd.daemon
TimeoutStopSec=600
Type=simple

Verified to match a working install.
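
In case it's useful to anyone else hitting this, here's roughly how I did the comparison (a sketch, assuming SSH access to a known-good host; note the snap revision in Requires=/After= may legitimately differ between hosts):

# dump the local unit, then diff it against the good host's copy
systemctl cat snap.lxd.daemon.service > /tmp/lxd.unit
ssh good-host systemctl cat snap.lxd.daemon.service | diff /tmp/lxd.unit -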

stgraber commented 5 years ago

@rigred did things fail again at any point since they spontaneously started working while I was playing with the daemon?

rigred commented 5 years ago

It's been working perfectly ever since, though I can sometimes make it fail by running several daemon reloads in quick succession.

I suspect something timing-related is happening.
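
Roughly the kind of loop that triggers it for me, in case someone wants to try reproducing (a sketch; the exact timing seems to matter):

# fire several reloads back to back, then check whether the daemon survived
for i in 1 2 3 4 5; do
    systemctl reload snap.lxd.daemon.service
    sleep 1
done
journalctl -u snap.lxd.daemon.service --since "5 minutes ago" | grep -i "database is locked"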

stgraber commented 5 years ago

@freeekanayaka does a "database is locked" type error when restarting the daemon very quickly make sense to you?

I do remember seeing this behavior locally a couple of times when quickly restarting the daemon for some tests, and systemd's restart policy certainly wouldn't help us there, but I've since failed to reproduce it locally.
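
In the meantime, it may be worth taking systemd's restart loop out of the picture while debugging (a sketch using a standard drop-in override, to be reverted afterwards):

# create a drop-in that disables the automatic restart
systemctl edit snap.lxd.daemon.service
# in the editor, add:
#   [Service]
#   Restart=no
# then reload and try a single clean start
systemctl daemon-reload
systemctl restart snap.lxd.daemon.service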

freeekanayaka commented 5 years ago

@stgraber, no, restarting the daemon quickly shouldn't have any particular effect in that regard. It feels like some kind of bad cascading error:

t=2019-08-06T18:48:51+0200 lvl=dbug msg="Dqlite: connected address=IP_ADDR:8443 attempt=0" 
t=2019-08-06T18:48:51+0200 lvl=info msg="Initializing storage pools" 
t=2019-08-06T18:48:51+0200 lvl=dbug msg="Initializing and checking storage pool \"local\"" 
t=2019-08-06T18:48:51+0200 lvl=dbug msg="Checking ZFS storage pool \"local\"" 
t=2019-08-06T18:48:51+0200 lvl=info msg="Initializing networks" 
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Dqlite: closing client" 
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Database error: &errors.errorString{s:\"driver: bad connection\"}" 
t=2019-08-06T18:48:57+0200 lvl=dbug msg="Retry failed db interaction (driver: bad connection)" 

It feels like leadership is not stable for some reason, which causes a cascading error that we might not handle too well.
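
If you can catch it live, something like this should show whether the dqlite connection is churning during startup (assuming journald is collecting the snap service's output, as in the logs above):

# follow the daemon log and filter for the dqlite churn
journalctl -u snap.lxd.daemon.service -f | grep -E "Dqlite|bad connection|database is locked"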

stgraber commented 5 years ago

@rigred how are things going for you lately? Did this issue ever reappear?

rigred commented 5 years ago

Missed your comment in the clutter of my other notifications. The issue hasn't reoccurred since, which is nice yet also concerning, since I have no idea when, how, or why it might come back.

stgraber commented 5 years ago

OK, for now we'll assume this was one of the many issues we've sorted out with the changes to dqlite around the dqlite 1.0 release. If it does happen again for you or anyone else, please comment here and we'll reopen.

rigred commented 4 years ago

It has re-occurred on the same machine in the same (or a similar) manner. A reload now fails every time. It did restart on its own once, but since then absolutely nothing can bring it back up.

Also, what's the explanation for "use of closed network connection"? It doesn't make sense to me, as the connection it refers to is available.

By the way, this system isn't a production server, just a box I use for my own DIY projects.

Oct 01 21:27:47 hetz-hel1 lxd.daemon[18671]: => Preparing the system
Oct 01 21:27:47 hetz-hel1 lxd.daemon[18671]: ==> Loading snap configuration
Oct 01 21:27:47 hetz-hel1 lxd.daemon[18671]: ==> Setting up mntns symlink (mnt:[4026532357])
Oct 01 21:27:55 hetz-hel1 lxd.daemon[18671]: ==> Setting up kmod wrapper
Oct 01 21:27:55 hetz-hel1 lxd.daemon[18671]: ==> Preparing /boot
Oct 01 21:28:04 hetz-hel1 lxd.daemon[18671]: ==> Preparing a clean copy of /run
Oct 01 21:28:13 hetz-hel1 lxd.daemon[18671]: ==> Preparing a clean copy of /etc
Oct 01 21:28:58 hetz-hel1 lxd.daemon[18671]: ==> Setting up ceph configuration
Oct 01 21:28:58 hetz-hel1 lxd.daemon[18671]: ==> Setting up LVM configuration
Oct 01 21:28:58 hetz-hel1 lxd.daemon[18671]: ==> Rotating logs
Oct 01 21:28:58 hetz-hel1 lxd.daemon[18671]: ==> Setting up ZFS (0.7)
Oct 01 21:28:58 hetz-hel1 lxd.daemon[18671]: ==> Escaping the systemd cgroups
Oct 01 21:28:58 hetz-hel1 lxd.daemon[18671]: ====> Detected cgroup V1
Oct 01 21:28:58 hetz-hel1 lxd.daemon[18671]: ==> Escaping the systemd process resource limits
Oct 01 21:28:58 hetz-hel1 lxd.daemon[18671]: ==> Enabling CRIU
Oct 01 21:28:58 hetz-hel1 lxd.daemon[18671]: ==> Disabling shiftfs on this kernel (auto)
Oct 01 21:28:58 hetz-hel1 lxd.daemon[18671]: => Re-using existing LXCFS
Oct 01 21:28:58 hetz-hel1 lxd.daemon[18671]: => Starting LXD
Oct 01 21:28:58 hetz-hel1 lxd.daemon[18671]: t=2019-10-01T21:28:58+0200 lvl=warn msg="Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored."
Oct 01 21:29:12 hetz-hel1 lxd.daemon[18671]: t=2019-10-01T21:29:12+0200 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp PUBLIC_IP:48290->PUBLIC_IP:8443: use of closed network connection"
Oct 01 21:29:12 hetz-hel1 lxd.daemon[18671]: t=2019-10-01T21:29:12+0200 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@17a17: use of closed network connection"
Oct 01 21:29:17 hetz-hel1 lxd.daemon[18671]: t=2019-10-01T21:29:17+0200 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp PUBLIC_IP:48294->PUBLIC_IP:8443: use of closed network connection"
Oct 01 21:29:17 hetz-hel1 lxd.daemon[18671]: t=2019-10-01T21:29:17+0200 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@17a17: use of closed network connection"
Oct 01 21:29:22 hetz-hel1 lxd.daemon[18671]: t=2019-10-01T21:29:22+0200 lvl=eror msg="Failed to start the daemon: failed to open cluster database: Failed to set page cache size: driver: bad connection"
Oct 01 21:29:22 hetz-hel1 lxd.daemon[18671]: t=2019-10-01T21:29:22+0200 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp PUBLIC_IP:48298->PUBLIC_IP:8443: use of closed network connection"
Oct 01 21:29:22 hetz-hel1 lxd.daemon[18671]: t=2019-10-01T21:29:22+0200 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@17a17: use of closed network connection"
Oct 01 21:29:22 hetz-hel1 lxd.daemon[18671]: Error: failed to open cluster database: Failed to set page cache size: driver: bad connection
Oct 01 21:29:23 hetz-hel1 lxd.daemon[18671]: => LXD failed to start
root@hetz-hel1:~# lxd --group lxd --debug
INFO[10-01|21:34:54] LXD 3.17 is starting in normal mode      path=/var/snap/lxd/common/lxd
INFO[10-01|21:34:54] Kernel uid/gid map: 
INFO[10-01|21:34:54]  - u 0 0 4294967295 
INFO[10-01|21:34:54]  - g 0 0 4294967295 
INFO[10-01|21:34:54] Configured LXD uid/gid map: 
INFO[10-01|21:34:54]  - u 0 1000000 1000000000 
INFO[10-01|21:34:54]  - g 0 1000000 1000000000 
WARN[10-01|21:34:54] Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored. 
INFO[10-01|21:34:54] Kernel features: 
INFO[10-01|21:34:54]  - netnsid-based network retrieval: yes 
INFO[10-01|21:34:54]  - uevent injection: yes 
INFO[10-01|21:34:54]  - seccomp listener: yes 
INFO[10-01|21:34:54]  - unprivileged file capabilities: yes 
INFO[10-01|21:34:54]  - shiftfs support: yes 
INFO[10-01|21:34:54] Initializing local database 
DBUG[10-01|21:34:54] Initializing database gateway 
DBUG[10-01|21:34:54] Start database node                      id=1 address=PUBLIC_IP:8443
INFO[10-01|21:35:03] Starting /dev/lxd handler: 
INFO[10-01|21:35:03]  - binding devlxd socket                 socket=/var/snap/lxd/common/lxd/devlxd/sock
INFO[10-01|21:35:03] REST API daemon: 
INFO[10-01|21:35:03]  - binding Unix socket                   socket=/var/snap/lxd/common/lxd/unix.socket
INFO[10-01|21:35:03]  - binding TCP socket                    socket=PUBLIC_IP:8443
INFO[10-01|21:35:03] Initializing global database 
DBUG[10-01|21:35:03] Found cert                               name=0
DBUG[10-01|21:35:03] Found cert                               name=0
DBUG[10-01|21:35:03] Dqlite: connected address=PUBLIC_IP:8443 attempt=0 
WARN[10-01|21:35:08] Dqlite client proxy TLS -> Unix: read tcp PUBLIC_IP:48314->PUBLIC_IP:8443: use of closed network connection 
WARN[10-01|21:35:08] Dqlite server proxy Unix -> TLS: read unix @->@17a1f: use of closed network connection 
DBUG[10-01|21:35:08] Found cert                               name=0
DBUG[10-01|21:35:08] Found cert                               name=0
DBUG[10-01|21:35:08] Dqlite: connected address=PUBLIC_IP:8443 attempt=0 
WARN[10-01|21:35:13] Dqlite client proxy TLS -> Unix: read tcp PUBLIC_IP:48318->PUBLIC_IP:8443: use of closed network connection 
WARN[10-01|21:35:13] Dqlite server proxy Unix -> TLS: read unix @->@17a1f: use of closed network connection 
DBUG[10-01|21:35:13] Found cert                               name=0
DBUG[10-01|21:35:13] Found cert                               name=0
DBUG[10-01|21:35:13] Dqlite: connected address=PUBLIC_IP:8443 attempt=0 
EROR[10-01|21:35:18] Failed to start the daemon: failed to open cluster database: Failed to set page cache size: driver: bad connection 
INFO[10-01|21:35:18] Starting shutdown sequence 
INFO[10-01|21:35:18] Stopping REST API handler: 
INFO[10-01|21:35:18]  - closing socket                        socket=PUBLIC_IP:8443
WARN[10-01|21:35:18] Dqlite client proxy TLS -> Unix: read tcp PUBLIC_IP:48322->PUBLIC_IP:8443: use of closed network connection 
INFO[10-01|21:35:18]  - closing socket                        socket=/var/snap/lxd/common/lxd/unix.socket
WARN[10-01|21:35:18] Dqlite server proxy Unix -> TLS: read unix @->@17a1f: use of closed network connection 
INFO[10-01|21:35:18] Stopping /dev/lxd handler: 
INFO[10-01|21:35:18]  - closing socket                        socket=/var/snap/lxd/common/lxd/devlxd/sock
DBUG[10-01|21:35:18] Stop database gateway 
DBUG[10-01|21:35:18] Not unmounting temporary filesystems (containers are still running) 
Error: failed to open cluster database: Failed to set page cache size: driver: bad connection
stgraber commented 4 years ago

@freeekanayaka ideas?

I would normally suspect an IP/port conflict or firewall rules preventing the connection, but given that this is all on the local system, I'm not sure how that could be.
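
For completeness, this is roughly what I'd check to rule those out (a sketch; substitute the real address for PUBLIC_IP):

# is anything else already bound to the LXD port?
ss -tlnp | grep ':8443'
# any firewall rules that mention the port?
iptables -S | grep 8443
# is the listener reachable locally during the window where LXD has it bound?
nc -vz PUBLIC_IP 8443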