Does that prevent you from joining the node?
And how exactly did you trigger the join? lxd init or the direct API?
Yes it does, and lxd init.
Did you have it fail to join earlier for another reason?
On the working node, can you show lxd sql global "SELECT * FROM nodes;"
and lxc cluster list?
root@cluster-node1:~# lxd sql global "SELECT * FROM nodes;"
+----+---------------+-------------+--------------------+--------+----------------+--------------------------------+---------+
| id | name | description | address | schema | api_extensions | heartbeat | pending |
+----+---------------+-------------+--------------------+--------+----------------+--------------------------------+---------+
| 1 | cluster-node1 | | 10.233.30.54:8443 | 18 | 148 | 2019-10-14T21:24:31.061676485Z | 0 |
| 2 | cluster-node2 | | 10.233.30.169:8443 | 18 | 148 | 2019-10-14T20:34:52Z | 1 |
+----+---------------+-------------+--------------------+--------+----------------+--------------------------------+---------+
root@cluster-node1:~# lxc cluster list
+---------------+---------------------------+----------+--------+-------------------+
| NAME | URL | DATABASE | STATE | MESSAGE |
+---------------+---------------------------+----------+--------+-------------------+
| cluster-node1 | https://10.233.30.54:8443 | YES | ONLINE | fully operational |
+---------------+---------------------------+----------+--------+-------------------+
root@cluster-node1:~#
It may be that the client just hangs?
Ok, so that does suggest that you had a partial failure to join which is now making a bit of a mess.
Can you try lxd sql global "DELETE FROM nodes WHERE id=2;"
and then try to join again, see if you get a different error then?
We probably ought to have LXD delete any pending join when attempting to join again, but that's probably not the root of this issue.
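In general the row to clear is whichever member is still marked pending; a hedged sketch (the id comes from the SELECT output, 2 in this case):

lxd sql global "SELECT id, name, address FROM nodes WHERE pending=1;"
lxd sql global "DELETE FROM nodes WHERE id=2;"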
It kinda completely messes up my host's LXD; it can't really stop containers.
Hmm, can you provide lxd.log for both servers?
I wonder just how far it went into joining before it failed; if it got to joining the clustered database, that could be a problem indeed.
From node 1
t=2019-10-14T20:33:35+0000 lvl=info msg="LXD 3.18 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2019-10-14T20:33:35+0000 lvl=info msg="Kernel uid/gid map:"
t=2019-10-14T20:33:35+0000 lvl=info msg=" - u 0 1000000 1000000000"
t=2019-10-14T20:33:35+0000 lvl=info msg=" - g 0 1000000 1000000000"
t=2019-10-14T20:33:35+0000 lvl=info msg="Configured LXD uid/gid map:"
t=2019-10-14T20:33:35+0000 lvl=info msg=" - u 0 100000 999900000"
t=2019-10-14T20:33:35+0000 lvl=info msg=" - g 0 100000 999900000"
t=2019-10-14T20:33:35+0000 lvl=warn msg="Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored."
t=2019-10-14T20:33:35+0000 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
t=2019-10-14T20:33:35+0000 lvl=info msg="Kernel features:"
t=2019-10-14T20:33:35+0000 lvl=info msg=" - netnsid-based network retrieval: yes"
t=2019-10-14T20:33:35+0000 lvl=info msg=" - uevent injection: yes"
t=2019-10-14T20:33:35+0000 lvl=info msg=" - seccomp listener: yes"
t=2019-10-14T20:33:35+0000 lvl=info msg=" - unprivileged file capabilities: yes"
t=2019-10-14T20:33:35+0000 lvl=info msg=" - shiftfs support: no"
t=2019-10-14T20:33:35+0000 lvl=info msg="Initializing local database"
t=2019-10-14T20:33:35+0000 lvl=info msg="Starting /dev/lxd handler:"
t=2019-10-14T20:33:35+0000 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2019-10-14T20:33:35+0000 lvl=info msg="REST API daemon:"
t=2019-10-14T20:33:35+0000 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2019-10-14T20:33:35+0000 lvl=info msg="Initializing global database"
t=2019-10-14T20:33:37+0000 lvl=info msg="Initializing storage pools"
t=2019-10-14T20:33:37+0000 lvl=info msg="Initializing networks"
t=2019-10-14T20:33:37+0000 lvl=info msg="Pruning leftover image files"
t=2019-10-14T20:33:37+0000 lvl=info msg="Done pruning leftover image files"
t=2019-10-14T20:33:37+0000 lvl=info msg="Loading daemon configuration"
t=2019-10-14T20:33:38+0000 lvl=info msg="Started seccomp handler" path=/var/snap/lxd/common/lxd/seccomp.socket
t=2019-10-14T20:33:38+0000 lvl=info msg="Pruning expired images"
t=2019-10-14T20:33:38+0000 lvl=info msg="Done pruning expired images"
t=2019-10-14T20:33:38+0000 lvl=info msg="Pruning expired container backups"
t=2019-10-14T20:33:38+0000 lvl=info msg="Done pruning expired container backups"
t=2019-10-14T20:33:38+0000 lvl=info msg="Expiring log files"
t=2019-10-14T20:33:38+0000 lvl=info msg="Done expiring log files"
t=2019-10-14T20:33:38+0000 lvl=info msg="Updating images"
t=2019-10-14T20:33:38+0000 lvl=info msg="Updating instance types"
t=2019-10-14T20:33:38+0000 lvl=info msg="Done updating images"
t=2019-10-14T20:33:38+0000 lvl=info msg="Done updating instance types"
t=2019-10-14T20:34:00+0000 lvl=info msg="Update network address"
t=2019-10-14T20:34:00+0000 lvl=info msg=" - binding TCP socket" socket=10.233.30.54:8443
t=2019-10-14T20:34:00+0000 lvl=info msg="Update cluster address"
t=2019-10-14T20:34:00+0000 lvl=info msg="Creating DIR storage pool \"local\""
t=2019-10-14T20:34:00+0000 lvl=info msg="Created DIR storage pool \"local\""
t=2019-10-14T20:34:00+0000 lvl=eror msg="Failed to get leader node address: Node is not clustered"
t=2019-10-14T20:34:04+0000 lvl=info msg="Refreshing forkdns peers for lxdfan0"
t=2019-10-14T20:34:57+0000 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@000b3: use of closed network connection"
t=2019-10-14T20:35:00+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:35:10+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:35:20+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:35:30+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:35:40+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:35:50+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:36:00+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:36:10+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:36:20+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:36:30+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:36:40+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:36:50+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:37:00+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48500
t=2019-10-14T20:37:00+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48510
t=2019-10-14T20:37:00+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:37:01+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48520
t=2019-10-14T20:37:01+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48530
t=2019-10-14T20:37:01+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48540
t=2019-10-14T20:37:02+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48550
t=2019-10-14T20:37:02+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48560
t=2019-10-14T20:37:02+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48570
t=2019-10-14T20:37:03+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48580
t=2019-10-14T20:37:03+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48590
t=2019-10-14T20:37:03+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48600
t=2019-10-14T20:37:04+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48610
t=2019-10-14T20:37:04+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48620
t=2019-10-14T20:37:05+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48630
t=2019-10-14T20:37:05+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48640
t=2019-10-14T20:37:05+0000 lvl=warn msg="Rejecting request from untrusted client" ip=10.233.30.1:48650
t=2019-10-14T20:37:10+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:37:20+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:37:30+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:37:40+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:37:50+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
t=2019-10-14T20:38:00+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.169:8443:{ID:2 Address:10.233.30.169:8443}]"
From node 2
t=2019-10-14T20:34:11+0000 lvl=info msg="LXD 3.18 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2019-10-14T20:34:11+0000 lvl=info msg="Kernel uid/gid map:"
t=2019-10-14T20:34:11+0000 lvl=info msg=" - u 0 1000000 1000000000"
t=2019-10-14T20:34:11+0000 lvl=info msg=" - g 0 1000000 1000000000"
t=2019-10-14T20:34:11+0000 lvl=info msg="Configured LXD uid/gid map:"
t=2019-10-14T20:34:11+0000 lvl=info msg=" - u 0 100000 999900000"
t=2019-10-14T20:34:11+0000 lvl=info msg=" - g 0 100000 999900000"
t=2019-10-14T20:34:11+0000 lvl=warn msg="Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored."
t=2019-10-14T20:34:11+0000 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
t=2019-10-14T20:34:11+0000 lvl=info msg="Kernel features:"
t=2019-10-14T20:34:11+0000 lvl=info msg=" - netnsid-based network retrieval: yes"
t=2019-10-14T20:34:11+0000 lvl=info msg=" - uevent injection: yes"
t=2019-10-14T20:34:11+0000 lvl=info msg=" - seccomp listener: yes"
t=2019-10-14T20:34:11+0000 lvl=info msg=" - unprivileged file capabilities: yes"
t=2019-10-14T20:34:11+0000 lvl=info msg=" - shiftfs support: no"
t=2019-10-14T20:34:11+0000 lvl=info msg="Initializing local database"
t=2019-10-14T20:34:11+0000 lvl=info msg="Starting /dev/lxd handler:"
t=2019-10-14T20:34:11+0000 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2019-10-14T20:34:11+0000 lvl=info msg="REST API daemon:"
t=2019-10-14T20:34:11+0000 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2019-10-14T20:34:11+0000 lvl=info msg="Initializing global database"
t=2019-10-14T20:34:15+0000 lvl=info msg="Initializing storage pools"
t=2019-10-14T20:34:15+0000 lvl=info msg="Initializing networks"
t=2019-10-14T20:34:15+0000 lvl=info msg="Pruning leftover image files"
t=2019-10-14T20:34:15+0000 lvl=info msg="Done pruning leftover image files"
t=2019-10-14T20:34:15+0000 lvl=info msg="Loading daemon configuration"
t=2019-10-14T20:34:15+0000 lvl=info msg="Started seccomp handler" path=/var/snap/lxd/common/lxd/seccomp.socket
t=2019-10-14T20:34:15+0000 lvl=info msg="Pruning expired images"
t=2019-10-14T20:34:15+0000 lvl=info msg="Done pruning expired images"
t=2019-10-14T20:34:15+0000 lvl=info msg="Pruning expired container backups"
t=2019-10-14T20:34:15+0000 lvl=info msg="Done pruning expired container backups"
t=2019-10-14T20:34:15+0000 lvl=info msg="Updating instance types"
t=2019-10-14T20:34:15+0000 lvl=info msg="Updating images"
t=2019-10-14T20:34:15+0000 lvl=info msg="Expiring log files"
t=2019-10-14T20:34:15+0000 lvl=info msg="Done updating images"
t=2019-10-14T20:34:15+0000 lvl=info msg="Done expiring log files"
t=2019-10-14T20:34:15+0000 lvl=info msg="Done updating instance types"
t=2019-10-14T20:34:50+0000 lvl=info msg="Update network address"
t=2019-10-14T20:34:50+0000 lvl=info msg=" - binding TCP socket" socket=10.233.30.169:8443
t=2019-10-14T20:34:51+0000 lvl=info msg="Creating DIR storage pool \"local\""
t=2019-10-14T20:34:51+0000 lvl=info msg="Created DIR storage pool \"local\""
t=2019-10-14T20:34:52+0000 lvl=eror msg="Failed to get leader node address: Node is not clustered"
t=2019-10-14T20:34:52+0000 lvl=info msg="Joining dqlite raft cluster" address=10.233.30.169:8443 id=2 target=10.233.30.169:8443
t=2019-10-14T20:34:57+0000 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp 10.233.30.169:41364->10.233.30.54:8443: use of closed network connection"
t=2019-10-14T21:34:54+0000 lvl=info msg="Asked to shutdown by API, shutting down containers"
Is LXD still responding properly on node1?
Node2 is likely to be pretty confused about its state right now. You mentioned running containers; I'm assuming this wasn't on node2, as new cluster nodes must be empty.
root@cluster-node1:~# lxc list
Error: cannot fetch node config from database: driver: bad connection
No, these two nodes are running in containers.
Hmm, I'm confused, so you had containers running on node2 and then tried to have it join the cluster?
Something like this:
My PC (using LXD) -> cluster-node1 -> cluster-node2
There were no containers running on either cluster-node1 or cluster-node2
3.0.3 didn't have this problem forming clusters
Ah, okay, that makes more sense.
So the issue is really about how far things went before things failed. The worst-case scenario is if LXD managed to replicate the database to node2, increased the database quorum to 2 and then somehow failed, leaving the node in a pending state but the DB dependent on the second node.
What I'd really like to get is the initial join error though, as it can't have been the Unaccounted raft ... one; that particular error is only possible on a subsequent join failure.
If it's just a test cluster with nothing on it, can you reset both containers and recreate the original failure?
I'm currently looking at some other failure cases during join to try to make things a bit harder to break, especially around mismatch in version and the like.
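When you reproduce it, keeping the daemon log open on both members while the join runs should capture that initial error; a hedged sketch using the default snap log location:

tail -f /var/snap/lxd/common/lxd/logs/lxd.log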
Sorry, I need to go now (it's late here), but I'm pretty confident I can re-create this error tomorrow.
Thanks, that'd be great to have that error!
How to duplicate
config:
  security.nesting: "true"
description: Default LXD profile
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
name: default
used_by:
- /1.0/containers/node1
- /1.0/containers/node2
- /1.0/containers/node3
lxc launch ubuntu: node1 -p default
lxc launch ubuntu: node2 -p default
lxc launch ubuntu: node3 -p default
## On each node
lxc exec node bash
snap install lxd
lxd.migrate
=> Nothing to migrate; asks whether to uninstall the deb LXD (y/n)
=> uninstalled
lxc --version
=> 3.18
lxd init
=> See configs below (node1's preseed first, then node2's)
config:
  core.https_address: 10.233.30.213:8443
  core.trust_password: clusterPassword
networks:
- config:
    bridge.mode: fan
    fan.underlay_subnet: auto
  description: ""
  managed: false
  name: lxdfan0
  type: ""
storage_pools:
- config: {}
  description: ""
  name: local
  driver: dir
profiles:
- config: {}
  description: ""
  devices:
    eth0:
      name: eth0
      nictype: bridged
      parent: lxdfan0
      type: nic
    root:
      path: /
      pool: local
      type: disk
  name: default
cluster:
  server_name: node1
  enabled: true
  member_config: []
  cluster_address: ""
  cluster_certificate: ""
  server_address: ""
  cluster_password: ""
config: {}
networks: []
storage_pools: []
profiles: []
cluster:
  server_name: node2
  enabled: true
  member_config:
  - entity: storage-pool
    name: local
    key: source
    value: ""
    description: '"source" property for storage pool "local"'
  cluster_address: 10.233.30.213:8443
  cluster_certificate: |
    -----BEGIN CERTIFICATE-----
    Some cert
    -----END CERTIFICATE-----
  server_address: 10.233.30.75:8443
  cluster_password: clusterPassword
It hangs after the config for node 2 is printed.
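Condensed, the steps above look roughly like this (a hedged sketch; node1.yaml and node2.yaml are placeholder names for the two preseeds above, used with --preseed instead of the interactive lxd init, and the snap install / lxd.migrate prompts are answered as described):

# On the host
lxc launch ubuntu: node1 -p default
lxc launch ubuntu: node2 -p default

# Bootstrap the cluster on node1
lxc exec node1 -- snap install lxd
lxc file push node1.yaml node1/root/preseed.yaml
lxc exec node1 -- sh -c 'lxd init --preseed < /root/preseed.yaml'

# Join node2 (this is the step that hangs)
lxc exec node2 -- snap install lxd
lxc file push node2.yaml node2/root/preseed.yaml
lxc exec node2 -- sh -c 'lxd init --preseed < /root/preseed.yaml'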
The logs, I think, are interesting:
Oct 17 17:12:44 node1 lxd.daemon[1283]: t=2019-10-17T17:12:44+0000 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp 10.233.30.213:33698->10.233.30.213:8443: use of closed network connection"
Oct 17 17:12:44 node1 lxd.daemon[1283]: t=2019-10-17T17:12:44+0000 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@000af: use of closed network connection"
Oct 17 17:13:37 node1 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
Oct 17 17:13:37 node1 ovs-vsctl: message repeated 2 times: [ ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)]
Oct 17 17:13:39 node1 lxd.daemon[1283]: t=2019-10-17T17:13:39+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.75:8443:{ID:2 Address:10.233.30.75:8443}]"
Oct 17 17:13:43 node1 lxd.daemon[1283]: t=2019-10-17T17:13:43+0000 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@000af: use of closed network connection"
Oct 17 17:13:49 node1 lxd.daemon[1283]: t=2019-10-17T17:13:49+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.75:8443:{ID:2 Address:10.233.30.75:8443}]"
Oct 17 17:13:59 node1 lxd.daemon[1283]: t=2019-10-17T17:13:59+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.233.30.75:8443:{ID:2 Address:10.233.30.75:8443}]"
The whole log up until the lines above:
Oct 17 17:12:13 node1 systemd[1]: system.slice: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: cron.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: console-setup.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: systemd-resolved.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: Failed to set devices.allow on /system.slice/systemd-resolved.service: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: message repeated 7 times: [ Failed to set devices.allow on /system.slice/systemd-resolved.service: Operation not permitted]
Oct 17 17:12:13 node1 systemd[1]: ebtables.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: cloud-final.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: systemd-user-sessions.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: systemd-sysctl.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: unattended-upgrades.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: systemd-networkd.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: systemd-udev-trigger.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: snapd.socket: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: polkit.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: apport.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: networkd-dispatcher.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: systemd-journald.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: systemd-tmpfiles-setup.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: blk-availability.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: sys-fs-fuse-connections.mount: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: snapd.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: console-getty.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: systemd-update-utmp.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: systemd-tmpfiles-setup-dev.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: snap-core-7917.mount: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: ssh.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: rsyslog.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: systemd-udevd.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: apparmor.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: systemd-networkd-wait-online.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: snap.lxd.daemon.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: ufw.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: accounts-daemon.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: atd.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: systemd-journal-flush.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: cloud-init-local.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: snap.mount: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: cloud-init.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: snap-lxd-12211.mount: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: cloud-config.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: user.slice: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: dbus.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: keyboard-setup.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: system-getty.slice: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: snapd.seeded.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:13 node1 systemd[1]: systemd-logind.service: Failed to reset devices.list: Operation not permitted
Oct 17 17:12:38 node1 networkd-dispatcher[209]: WARNING:Unknown index 2 seen, reloading interface list
Oct 17 17:12:38 node1 systemd-udevd[2461]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Oct 17 17:12:38 node1 systemd-udevd[2461]: Could not generate persistent MAC address for lxdfan0: No such file or directory
Oct 17 17:12:38 node1 networkd-dispatcher[209]: WARNING:Unknown index 3 seen, reloading interface list
Oct 17 17:12:38 node1 systemd-udevd[2464]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Oct 17 17:12:38 node1 systemd-udevd[2464]: Could not generate persistent MAC address for lxdfan0-mtu: No such file or directory
Oct 17 17:12:38 node1 systemd-networkd[180]: lxdfan0-mtu: Gained carrier
Oct 17 17:12:38 node1 systemd-networkd[180]: lxdfan0-mtu: Gained IPv6LL
Oct 17 17:12:38 node1 networkd-dispatcher[209]: WARNING:Unknown index 4 seen, reloading interface list
Oct 17 17:12:38 node1 systemd-udevd[2503]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Oct 17 17:12:38 node1 systemd-udevd[2503]: Could not generate persistent MAC address for lxdfan0-fan: No such file or directory
Oct 17 17:12:38 node1 systemd-networkd[180]: lxdfan0-fan: Gained carrier
Oct 17 17:12:39 node1 dnsmasq[2515]: started, version 2.75 cachesize 150
Oct 17 17:12:39 node1 dnsmasq[2515]: compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua TFTP conntrack ipset auth DNSSEC loop-detect inotify
Oct 17 17:12:39 node1 dnsmasq[2515]: LOUD WARNING: listening on 240.213.0.1 may accept requests via interfaces other than lxdfan0
Oct 17 17:12:39 node1 dnsmasq[2515]: LOUD WARNING: use --bind-dynamic rather than --bind-interfaces to avoid DNS amplification attacks via these interface(s)
Oct 17 17:12:39 node1 dnsmasq-dhcp[2515]: DHCP, IP range 240.213.0.2 -- 240.213.0.254, lease time 1h
Oct 17 17:12:39 node1 dnsmasq-dhcp[2515]: DHCP, sockets bound exclusively to interface lxdfan0
Oct 17 17:12:39 node1 dnsmasq[2515]: using nameserver 240.213.0.1#1053 for domain 240.in-addr.arpa
Oct 17 17:12:39 node1 dnsmasq[2515]: using nameserver 240.213.0.1#1053 for domain lxd
Oct 17 17:12:39 node1 dnsmasq[2515]: reading /etc/resolv.conf
Oct 17 17:12:39 node1 dnsmasq[2515]: using nameserver 240.213.0.1#1053 for domain 240.in-addr.arpa
Oct 17 17:12:39 node1 dnsmasq[2515]: using nameserver 240.213.0.1#1053 for domain lxd
Oct 17 17:12:39 node1 dnsmasq[2515]: using nameserver 127.0.0.53#53
Oct 17 17:12:39 node1 dnsmasq[2515]: read /etc/hosts - 7 addresses
Oct 17 17:12:39 node1 dnsmasq[2515]: read /etc/hosts - 7 addresses
Oct 17 17:12:39 node1 lxd-forkdns[2520]: t=2019-10-17T17:12:39+0000 lvl=info msg=Started
Oct 17 17:12:39 node1 lxd-forkdns[2520]: t=2019-10-17T17:12:39+0000 lvl=info msg="Server list loaded: []"
Oct 17 17:12:39 node1 lxd.daemon[1283]: t=2019-10-17T17:12:39+0000 lvl=eror msg="Failed to get leader node address: Node is not clustered"
Oct 17 17:12:39 node1 systemd-networkd[180]: lxdfan0: Gained carrier
Oct 17 17:12:39 node1 dnsmasq[2515]: reading /etc/resolv.conf
Oct 17 17:12:39 node1 dnsmasq[2515]: using nameserver 240.213.0.1#1053 for domain 240.in-addr.arpa
Oct 17 17:12:39 node1 dnsmasq[2515]: using nameserver 240.213.0.1#1053 for domain lxd
Oct 17 17:12:39 node1 dnsmasq[2515]: using nameserver 127.0.0.53#53
Oct 17 17:12:40 node1 systemd-networkd[180]: lxdfan0-fan: Gained IPv6LL
Oct 17 17:12:40 node1 dnsmasq[2515]: reading /etc/resolv.conf
Oct 17 17:12:40 node1 dnsmasq[2515]: using nameserver 240.213.0.1#1053 for domain 240.in-addr.arpa
Oct 17 17:12:40 node1 dnsmasq[2515]: using nameserver 240.213.0.1#1053 for domain lxd
Oct 17 17:12:40 node1 dnsmasq[2515]: using nameserver 127.0.0.53#53
Oct 17 17:12:41 node1 systemd-networkd[180]: lxdfan0: Gained IPv6LL
root@node1:~# lxd sql global "select * from Nodes;";
+----+-------+-------------+--------------------+--------+----------------+--------------------------------+---------+
| id | name | description | address | schema | api_extensions | heartbeat | pending |
+----+-------+-------------+--------------------+--------+----------------+--------------------------------+---------+
| 1 | node1 | | 10.233.30.213:8443 | 18 | 148 | 2019-10-17T17:31:39.278363379Z | 0 |
| 2 | node2 | | 10.233.30.75:8443 | 18 | 148 | 2019-10-17T17:13:37Z | 1 |
+----+-------+-------------+--------------------+--------+----------------+--------------------------------+---------+
Thanks. I'm on vacation this week but will take another look at this early next week.
Enjoy 😊
@turtle0x1 can you easily test with the edge build (snap install lxd --edge)?
I've done some initial testing and edge doesn't appear affected, that or I just got lucky.
Yeah, it will be a couple of days at this point though!
What's odd is that we have daily tests for clustering which pretty much do the exact same thing you are doing, and I don't believe I've seen such hangs in those tests, so I wonder what the actual condition is which causes that issue.
It could be lots of things; messing around with clustering required a lot of purging LXD and reinstalling.
@turtle0x1 did you have a chance to poke at this some more?
No, sorry, I moved on; clustering is way too annoying! We can close this if you want.
I'm running into the exact same problem while installing a new LXD cluster with the same Ansible scripts I've been using for a long time now.
Having this problem when installing from 3.17/stable & 3.18/stable.
Trying edge now. Will report back.
Interesting, I've seen this hit once but haven't been able to reproduce it on edge; it's also never hit our daily cluster testing, which automatically assembles an 8-node cluster using something very similar to what @turtle0x1 described. If we can get some kind of reliable reproducer we can use on our end, then we can track this down.
The message Unaccounted raft nodes...
isn't really an issue by itself; it just indicates that a raft member was added that isn't yet listed in the global database. This is effectively just a race between the heartbeat sending function and the join/promotion logic. The message is logged so we can track down inconsistencies longer term, but it isn't itself an error and so shouldn't cause the hangs described here.
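One hedged way to see the inconsistency the message is describing is to compare the raft membership in the local database with the members table in the global one:

lxd sql local "SELECT * FROM raft_nodes;"
lxd sql global "SELECT id, name, address, pending FROM nodes;"

Any address that shows up in raft_nodes but is missing from nodes (or presumably still marked pending there) is what the heartbeat code is complaining about.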
Alright, edge is reporting the same problems. Both node2 & node3 are joining node1's cluster via lxd init --preseed.
Node1:
t=2019-10-29T14:34:57+0000 lvl=info msg="Creating ZFS storage pool \"zfs\""
t=2019-10-29T14:35:00+0000 lvl=info msg="Created ZFS storage pool \"zfs\""
t=2019-10-29T14:35:01+0000 lvl=info msg="Creating CEPH OSD storage pool \"ceph\" in cluster \"ceph\""
t=2019-10-29T14:35:06+0000 lvl=info msg="Created CEPH OSD storage pool \"ceph\" in cluster \"ceph\""
t=2019-10-29T14:35:06+0000 lvl=eror msg="Failed to get leader node address: Node is not clustered"
t=2019-10-29T14:35:12+0000 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp 10.254.0.4:50258->10.254.0.4:8443: use of closed network connection"
t=2019-10-29T14:35:12+0000 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@0001e: use of closed network connection"
t=2019-10-29T14:35:56+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.254.0.3:8443:{ID:2 Address:10.254.0.3:8443}]"
t=2019-10-29T14:35:57+0000 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@0001e: use of closed network connection"
t=2019-10-29T14:35:57+0000 lvl=eror msg="Empty raft node set received"
t=2019-10-29T14:36:01+0000 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@0001e: use of closed network connection"
t=2019-10-29T14:36:06+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.254.0.3:8443:{ID:2 Address:10.254.0.3:8443}]"
t=2019-10-29T14:36:16+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.254.0.3:8443:{ID:2 Address:10.254.0.3:8443}]"
t=2019-10-29T14:36:26+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.254.0.3:8443:{ID:2 Address:10.254.0.3:8443}]"
t=2019-10-29T14:36:36+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.254.0.3:8443:{ID:2 Address:10.254.0.3:8443}]"
t=2019-10-29T14:36:46+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.254.0.3:8443:{ID:2 Address:10.254.0.3:8443}]"
t=2019-10-29T14:36:56+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.254.0.3:8443:{ID:2 Address:10.254.0.3:8443}]"
t=2019-10-29T14:37:06+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.254.0.3:8443:{ID:2 Address:10.254.0.3:8443}]"
node2:
t=2019-10-29T14:35:48+0000 lvl=info msg="Creating ZFS storage pool \"zfs\""
t=2019-10-29T14:35:55+0000 lvl=info msg="Created ZFS storage pool \"zfs\""
t=2019-10-29T14:35:55+0000 lvl=eror msg="Failed to get leader node address: Node is not clustered"
t=2019-10-29T14:35:56+0000 lvl=info msg="Joining dqlite raft cluster" address=10.254.0.3:8443 id=2 target=10.254.0.3:8443
t=2019-10-29T14:35:57+0000 lvl=eror msg="Empty raft node set received"
t=2019-10-29T14:36:01+0000 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp 10.254.0.3:55172->10.254.0.4:8443: use of closed network connection"
node3:
t=2019-10-29T14:35:48+0000 lvl=info msg="Creating ZFS storage pool \"zfs\""
t=2019-10-29T14:35:56+0000 lvl=info msg="Created ZFS storage pool \"zfs\""
t=2019-10-29T14:35:56+0000 lvl=eror msg="Failed to get leader node address: Node is not clustered"
t=2019-10-29T14:35:56+0000 lvl=info msg="Joining dqlite raft cluster" address=10.254.0.5:8443 id=3 target=10.254.0.5:8443
t=2019-10-29T14:35:56+0000 lvl=info msg="Migrate local data to cluster database"
t=2019-10-29T14:35:57+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.254.0.3:8443:{ID:2 Address:10.254.0.3:8443} 10.254.0.4:8443:{ID:1 Address:10.254.0.4:8443} 10.254.0.5:8443:{ID:3 Address:10.254.0.5:8443}]"
t=2019-10-29T14:35:57+0000 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp 10.254.0.5:58170->10.254.0.4:8443: use of closed network connection"
t=2019-10-29T14:35:57+0000 lvl=eror msg="Failed to retrieve the information of leader member: No such object"
Node1 global nodes list:
+----+------------+-------------+-----------------+--------+----------------+--------------------------------+---------+
| id | name | description | address | schema | api_extensions | heartbeat | pending |
+----+------------+-------------+-----------------+--------+----------------+--------------------------------+---------+
| 1 | alpha-0001 | | 10.254.0.4:8443 | 18 | 152 | 2019-10-29T14:40:16.928006527Z | 0 |
| 2 | alpha-0002 | | 10.254.0.3:8443 | 18 | 152 | 2019-10-29T14:35:55Z | 1 |
| 3 | alpha-0003 | | 10.254.0.5:8443 | 18 | 152 | 2019-10-29T14:40:16.928437736Z | 0 |
+----+------------+-------------+-----------------+--------+----------------+--------------------------------+---------+
At this moment I'm unable to set up a working LXD cluster, while I've been able to set up a working cluster in the past with the exact same Ansible scripts. Please let me know if I can supply more information.
@suprasoil can you provide the answers you used for lxd init on node1 and the preseed you used for the other two?
Also are those physical systems or are they LXD containers themselves?
@suprasoil what kind of storage speed are you dealing with? I believe the only time I managed to reproduce this was on a reasonably slow system. Everything else I'm playing with has stupidly fast NVMe SSDs (including our CI runs) and so didn't seem to hit it.
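A rough, hedged way to compare is to measure synchronous write latency on the path the LXD database lives under (the default snap location; remove the test file afterwards):

dd if=/dev/zero of=/var/snap/lxd/common/lxd/database/dd-test bs=4k count=500 oflag=dsync
rm /var/snap/lxd/common/lxd/database/dd-test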
node1 config:
config:
  core.https_address: 10.254.0.4:8443
  core.trust_password: secret
cluster:
  server_name: alpha-0001
  enabled: true
  member_config: []
  cluster_address: ""
  cluster_certificate: ""
  server_address: ""
  cluster_password: ""
networks: []
storage_pools:
- config:
    source: forest/lxd
  description: ""
  name: zfs
  driver: zfs
- config:
    ceph.cluster_name: ceph
    ceph.osd.pg_num: "128"
    ceph.osd.pool_name: lxd
  description: ""
  name: ceph
  driver: ceph
profiles:
- config:
    boot.autostart: "1"
    raw.idmap: both 1337 0
  description: clients LXD profile
  devices:
    eth0:
      nictype: bridged
      parent: clients
      type: nic
    root:
      path: /
      pool: zfs
      type: disk
  name: default
Node2 & 3 (although with different address & name) config:
cluster:
  server_name: alpha-0002
  enabled: true
  cluster_address: 10.254.0.4:8443
  cluster_certificate: |
    -----BEGIN CERTIFICATE-----
    removed
    -----END CERTIFICATE-----
  server_address: 10.254.0.3:8443
  cluster_password: secret
storage_pools:
- config:
    source: forest/lxd
  description: ""
  name: zfs
  driver: zfs
Hosts are KVM/qemu virtual machines running from a SATA600 SSD.
Would you have any tips on how to work around this problem? I'm currently unable to set up a cluster.
I've tried setting it up automatically and manually; both result in this problem.
Tested setting up a cluster manually on 3 fresh Ubuntu 18.04 hosts with the latest LXD snap, which seems to be working fine; will investigate further.
They are running 3.0.3 though, right? Because that works fine in my testing.
Nope, they're running the latest snap; edited comment.
Every time I try this with my 3 test nodes, the first node initializes just fine and one of the other nodes connects just fine to the first node. It's always the last node that won't connect and causes the perpetual "unaccounted raft node(s) not found" errors.
I should mention the nodes are running off a ZFS root filesystem, could that have something to do with this? From what I could find, it seems SQLite could be quite a lot slower on ZFS.
I've been setting up LXD clusters on the same ZFS setup for over a year now though, and haven't encountered this before.
@suprasoil the YAML preseed that you pasted for node 2/3 seems wrong:
cluster:
  server_name: alpha-0002
  enabled: true
  cluster_address: 10.254.0.4:8443
  cluster_certificate: |
    -----BEGIN CERTIFICATE-----
    removed
    -----END CERTIFICATE-----
  server_address: 10.254.0.3:8443
  cluster_password: secret
storage_pools:
- config:
    source: forest/lxd
  description: ""
  name: zfs
  driver: zfs
You should only have a cluster section there, and provide storage-related configuration options in member_config. The YAML you are using might have worked with 3.0.3, but the format has changed since then.
Please see https://lxd.readthedocs.io/en/latest/clustering/#preseed
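Concretely, the joining preseed should have roughly this shape (a hedged sketch reusing the values from your paste):

cluster:
  server_name: alpha-0002
  enabled: true
  cluster_address: 10.254.0.4:8443
  cluster_certificate: |
    -----BEGIN CERTIFICATE-----
    removed
    -----END CERTIFICATE-----
  server_address: 10.254.0.3:8443
  cluster_password: secret
  member_config:
  - entity: storage-pool
    name: zfs
    key: source
    value: forest/lxd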
After looking through the clustering logic and seeing that we're looking up nodes from the raft_nodes table, I discovered that none of my nodes have a raft_nodes table.
root@alpha-0001:/# lxd sql global "select name from sqlite_master where type='table';"
+------------------------------------+
| name |
+------------------------------------+
| schema |
| sqlite_sequence |
| certificates |
| config |
| images |
| images_aliases |
| images_nodes |
| images_properties |
| images_source |
| instances |
| instances_backups |
| instances_config |
| instances_devices |
| instances_devices_config |
| instances_profiles |
| instances_snapshots |
| instances_snapshots_config |
| instances_snapshots_devices |
| instances_snapshots_devices_config |
| networks |
| networks_config |
| networks_nodes |
| nodes |
| operations |
| profiles |
| profiles_config |
| profiles_devices |
| profiles_devices_config |
| projects |
| projects_config |
| storage_pools |
| storage_pools_config |
| storage_pools_nodes |
| storage_volumes |
| storage_volumes_config |
+------------------------------------+
@freeekanayaka I will look into that and report back.
Probably you need something like:
member_config:
- entity: storage-pool
  name: zfs
  key: source
  value: forest/lxd
Also, FYI the raft_nodes table is in the local database:
lxd sql local "SELECT * FROM raft_nodes"
Also, FYI the raft_nodes table is in the local database:
Doh, could have guessed that.
the YAML preseed that you pasted for node 2/3 seems wrong:
I've tried setting it up manually using the lxd init command as well; that also results in this problem.
Now using this config:
cluster:
  server_name: alpha-0002
  enabled: true
  cluster_address: 10.254.0.4:8443
  cluster_certificate: |
    -----BEGIN CERTIFICATE-----
    -----END CERTIFICATE-----
  server_address: 10.254.0.3:8443
  cluster_password: pass
  member_config:
  - entity: storage-pool
    name: zfs
    driver: zfs
    key: source
    value: "forest/lxd"
First node is again logging these errors:
t=2019-11-06T13:36:07+0000 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[10.254.0.5:8443:{ID:2 Address:10.254.0.5:8443}]"
Raft nodes table on first node:
root@alpha-0001:/# lxd sql local "select * from raft_nodes;"
+----+-----------------+
| id | address |
+----+-----------------+
| 1 | 10.254.0.4:8443 |
| 2 | 10.254.0.5:8443 |
+----+-----------------+
Nodes table on first node:
root@alpha-0001:/# lxd sql global "select * from nodes;"
+----+------------+-------------+-----------------+--------+----------------+--------------------------------+---------+
| id | name | description | address | schema | api_extensions | heartbeat | pending |
+----+------------+-------------+-----------------+--------+----------------+--------------------------------+---------+
| 1 | alpha-0001 | | 10.254.0.4:8443 | 17 | 143 | 2019-11-06T13:37:09.434069101Z | 0 |
| 2 | alpha-0003 | | 10.254.0.5:8443 | 17 | 143 | 2019-11-06T13:33:53Z | 1 |
| 3 | alpha-0002 | | 10.254.0.3:8443 | 17 | 143 | 2019-11-06T13:37:09.43430051Z | 0 |
+----+------------+-------------+-----------------+--------+----------------+--------------------------------+---------+
Also, node 2 is marked as part of the cluster, but its address is not in the nodes list:
root@alpha-0001:/# lxc cluster list
+------------+-------------------------+----------+--------+-------------------+
| NAME | URL | DATABASE | STATE | MESSAGE |
+------------+-------------------------+----------+--------+-------------------+
| alpha-0001 | https://10.254.0.4:8443 | YES | ONLINE | fully operational |
+------------+-------------------------+----------+--------+-------------------+
| alpha-0002 | https://10.254.0.3:8443 | NO | ONLINE | fully operational |
+------------+-------------------------+----------+--------+-------------------+
If you notice, the pending column for alpha-0003 in the nodes table is set to 1, suggesting that it did not successfully join. That's the part you need to check: what error did it throw (if any)? What do you see in the logs?
Same result as this: https://github.com/lxc/lxd/issues/6313#issuecomment-547452963
t=2019-11-06T14:25:31+0000 lvl=eror msg="Failed to get leader node address: Node is not clustered"
t=2019-11-06T14:25:31+0000 lvl=info msg="Joining dqlite raft cluster" address=10.254.0.5:8443 id=2 target=10.254.0.5:8443
t=2019-11-06T14:25:35+0000 lvl=eror msg="Empty raft node set received"
t=2019-11-06T14:25:36+0000 lvl=warn msg="Dqlite client proxy TLS -> Unix: read tcp 10.254.0.5:45070->10.254.0.4:8443: use of closed network connection"
I've been trying this from snapshots a lot of times now, always getting these errors. One try ago, though, setting up the cluster suddenly worked, so this problem seems performance-related (ZFS?).
Required information
Issue description
When I try to create a cluster in LXD containers, I seem to keep running into the below when I try to add the second node.