Closed: RVCU closed this issue 5 years ago.
@freeekanayaka can you take a look?
There is actually some chance that this issue might have been fixed by #5898. Please could you make sure your edge snap is up to date on all nodes (snap refresh lxd) and see if the problem happens again?
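After refreshing, the lxd snap revision should be identical on every cluster member before retesting. A self-contained sketch of that check: the two data lines below stand in for `snap list lxd` output collected from each node (assumption: in practice you would gather one line per node, e.g. over ssh), and the third whitespace-separated field is the snap revision.

```shell
# Sample `snap list lxd` data lines, one per node (placeholder values).
revs='lxd git-d4623dc 11071 edge canonical* -
lxd git-d4623dc 11071 edge canonical* -'
# Count distinct revisions; exactly one means all nodes are in sync.
n=$(printf '%s\n' "$revs" | awk '{print $3}' | sort -u | wc -l)
if [ "$n" -eq 1 ]; then echo "revisions match"; else echo "revision mismatch"; fi
```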
@RVCU please let us know if current edge still shows that problem for you
Hi, all my nodes updated over the weekend
lxd git-d4623dc 11071 edge canonical✓ -
So I am going to put the cluster back together and will let you know if the error persists.
I was able to recreate the issue by adding 5 nodes back to the cluster and running some containers.
Error: Local node name: not leader
lxd_admin@lxd-cloud-01:~$ lxc ls
Error: cannot fetch node config from database: not leader
from the LXD log on one of the database nodes
t=2019-07-04T15:04:35+0000 lvl=dbug msg=Handling ip=@ method=GET url="/1.0/containers?recursion=2" user=
t=2019-07-04T15:04:35+0000 lvl=dbug msg="Database error: Local node name: not leader"
t=2019-07-04T15:04:35+0000 lvl=dbug msg="DBERR: containersGet: error \"Local node name: not leader\""
t=2019-07-04T15:04:35+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"Local node name: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"
t=2019-07-04T15:04:35+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: not leader\"}"
t=2019-07-04T15:04:35+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"
t=2019-07-04T15:04:36+0000 lvl=dbug msg="Database error: Local node name: not leader"
t=2019-07-04T15:04:36+0000 lvl=eror msg="Failed to load containers for scheduled snapshots" err="Local node name: not leader"
t=2019-07-04T15:04:36+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:36+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:37+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:37+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:38+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:38+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:39+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:39+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:40+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:40+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:41+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:41+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:42+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:42+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:43+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:43+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"
Also, I don't seem to run into this error until I add more than 3 nodes to the cluster, though that could just be a red herring, as I only ran a couple of tests with just 3 nodes in the cluster.
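One quick way to gauge how often the error recurs is to count "not leader" occurrences in the daemon log. The snippet below embeds two sample log lines so it runs standalone; on a real node you would point grep at the LXD log file instead (under the snap's common/lxd/logs directory — the exact path may vary by install).

```shell
# Two sample lines from the debug log above (embedded for self-containment).
log='t=2019-07-04T15:04:36+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"
t=2019-07-04T15:04:37+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"'
# Count how many lines mention the error.
count=$(printf '%s\n' "$log" | grep -c 'not leader')
echo "$count"
```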
Some more logs, after restarting one of the db nodes:
t=2019-07-04T15:14:21+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: not leader\"}"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: not leader\"}"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: not leader\"}"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: not leader\"}"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: not leader\"}"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"
t=2019-07-04T15:14:21+0000 lvl=warn msg="Failed to rollback transaction after error (driver: bad connection): cannot rollback - no transaction is active"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"driver: bad connection\"}"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Retry failed db interaction (driver: bad connection)"
t=2019-07-04T15:14:21+0000 lvl=warn msg="Failed to rollback transaction after error (driver: bad connection): cannot rollback - no transaction is active"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"driver: bad connection\"}"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Retry failed db interaction (driver: bad connection)"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"
t=2019-07-04T15:14:21+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: not leader\"}"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"
t=2019-07-04T15:14:21+0000 lvl=warn msg="Failed to rollback transaction after error (driver: bad connection): cannot rollback - no transaction is active"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"driver: bad connection\"}"
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Retry failed db interaction (driver: bad connection)"
But interestingly enough, it seems to know which containers are on which host when I use the lxc cluster remove command:
$ lxc cluster remove lxd-cloud-05
Error: Node still has the following containers: automation-2-lxd-reporting, automation-2-lxd-ambari, automation-4-lxd-master, automation-4-lxd-reporting, automation-4-lxd-search
@freeekanayaka ^
I merged a branch into dqlite that should hopefully fix this (see https://github.com/CanonicalLtd/dqlite/pull/136), or at least make LXD retry the transaction (as it's supposed to do in these cases). It will be built by our CI in the next few hours and automatically included in a new revision of the edge snap.
It's still a bit weird that you get this error in the first place. When is the error happening? Right after adding a new node? Or even after the cluster has been stable for a while and no nodes have been added or removed?
If you get this error while the cluster is stable, it might be worth investigating further, since it means the cluster leader was not able to maintain leadership, most probably because it could not communicate with the rest of the nodes quickly enough. So it would be either a network issue on your end, or the default timeouts in LXD are too aggressive and cause spurious leader demotions.
@freeekanayaka I get the error more frequently when I add new nodes, but it also happens after the cluster has been stable for some time. I believe it's most likely network issues on my end, as I am noticing dropped connections to and between the cluster nodes now. Is there a way to configure the timeouts so I can see if that helps?
I appreciate the help, I'll update to the latest edge when it's available.
@RVCU unfortunately those timeouts are currently not exposed.
First of all, please let me know if the latest edge update solves the issue (in terms of handling these disconnections gracefully).
Secondly, my goal was to somehow be able to adjust those timeouts automatically, based on error statistics collected at runtime (e.g. make the timeouts less aggressive if too many errors happen over a certain period of time). But I don't think I will have time to work on that soon.
At the moment these timeouts are: 3 seconds before a node decides that the leader has died (because it has received no message from it). The leader sends a message to every other node every 500 milliseconds, so you need 6 consecutive lost messages over 3 seconds to cause disruption. These values are quite conservative and were picked to tolerate most network setups. I'm really surprised they are not relaxed enough in your case. Could you elaborate on why your network would fail to sustain such a load? Is it extremely saturated?
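The heartbeat budget described above can be made explicit with a little arithmetic: with a 500 ms heartbeat interval and a 3 second election timeout, 6 consecutive heartbeats must be lost before a follower gives up on the leader and triggers an election.

```shell
# Values quoted in the thread (not configurable in LXD at this point).
heartbeat_ms=500          # leader contacts every follower at this interval
election_timeout_ms=3000  # a follower declares the leader dead after this much silence
# Number of consecutive heartbeats that must be lost to cause a demotion.
missed=$((election_timeout_ms / heartbeat_ms))
echo "$missed"
```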
@freeekanayaka I updated to the latest edge and had good results for about a day before the problem recurred. Now LXD is stuck in this loop indefinitely on one of the db nodes:
t=2019-07-09T01:15:31+0000 lvl=dbug msg="Found cert" name=0
t=2019-07-09T01:15:31+0000 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:172.18.1.33:8443} {ID:2 Address:172.18.0.182:8443} {ID:3 Address:172.18.0.116:8443}]"
t=2019-07-09T01:15:47+0000 lvl=dbug msg="Found cert" name=0
t=2019-07-09T01:15:47+0000 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:172.18.1.33:8443} {ID:2 Address:172.18.0.182:8443} {ID:3 Address:172.18.0.116:8443}]"
I don't think the network is being saturated, but we are experiencing some dropped connections that might be putting the db in a bad state and it is failing to recover afterwards.
The current failure is such that only one of my db nodes is failing to recover, and it even fails to restart with sudo snap restart lxd. The other nodes seem to be fine and show the failed node (lxd-cloud-01) as online:
+--------------+---------------------------+----------+--------+-------------------+
| NAME | URL | DATABASE | STATE | MESSAGE |
+--------------+---------------------------+----------+--------+-------------------+
| lxd-cloud-01 | https://172.18.1.33:8443 | YES | ONLINE | fully operational |
+--------------+---------------------------+----------+--------+-------------------+
| lxd-cloud-02 | https://172.18.0.116:8443 | YES | ONLINE | fully operational |
+--------------+---------------------------+----------+--------+-------------------+
| lxd-cloud-03 | https://172.18.0.182:8443 | YES | ONLINE | fully operational |
+--------------+---------------------------+----------+--------+-------------------+
| lxd-cloud-04 | https://172.18.0.208:8443 | NO | ONLINE | fully operational |
+--------------+---------------------------+----------+--------+-------------------+
| lxd-cloud-05 | https://172.18.0.101:8443 | NO | ONLINE | fully operational |
+--------------+---------------------------+----------+--------+-------------------+
| lxd-cloud-06 | https://172.18.0.134:8443 | NO | ONLINE | fully operational |
+--------------+---------------------------+----------+--------+-------------------+
but the containers running on lxd-cloud-01 show up as STATE=ERROR when I run lxc ls
+----------------------------+---------+----------------------+------+------------+-----------+--------------+
|            NAME            |  STATE  |         IPV4         | IPV6 |    TYPE    | SNAPSHOTS |   LOCATION   |
+----------------------------+---------+----------------------+------+------------+-----------+--------------+
| automation-3-lxd-master | ERROR | | | PERSISTENT | | lxd-cloud-01 |
+----------------------------+---------+----------------------+------+------------+-----------+--------------+
| automation-3-lxd-reporting | RUNNING | 240.0.116.89 (eth0) | | PERSISTENT | | lxd-cloud-02 |
+----------------------------+---------+----------------------+------+------------+-----------+--------------+
| automation-3-lxd-search | ERROR | | | PERSISTENT | | lxd-cloud-01 |
+----------------------------+---------+----------------------+------+------------+-----------+--------------+
| automation-3-lxd-stream | RUNNING | 240.0.116.137 (eth0) | | PERSISTENT | | lxd-cloud-02 |
+----------------------------+---------+----------------------+------+------------+-----------+--------------+
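A small triage sketch for the situation above: pull the name and location of every ERROR-state container out of the `lxc ls` table output, to see at a glance which member they all sit on. Two sample rows are embedded so the snippet runs standalone; in practice you would pipe real `lxc ls` output through the awk instead.

```shell
# Two sample `lxc ls` rows from the thread (embedded for self-containment).
rows='| automation-3-lxd-master    | ERROR   |                      |      | PERSISTENT |           | lxd-cloud-01 |
| automation-3-lxd-reporting | RUNNING | 240.0.116.89 (eth0)  |      | PERSISTENT |           | lxd-cloud-02 |'
# Field 3 is the state column, field 2 the name, field 8 the location;
# print "<name> <location>" for rows whose state is ERROR.
bad=$(printf '%s\n' "$rows" | awk -F'|' '$3 ~ /ERROR/ {gsub(/ /,"",$2); gsub(/ /,"",$8); print $2, $8}')
echo "$bad"
```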
Here is network info from one of the running nodes, if it matters:
Name: lxdfan0
MAC address: 06:d1:fc:d5:62:71
MTU: 1450
State: up
Ips:
inet 240.0.208.1
inet6 fe80::b4ee:d3ff:fe99:e1d
Network usage:
Bytes received: 93.57MB
Bytes sent: 24.96GB
Packets received: 1787824
Packets sent: 3869477
Let me know if there is anything else you need or if there is a way I can debug what is happening with the db.
Looking through the logs, I think I found the spot where the db died and couldn't restart:
t=2019-07-09T11:20:11+0000 lvl=dbug msg=Handling ip=172.18.0.208:54348 method=GET url=/1.0/networks/lxdfan0/state user=
t=2019-07-09T11:20:11+0000 lvl=dbug msg="\n\t{\n\t\t\"type\": \"sync\",\n\t\t\"status\": \"Success\",\n\t\t\"status_code\": 200,\n\t\t\"operation\": \"\",\n\t\t\"error_code\": 0,\n\t\t\"error\": \"\",\n\t\t\"metadata\": {\n\t\t\t\"addresses\": [\n\t\t\t\t{\n\t\t\t\t\t\"family\": \"inet\",\n\t\t\t\t\t\"address\": \"240.1.33.1\",\n\t\t\t\t\t\"netmask\": \"8\",\n\t\t\t\t\t\"scope\": \"global\"\n\t\t\t\t},\n\t\t\t\t{\n\t\t\t\t\t\"family\": \"inet6\",\n\t\t\t\t\t\"address\": \"fe80::387f:f4ff:fe4e:b87\",\n\t\t\t\t\t\"netmask\": \"64\",\n\t\t\t\t\t\"scope\": \"link\"\n\t\t\t\t}\n\t\t\t],\n\t\t\t\"counters\": {\n\t\t\t\t\"bytes_received\": 69342019,\n\t\t\t\t\"bytes_sent\": 21691212517,\n\t\t\t\t\"packets_received\": 1329270,\n\t\t\t\t\"packets_sent\": 2459091\n\t\t\t},\n\t\t\t\"hwaddr\": \"1e:c8:8c:b0:d9:7d\",\n\t\t\t\"mtu\": 1450,\n\t\t\t\"state\": \"up\",\n\t\t\t\"type\": \"broadcast\"\n\t\t}\n\t}"
t=2019-07-09T11:20:15+0000 lvl=warn msg="Error during dqlite proxy copy: tls: use of closed connection"
t=2019-07-09T11:20:15+0000 lvl=dbug msg="Database error: failed to begin transaction: failed to receive response: failed to receive header: failed to receive header: EOF"
t=2019-07-09T11:20:15+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: failed to receive response: failed to receive header: failed to receive header: EOF"
t=2019-07-09T11:20:15+0000 lvl=dbug msg="Disconnected event listener: 02111824-1915-4174-a82f-49f9fcf0b1a8"
t=2019-07-09T11:20:16+0000 lvl=dbug msg="Dqlite: closing client"
t=2019-07-09T11:20:16+0000 lvl=dbug msg="Found cert" name=0
t=2019-07-09T11:20:16+0000 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.18.1.33:8443 attempt=0"
t=2019-07-09T11:20:16+0000 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.18.0.182:8443 attempt=0"
t=2019-07-09T11:20:16+0000 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: Head https://172.18.0.116:8443/internal/database: dial tcp 172.18.0.116:8443: connect: connection refused address=172.18.0.116:8443 attempt=0"
t=2019-07-09T11:20:16+0000 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=0"
t=2019-07-09T11:20:16+0000 lvl=dbug msg="Found cert" name=0
Hey, I updated to 3.15 edge last Friday, and after an initial crash during the upgrade I haven't had any database issues since.
Good to hear. Let's hope whatever the source of this was got squashed; if not, let us know and we'll dig deeper.
Required information
The output of "lxc info":
Issue description
Upgraded from the 3.14 candidate snap to edge to get the new database code for my LXD cluster. It worked fine for a day, but now the cluster seemingly can't figure out who the database leader is.
Steps to reproduce
Not sure how reproducible this is, but this is what I did:
1. snap remove lxd on all nodes in the LXD cluster
2. snap install --channel=edge lxd on all the nodes in the LXD cluster
3. lxc commands fail with Error: cannot fetch node config from database: not leader
Information to attach