canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

'Failed to get current cluster nodes: Failed to fetch nodes: not leader' #5896

Closed: RVCU closed this issue 5 years ago

RVCU commented 5 years ago

Required information

Issue description

Upgraded my LXD cluster from the 3.14 candidate snap to edge to get the new database code. It worked fine for a day, and now the cluster seemingly can't figure out who the database leader is.

Steps to reproduce

Not sure how reproducible this is, but this is what I did:

  1. snap remove lxd on all nodes in the LXD cluster
  2. snap install --channel=edge lxd on all the nodes in LXD cluster
  3. wait a day
  4. the database has become corrupted and now all lxc commands fail with: Error: cannot fetch node config from database: not leader
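
In shell terms, steps 1 and 2 were roughly the following on every node (a sketch only; the subsequent cluster re-join is not shown):

```
# Rough per-node sequence from the steps above
snap remove lxd                     # remove the 3.14 candidate snap
snap install --channel=edge lxd     # reinstall from the edge channel
```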

Information to attach

stgraber commented 5 years ago

@freeekanayaka can you take a look?

freeekanayaka commented 5 years ago

There is actually some chance that this issue might have been fixed by #5898. Please could you make sure your edge snap is up to date on all nodes (snap refresh lxd) and see if the problem happens again?
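
For example, something like this on each node would confirm the revision before and after the refresh (the exact snap list columns vary between snapd versions):

```
snap refresh lxd    # pull the latest edge build if a newer one is available
snap list lxd       # confirm the version/revision that is now installed
```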

stgraber commented 5 years ago

@RVCU please let us know if current edge still shows that problem for you

RVCU commented 5 years ago

Hi, all my nodes updated over the weekend:

lxd   git-d4623dc  11071  edge      canonical✓  -

So I am going to put the cluster back together and will let you know if the error persists.

RVCU commented 5 years ago

I was able to recreate the issue by adding 5 nodes back to the cluster and running some containers:

Error: Local node name: not leader
lxd_admin@lxd-cloud-01:~$ lxc ls
Error: cannot fetch node config from database: not leader

From the LXD log on one of the database nodes:


t=2019-07-04T15:04:35+0000 lvl=dbug msg=Handling ip=@ method=GET url="/1.0/containers?recursion=2" user=                                                                                                                                                                                          
t=2019-07-04T15:04:35+0000 lvl=dbug msg="Database error: Local node name: not leader"                                                                                                                                                                                                             
t=2019-07-04T15:04:35+0000 lvl=dbug msg="DBERR: containersGet: error \"Local node name: not leader\""                                                                                                                                                                                             
t=2019-07-04T15:04:35+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"Local node name: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"                                                                                                                                          
t=2019-07-04T15:04:35+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: not leader\"}"                                                                                                                                                           
t=2019-07-04T15:04:35+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"                                                                                                                   
t=2019-07-04T15:04:36+0000 lvl=dbug msg="Database error: Local node name: not leader"                                                                                                                                                                                                             
t=2019-07-04T15:04:36+0000 lvl=eror msg="Failed to load containers for scheduled snapshots" err="Local node name: not leader"                                                                                                                                                                     
t=2019-07-04T15:04:36+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"                                                                                                                                                                                                       
t=2019-07-04T15:04:36+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"                                                                                                                                                                                  
t=2019-07-04T15:04:37+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"                                                                                                                                                                                                       
t=2019-07-04T15:04:37+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"                                                                                                                                                                                  
t=2019-07-04T15:04:38+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"                                                                                                                                                                                                       
t=2019-07-04T15:04:38+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"                                                                                                                                                                                  
t=2019-07-04T15:04:39+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"                                                                                                                                                                                                       
t=2019-07-04T15:04:39+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"                                                                                                                                                                                  
t=2019-07-04T15:04:40+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"                                                                                                                                                                                                       
t=2019-07-04T15:04:40+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"                                                                                                                                                                                  
t=2019-07-04T15:04:41+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"                                                                                                                                                                                                       
t=2019-07-04T15:04:41+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"                                                                                                                                                                                  
t=2019-07-04T15:04:42+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"                                                                                                                                                                                                       
t=2019-07-04T15:04:42+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"                                                                                                                                                                                  
t=2019-07-04T15:04:43+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"                                                                                                                                                                                                       
t=2019-07-04T15:04:43+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"

RVCU commented 5 years ago

Also, I don't seem to run into this error until I add more than 3 nodes to the cluster, though that could just be a red herring since I only ran a couple of tests with just 3 nodes in the cluster.

Some more logs, taken after restarting one of the db nodes:

t=2019-07-04T15:14:21+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"                                                                                                                   
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: not leader\"}"                                                                                                                                                           
t=2019-07-04T15:14:21+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"                                                                                                                   
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: not leader\"}"                                                                                                                                                           
t=2019-07-04T15:14:21+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"                                                                                                                   
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: not leader\"}"                                                                                                                                                           
t=2019-07-04T15:14:21+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"                                                                                                                   
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: not leader\"}"                                                                                                                                                           
t=2019-07-04T15:14:21+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"                                                                                                                   
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: not leader\"}"                                                                                                                                                           
t=2019-07-04T15:14:21+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"                                                                                                                   
t=2019-07-04T15:14:21+0000 lvl=warn msg="Failed to rollback transaction after error (driver: bad connection): cannot rollback - no transaction is active"                                                                                                                                         
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"driver: bad connection\"}"                                                                                                                                                                                       
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Retry failed db interaction (driver: bad connection)"                                                                                                                                                                                                    
t=2019-07-04T15:14:21+0000 lvl=warn msg="Failed to rollback transaction after error (driver: bad connection): cannot rollback - no transaction is active"                                                                                                                                         
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"driver: bad connection\"}"                                                                                                                                                                                       
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Retry failed db interaction (driver: bad connection)"                                                                                                                                                                                                    
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: Failed to fetch nodes: not leader"                                                                                                                                                                                                       
t=2019-07-04T15:14:21+0000 lvl=warn msg="Failed to get current cluster nodes: Failed to fetch nodes: not leader"                                                                                                                                                                                  
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"cannot fetch node config from database: not leader\"}"                                                                                                                                                           
t=2019-07-04T15:14:21+0000 lvl=dbug msg="\n\t{\n\t\t\"error\": \"cannot fetch node config from database: not leader\",\n\t\t\"error_code\": 500,\n\t\t\"type\": \"error\"\n\t}"                                                                                                                   
t=2019-07-04T15:14:21+0000 lvl=warn msg="Failed to rollback transaction after error (driver: bad connection): cannot rollback - no transaction is active"                                                                                                                                         
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Database error: &errors.errorString{s:\"driver: bad connection\"}"                                                                                                                                                                                       
t=2019-07-04T15:14:21+0000 lvl=dbug msg="Retry failed db interaction (driver: bad connection)"     

But interestingly enough, it seems to know which containers are on which host when I use the lxc cluster remove command:

$ lxc cluster remove lxd-cloud-05
Error: Node still has the following containers: automation-2-lxd-reporting, automation-2-lxd-ambari, automation-4-lxd-master, automation-4-lxd-reporting, automation-4-lxd-search

stgraber commented 5 years ago

@freeekanayaka ^

freeekanayaka commented 5 years ago

I merged a branch into dqlite that should hopefully fix this (see https://github.com/CanonicalLtd/dqlite/pull/136), or at least it should make LXD retry the transaction (as it's supposed to do in these cases). It will be built by our CI in the next few hours and automatically included in a new revision of the edge snap.

It's still a bit weird that you get this error in the first place. When is the error happening? Right after adding a new node? Or even after the cluster has been stable for a while and no nodes have been added or removed?

If you got this error while the cluster was stable, it might be worth investigating further, since it means that the cluster leader was not able to maintain leadership, most probably because it could not communicate with the rest of the nodes quickly enough. So it would be either a network issue on your part, or the default timeouts in LXD are too aggressive and cause spurious leader demotions.

RVCU commented 5 years ago

@freeekanayaka I get the error more frequently when I add new nodes, but it also happens after the cluster has been stable for some time. I believe it's most likely network issues on my end, as I am now noticing dropped connections to and between the cluster nodes. Is there a way to configure the timeouts so I can see if that helps?

I appreciate the help, I'll update to the latest edge when it's available.

freeekanayaka commented 5 years ago

@RVCU unfortunately those timeouts are currently not exposed.

First of all, please let me know if the latest edge update solves the issue (in terms of handling these disconnections gracefully).

Secondly, my goal was to somehow be able to adjust those timeouts automatically, based on error statistics collected at runtime (e.g. make the timeouts less aggressive if too many errors happen over a certain period of time). But I don't think I will have time to work on that soon.

At the moment the timeouts are: a node decides that the leader has died after 3 seconds without receiving any message from it, and the leader sends a message to every other node every 500 milliseconds, so you need 6 lost messages over 3 seconds in order to have disruption. These values are very conservative and have been picked to tolerate most network setups. I'm really surprised this is not relaxed enough in your case. Could you please elaborate on why your network would fail to sustain such load? Is it extremely saturated?
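
As a quick sanity check, something along these lines (run from one of the members, with the placeholders replaced by your nodes' addresses) would show whether latency or packet loss is anywhere near those thresholds:

```
# Heartbeats go out every 500 ms and the election timeout is 3 s (6 missed heartbeats),
# so sustained latency spikes or loss anywhere near that range would explain leader demotions.
for host in "$NODE1" "$NODE2" "$NODE3"; do   # set these to the other cluster members
    ping -c 20 -i 0.2 "$host" | tail -n 2    # packet loss and rtt min/avg/max summary
    nc -zv -w 2 "$host" 8443                 # check that the LXD cluster port is reachable
done
```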

RVCU commented 5 years ago

@freeekanayaka I updated to the latest edge and had good results for about a day before the problem recurred. Now LXD is stuck in this loop indefinitely on one of the db nodes:

t=2019-07-09T01:15:31+0000 lvl=dbug msg="Found cert" name=0
t=2019-07-09T01:15:31+0000 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:172.18.1.33:8443} {ID:2 Address:172.18.0.182:8443} {ID:3 Address:172.18.0.116:8443}]" 
t=2019-07-09T01:15:47+0000 lvl=dbug msg="Found cert" name=0
t=2019-07-09T01:15:47+0000 lvl=dbug msg="Replace current raft nodes with [{ID:1 Address:172.18.1.33:8443} {ID:2 Address:172.18.0.182:8443} {ID:3 Address:172.18.0.116:8443}]" 

I don't think the network is being saturated, but we are experiencing some dropped connections that might be putting the db in a bad state from which it then fails to recover. The current failure is such that only one of my db nodes is failing to recover, and it even fails to restart with sudo snap restart lxd. The other nodes seem to be fine and are showing the failed node (lxd-cloud-01) as online:

+--------------+---------------------------+----------+--------+-------------------+
|     NAME     |            URL            | DATABASE | STATE  |      MESSAGE      |
+--------------+---------------------------+----------+--------+-------------------+
| lxd-cloud-01 | https://172.18.1.33:8443  | YES      | ONLINE | fully operational |
+--------------+---------------------------+----------+--------+-------------------+
| lxd-cloud-02 | https://172.18.0.116:8443 | YES      | ONLINE | fully operational |
+--------------+---------------------------+----------+--------+-------------------+
| lxd-cloud-03 | https://172.18.0.182:8443 | YES      | ONLINE | fully operational |
+--------------+---------------------------+----------+--------+-------------------+
| lxd-cloud-04 | https://172.18.0.208:8443 | NO       | ONLINE | fully operational |
+--------------+---------------------------+----------+--------+-------------------+
| lxd-cloud-05 | https://172.18.0.101:8443 | NO       | ONLINE | fully operational |
+--------------+---------------------------+----------+--------+-------------------+
| lxd-cloud-06 | https://172.18.0.134:8443 | NO       | ONLINE | fully operational |
+--------------+---------------------------+----------+--------+-------------------+

But the containers running on lxd-cloud-01 show up as STATE=ERROR when I run lxc ls:

+----------------------------+---------+----------------------+------+------------+-----------+--------------+
| automation-3-lxd-master    | ERROR   |                      |      | PERSISTENT |           | lxd-cloud-01 |
+----------------------------+---------+----------------------+------+------------+-----------+--------------+
| automation-3-lxd-reporting | RUNNING | 240.0.116.89 (eth0)  |      | PERSISTENT |           | lxd-cloud-02 |
+----------------------------+---------+----------------------+------+------------+-----------+--------------+
| automation-3-lxd-search    | ERROR   |                      |      | PERSISTENT |           | lxd-cloud-01 |
+----------------------------+---------+----------------------+------+------------+-----------+--------------+
| automation-3-lxd-stream    | RUNNING | 240.0.116.137 (eth0) |      | PERSISTENT |           | lxd-cloud-02 |
+----------------------------+---------+----------------------+------+------------+-----------+--------------+

Here is the network info from one of the running nodes, in case it matters:

Name: lxdfan0
MAC address: 06:d1:fc:d5:62:71
MTU: 1450
State: up

Ips:
  inet  240.0.208.1
  inet6 fe80::b4ee:d3ff:fe99:e1d

Network usage:
  Bytes received: 93.57MB
  Bytes sent: 24.96GB
  Packets received: 1787824
  Packets sent: 3869477

Let me know if there is anything else you need or if there is a way I can debug what is happening with the db.
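
If it helps, I can pull more state from the affected node with something like this (assuming the edge snap still exposes the lxd sql subcommand):

```
snap logs -n 100 lxd                                   # recent daemon log lines from the snap
journalctl -u snap.lxd.daemon -n 100 --no-pager        # the same via systemd, if preferred
lxc cluster list                                       # cluster view as seen from a healthy node
lxd sql global "SELECT id, name, address FROM nodes"   # membership as recorded in the global database
```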

RVCU commented 5 years ago

Looking through the logs, I think I found the spot where the db died and couldn't restart:

t=2019-07-09T11:20:11+0000 lvl=dbug msg=Handling ip=172.18.0.208:54348 method=GET url=/1.0/networks/lxdfan0/state user=
t=2019-07-09T11:20:11+0000 lvl=dbug msg="\n\t{\n\t\t\"type\": \"sync\",\n\t\t\"status\": \"Success\",\n\t\t\"status_code\": 200,\n\t\t\"operation\": \"\",\n\t\t\"error_code\": 0,\n\t\t\"error\": \"\",\n\t\t\"metadata\": {\n\t\t\t\"addresses\": [\n\t\t\t\t{\n\t\t\t\t\t\"family\": \"inet\",\n\t\t\t\t\t\"address\": \"240.1.33.1\",\n\t\t\t\t\t\"netmask\": \"8\",\n\t\t\t\t\t\"scope\": \"global\"\n\t\t\t\t},\n\t\t\t\t{\n\t\t\t\t\t\"family\": \"inet6\",\n\t\t\t\t\t\"address\": \"fe80::387f:f4ff:fe4e:b87\",\n\t\t\t\t\t\"netmask\": \"64\",\n\t\t\t\t\t\"scope\": \"link\"\n\t\t\t\t}\n\t\t\t],\n\t\t\t\"counters\": {\n\t\t\t\t\"bytes_received\": 69342019,\n\t\t\t\t\"bytes_sent\": 21691212517,\n\t\t\t\t\"packets_received\": 1329270,\n\t\t\t\t\"packets_sent\": 2459091\n\t\t\t},\n\t\t\t\"hwaddr\": \"1e:c8:8c:b0:d9:7d\",\n\t\t\t\"mtu\": 1450,\n\t\t\t\"state\": \"up\",\n\t\t\t\"type\": \"broadcast\"\n\t\t}\n\t}"
t=2019-07-09T11:20:15+0000 lvl=warn msg="Error during dqlite proxy copy: tls: use of closed connection"
t=2019-07-09T11:20:15+0000 lvl=dbug msg="Database error: failed to begin transaction: failed to receive response: failed to receive header: failed to receive header: EOF"
t=2019-07-09T11:20:15+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: failed to receive response: failed to receive header: failed to receive header: EOF"
t=2019-07-09T11:20:15+0000 lvl=dbug msg="Disconnected event listener: 02111824-1915-4174-a82f-49f9fcf0b1a8"
t=2019-07-09T11:20:16+0000 lvl=dbug msg="Dqlite: closing client"
t=2019-07-09T11:20:16+0000 lvl=dbug msg="Found cert" name=0
t=2019-07-09T11:20:16+0000 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.18.1.33:8443 attempt=0"
t=2019-07-09T11:20:16+0000 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: 503 Service Unavailable address=172.18.0.182:8443 attempt=0"
t=2019-07-09T11:20:16+0000 lvl=dbug msg="Dqlite: server connection failed err=failed to establish network connection: Head https://172.18.0.116:8443/internal/database: dial tcp 172.18.0.116:8443: connect: connection refused address=172.18.0.116:8443 attempt=0"
t=2019-07-09T11:20:16+0000 lvl=dbug msg="Dqlite: connection failed err=no available dqlite leader server found attempt=0"
t=2019-07-09T11:20:16+0000 lvl=dbug msg="Found cert" name=0

RVCU commented 5 years ago

Hey, I updated to the 3.15 edge snap last Friday, and after an initial crash during the upgrade I haven't had any database issues since.

stgraber commented 5 years ago

Good to hear. Let's hope whatever the source of this was got squashed; if not, let us know and we'll dig deeper.