etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Tested clustering limitations #11065

Closed mckenziec closed 4 years ago

mckenziec commented 5 years ago

Platform: CentOS7
etcd version: 3.3.11
Description: I'm developing a distributed appliance management application that needs to reliably share/replicate data across multiple servers/VMs, without the heavy orchestration of a typical database. We're already using Atomic OS (hopefully Redhat/Core OS when it's fully ready), so etcd was a natural first step.

My design is supposed to be automated and allow for 1 to 10+ appliance installations where key node-specific data is replicated to other nodes. I stood up a 4 node cluster with full TLS and some user/role configuration to play with. I went as far as developing a library to read/write/export the data structures my application requires. I was expecting that if a node was unable to reach the rest of the cluster, its store would still be available locally, and I could codify how I want my appliance to deal with being "offline" for too long, until it re-establishes its availability.

I'm going to describe some of the limitations I've encountered with clustered etcd, and maybe someone can comment on whether I'm using it wrong, or whether etcd is aimed at a different audience. This isn't a complaint session; I just want to understand if I'm using the wrong solution.

  1. Availability/fail-over is impossible when only 2 nodes are in a cluster: I started with a 2 node configuration and set up a working cluster (eg put/get data from one to the other, replication works, etc...) When one goes down, the other enters endless leader election mode. It's the only one left, yet raft seems to need at least 1 other node available to elect a leader. So fail-over needs at least 3 etcd nodes, so that if 1 goes down, the other 2 can pick a leader (see the quorum sketch after this list). What if a 4 node cluster is disconnected (2 + 2 because a VPN or router is down)? Are all 4 nodes unusable? I'm really surprised by this limitation. I must be doing something wrong.

  2. Joining a new etcd node to a cluster requires almost all details about the existing cluster: When joining a new node, the member needs to be added in advance (with its peering connection details), then on the new node you have to start etcd with cluster-state "existing" and initial-cluster containing exactly all nodes in the cluster with their names and peer urls. This is a pretty synchronized procedure, because after you join, everything is stored in the data directory, so subsequent starts don't need any cluster configuration at all. I find this makes adding a node kind of fragile, and kind of unnecessary, as the cluster already knows about you (you had to add it as a member). I can automate this, but it seems redundant when the new node could get the member list from the cluster, given you already added it as a member.

  3. During the leader election process you can't access any node's keys/values: When a node goes down, no get/put requests are serviced. If you only have 2 nodes in the cluster, the election process never ends (see 1 above), so your application will never get access to that data, resulting in a bit of a cascading failure.
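For context on (1) and (3): Raft can only make progress while a majority of members, floor(n/2) + 1, is reachable. Worked out for small clusters:

# Raft quorum arithmetic: quorum = floor(n/2) + 1
#   n=1 -> quorum 1 -> tolerates 0 failures
#   n=2 -> quorum 2 -> tolerates 0 failures (losing either node stalls the cluster)
#   n=3 -> quorum 2 -> tolerates 1 failure
#   n=4 -> quorum 3 -> tolerates 1 failure (a 2+2 network split stalls both halves)
#   n=5 -> quorum 3 -> tolerates 2 failures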

I've included the scripts I used in my testing below (I left out the TLS configuration bits):

# On the 1st node (I used Centos7 minimal, with etcd installed)
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --reload

export CL_NAME=etcd1
export HOST=$(hostname)
export IP_ADDR=$(ip -4 addr show ens33 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
export TOKEN=$(date +%s | sha256sum | base64 | head -c 32 ; echo)

# turn on etcdctl v3 api support, why is this not default?!
export ETCDCTL_API=3

sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=https://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380 --initial-cluster="$CL_NAME=https://$IP_ADDR:2380" --initial-cluster-state new --initial-cluster-token $TOKEN

# Next, come up with a name for the next node (etcd2, then etcd3, etc...). Get its hostname/IP and add it as a future member

etcdctl --endpoints="https://127.0.0.1:2379" member add etcd2 --peer-urls="https://<next node's IP address>:2380"

# 1st etcd node is now running, with peering available and members
# added for the next nodes
# copy the output of "echo $TOKEN" for the next steps where it's needed
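As a side note (based on etcdctl v3 behaviour; the values shown are illustrative only), "member add" prints suggested settings for the new member that can be copied straight into its startup command:

# Example of the kind of output "member add" prints (IDs/IPs will differ):
#   ETCD_NAME="etcd2"
#   ETCD_INITIAL_CLUSTER="etcd1=https://<IP of 1st node>:2380,etcd2=https://<next node's IP address>:2380"
#   ETCD_INITIAL_CLUSTER_STATE="existing"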

# On the 2nd/next node (I used Centos7 minimal, with etcd installed)
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --reload

export CL_NAME=etcd2
export HOST=$(hostname)
export IP_ADDR=$(ip -4 addr show ens33 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
export ETCDCTL_API=3

export TOKEN=<TOKEN string from above>

sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=https://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380 --initial-cluster-state existing --initial-cluster-token $TOKEN --initial-cluster="etcd1=https://<IP of 1st node>:2380,etcd2=https://$IP_ADDR:2380"

# NOTE: --initial-cluster must ALWAYS list all nodes in the
# cluster, with their names and peer urls, otherwise it won't join

# Here's an example for the 3rd node
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=https://127.0.0.1:2379,https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380 --initial-cluster-state existing --initial-cluster-token $TOKEN --initial-cluster="etcd1=https://<IP of 1st node>:2380,etcd2=https://<IP of 2nd node>:2380,etcd3=https://<IP of 3rd node>:2380"
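To sanity-check the cluster after each join, something along these lines should work (a sketch: the endpoints are placeholders and the TLS cert flags are again omitted):

# List members and check health/status from any node
etcdctl --endpoints="https://127.0.0.1:2379" member list
etcdctl --endpoints="https://<IP of 1st node>:2379,https://<IP of 2nd node>:2379,https://<IP of 3rd node>:2379" endpoint health
etcdctl --endpoints="https://<IP of 1st node>:2379" endpoint status --write-out=table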

When you get down to 2 running nodes (kill one of them), a new leader will be elected right away. When you're down to 1, the entire effort is useless. If you bring 1 back up again, it'll all work again.

jingyih commented 5 years ago

I did a quick pass on the limitations you described. It seems most of them are due to the nature of the consensus protocol. To ensure correctness, the cluster does require a majority of servers to be available. A fault tolerance table can be found here [1]. To understand why a majority is required (and why a leader's presence is required), please refer to the Raft consensus protocol used by etcd [2]. This answers your questions (1) and (3). Regarding (2), it is deliberately designed this way, to prevent a new server from accidentally joining an existing cluster and contaminating the data.

[1] https://github.com/etcd-io/etcd/blob/bfcd590f059556323c6f332d2abec909831f4d0e/Documentation/v2/admin_guide.md#fault-tolerance-table
[2] https://github.com/ongardie/dissertation/blob/master/book.pdf

mckenziec commented 5 years ago

Thanks for the quick reply jingyih. First, I don't want you or anyone to take my responses negatively. I'm just trying to emphasize that these limitations restrict etcd to use cases that the real world will easily extend beyond. etcd has good adoption, but I don't feel these real-world limitations are as clear or as up-front as they should be.

For (3), none of those references justify the real world implication of not allowing a single, disconnected node to service local client requests. etcd is a lower level service. Low enough that the applications integrating with it must accept a modicum of responsibility in ensuring the type of data stored/read is suitable. So if a "status" key/value is replicated, I would typically check the age of the value and implement my own application-level "freshness" indicator for the user, etc... I would rather see a temporarily orphaned node be able to service local client requests than ruin any application's ability to work at all because etcd is electing a leader in a 1 node situation. etcd's inability to service clients in this case shouldn't pause anything that might only need a 6-month-old "debug=false" value.

For (2), the "member add" is totally understandable, and coupled with TLS peer client auth, it's pretty safe to join nodes nicely and automatically. However, the requirement to specify all nodes in the existing cluster, plus the new one, in the --initial-cluster for the new node is silly. Nodes 1, 2 and 3 will all be replicating the "member add", and the new node will have that same data replicated to it as soon as it's able to. Requiring a new node to specify an ever-growing list of all other nodes in the cluster in order to join isn't good usability. By the time I add a 20th node, the --initial-cluster string will be a page long. I also worry about whether I should include every peer url (hostname/fqdn/IPs) for each node in the command, in "member add", etc..., and what happens if they change.

This can and should be streamlined, with peer client auth (with CA certificates, not self-signed) as the mechanism to permit new nodes to join. Maybe a discovery service sidesteps this, but it doesn't absolve it.

For (1), the RAFT paper and admin guides don't excuse the service's inability to operate with only 1 node, when you know that if you set up 1 node without peering, it works fine as 1 node. What's the point of 2 nodes if you need 3 just to handle 1 node failing? A colleague joked that I should run 3 etcd docker containers on every server.

The RAFT leader/election protocol in the paper, like all academic papers, tries to solve a problem narrower than the real world, and can't/doesn't take responsibility for anything outside its scope, or for a real world implementation. (Do you think Sergey and Larry still use PageRank at Google for all search, or was it just the basis for an approach that's been blown up by real world realities?)

etcd is a real world capability that kubernetes and others use. These limitations affect everyone that builds anything on top of this project.

Thanks for listening.

jingyih commented 5 years ago

No problem. Happy to help.

> For (3), none of those references justify the real world implication of not allowing a single, disconnected node to service local client requests. etcd is a lower level service. Low enough that the applications integrating with it must accept a modicum of responsibility in ensuring the type of data stored/read is suitable. So if a "status" key/value is replicated, I would typically check the age of the value and implement my own application-level "freshness" indicator for the user, etc... I would rather see a temporarily orphaned node be able to service local client requests than ruin any application's ability to work at all because etcd is electing a leader in a 1 node situation. etcd's inability to service clients in this case shouldn't pause anything that might only need a 6-month-old "debug=false" value.

A disconnected node can still serve certain types of API requests. For example, if stale data is acceptable, specify --consistency=s. By default, range requests are guaranteed to be linearizable, i.e. --consistency=l, which means they cannot be served by a node that is disconnected from the majority of the cluster.
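For illustration (the key name is made up, and the endpoints/TLS flags follow the scripts above), the difference looks roughly like this with ETCDCTL_API=3:

# Linearizable read (the default): goes through the leader, fails without quorum
etcdctl --endpoints="https://127.0.0.1:2379" get /app/status --consistency=l
# Serializable read: answered from the local member's store; may be stale, but
# still works on a member cut off from the rest of the cluster
etcdctl --endpoints="https://127.0.0.1:2379" get /app/status --consistency=s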

> For (2), the "member add" is totally understandable, and coupled with TLS peer client auth, it's pretty safe to join nodes nicely and automatically. However, the requirement to specify all nodes in the existing cluster, plus the new one, in the --initial-cluster for the new node is silly. Nodes 1, 2 and 3 will all be replicating the "member add", and the new node will have that same data replicated to it as soon as it's able to. Requiring a new node to specify an ever-growing list of all other nodes in the cluster in order to join isn't good usability. By the time I add a 20th node, the --initial-cluster string will be a page long. I also worry about whether I should include every peer url (hostname/fqdn/IPs) for each node in the command, in "member add", etc..., and what happens if they change.

Maybe look into the etcd operator, which helps create and manage an etcd cluster. For initial bootstrapping, please also refer to the etcd discovery service [1]. I think there might be other options which could help automate cluster creation and reconfiguration, but I am not experienced in this area.
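As a rough sketch of the discovery-based bootstrap described in [1] (the cluster size and token URL are illustrative, and the same TLS caveats as in the scripts above apply):

# Ask the public discovery service for a bootstrap URL sized for 3 members
curl https://discovery.etcd.io/new?size=3
# -> returns a URL like https://discovery.etcd.io/<token>

# Start every node with that URL instead of --initial-cluster
sudo etcd --name $CL_NAME --data-dir ~/data --advertise-client-urls=https://$IP_ADDR:2379 --listen-client-urls=https://0.0.0.0:2379 --initial-advertise-peer-urls https://$IP_ADDR:2380 --listen-peer-urls https://$IP_ADDR:2380 --discovery https://discovery.etcd.io/<token>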

> For (1), the RAFT paper and admin guides don't excuse the service's inability to operate with only 1 node, when you know that if you set up 1 node without peering, it works fine as 1 node. What's the point of 2 nodes if you need 3 just to handle 1 node failing? A colleague joked that I should run 3 etcd docker containers on every server.

As stated in the doc, it is recommended to have an odd number of members in a cluster. As you noticed, an even number of members does not provide extra failure tolerance. This is determined by the underlying algorithm, which does not change whether it lives in an academic paper or a real world implementation.

[1] https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/clustering.md#discovery

jingyih commented 5 years ago

> etcd is a real world capability that kubernetes and others use. These limitations affect everyone that builds anything on top of this project.

You are absolutely right. etcd is a real world capability. It has evolved alongside Kubernetes and other projects along the way. However, it is important to realize that some of these limitations are fundamental and are required for correctness.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.