etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Consistent reads are not consistent #741

Closed: aphyr closed this issue 10 years ago

aphyr commented 10 years ago

Etcd aims to provide linearizable registers supporting reads, writes, and CAS operations. In a linearizable system, every operation must appear to take place at a single, atomic point in time between its invocation and completion. This is not the case in etcd: reads can return stale versions of nodes.

In short, it's possible to write "foo", write "bar", then read "foo".

Is this a theoretical or real problem?

Real. I can reproduce it reliably on a five-node cluster with only a small number of clients performing five concurrent ops per second. Since leadership transitions occur rapidly in etcd, even a transient hiccup in network or host latency can cause this issue. I have a failing case in Jepsen which triggers this behavior in roughly 20 seconds; see the client, test, and a sample run with analysis.

The window of inconsistency is relatively short--sub-second--but appears reliably during transition.

Why does this happen?

For performance reasons, when a read request arrives on a node which believes itself to be the leader, etcd reads the current state from the Raft FSM and returns it to the client. This is fast and simple, but not quite correct.

Raft allows multiple nodes to believe themselves the leader concurrently. If a client makes a request to a leader which has recently been isolated, that leader may believe itself to be the leader even though a new leader has been elected and is accepting new writes. Clients writing to one side of etcd will find that their writes are not reflected on the old leader, until the old leader steps down.
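
To make the failure mode concrete, here is a toy sketch in Go of what the fast-path read amounts to; the types and fields are illustrative, not etcd's actual code:

```go
package sketch

import "errors"

var errNotLeader = errors.New("not the leader")

type role int

const (
	follower role = iota
	leader
)

// server is a toy stand-in for a single etcd node.
type server struct {
	role role
	fsm  map[string]string // this node's copy of the key space
}

// handleRead models the fast path: answer from the local state machine
// whenever this node believes it is the leader, without checking that a
// quorum still agrees.
func (s *server) handleRead(key string) (string, error) {
	if s.role != leader {
		return "", errNotLeader
	}
	// Unsafe step: a partitioned ex-leader still passes the check above,
	// so it can return a value a newer leader has already overwritten.
	return s.fsm[key], nil
}
```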

How do we fix this?

Consistent reads must wait for at least one network round-trip; the leader needs acknowledgement from a quorum of followers that it is still the leader in order to service a read. One way to do this is to append an entry for the read operation to the Raft log and, once it has been committed, send the read result back to the client.

A more subtle way to fix this problem is to (invisibly) piggyback the read on the state of transactions the leader is already executing. In particular, the leader can track an outstanding, unrelated log entry and, if it successfully commits, use that information to assert that it is still the leader. I believe the linearizability window for this strategy allows reads to proceed concurrently with writes: the leader looks at its local FSM, chooses the value X it thinks is valid, waits for a commit to propagate, then returns X to the client.

It might also be possible to simply wait for a heartbeat from all follower nodes, confirming that they still believe the leader is valid. I'm not quite sure about the safety of this one.
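
Here is a minimal sketch, in Go, of how a quorum-confirmed read could be structured. The interface and method names are hypothetical (not etcd's or go-raft's actual API), and the ordering follows the common read-index style: capture the commit index, confirm leadership, then serve the read. The piggyback variant above instead picks the value first and confirms before replying.

```go
package sketch

import (
	"context"
	"errors"
)

var errLostLeadership = errors.New("leadership not confirmed by a quorum")

// raftNode is a hypothetical view of the consensus module; the method names
// are illustrative only.
type raftNode interface {
	CommitIndex() uint64                         // index of the latest committed entry
	ConfirmLeadership(ctx context.Context) error // one quorum round (heartbeat ack or committed entry)
	WaitApplied(ctx context.Context, index uint64) error
	Get(key string) string // read from the local FSM
}

// linearizableRead notes the commit index when the read arrives, confirms
// leadership with a quorum, waits until that index is applied locally, and
// only then answers from the local FSM.
func linearizableRead(ctx context.Context, n raftNode, key string) (string, error) {
	readIndex := n.CommitIndex()
	if err := n.ConfirmLeadership(ctx); err != nil {
		return "", errLostLeadership
	}
	if err := n.WaitApplied(ctx, readIndex); err != nil {
		return "", err
	}
	return n.Get(key), nil
}
```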

What are the costs?

At least one round-trip latency for each read. This makes reads only a little less expensive than writes in latency terms, but since reads don't need to touch the log directly, they don't balloon the size of the log. All required state can be kept in a single leader's memory.

Since the HTTP API terms these reads "consistent", I strongly advise making them consistent. It may be worth having three consistency tiers: reads served by any node (fastest, but potentially arbitrarily stale); reads served by whichever node currently believes it is the leader (the present behavior, which can be stale around leader transitions); and quorum-confirmed reads (linearizable, at the cost of a round trip).

Incidentally, I believe Consul is subject to a similar problem, but haven't verified experimentally yet. I hope you find this helpful! :)

xiang90 commented 10 years ago

@aphyr Thanks for the testing. I replied to this once at https://gist.github.com/xiangli-cmu/d6c59c42d72d3582db9d. I thought you were talking about the linearizability of the CAS operation on Twitter, which made me a little confused. Anyway, now I realize what you are talking about.

I think consistency != linearizability. etcd provides sequential consistency for all the write and read operations (in etcd we do not have a pure write operation: we only have write-and-read, or read) that go through the state machine. We never aimed at strict consistency (one-copy semantics). We redirect reads to the leader just to ensure that the client that sent the previous write operation can see its result with a subsequent read.

I totally understand the problem you are describing and am aware of how to fix it. But it is a tradeoff we made a long time ago. Actually, not only leader changes can cause the problem: two concurrent write/read operations from different clients can also lead to a stale read in etcd.

How to fix it? We can simply let reads also go through Raft without persisting them to disk. We do not even need to record them in memory. If we want, we can make that happen very soon.

How to handle this in etcd: we attach an index to each response. It is the logical clock in etcd, and we ensure linearizability from the logical-clock perspective. From the real-time perspective you may see write k->foo (index = 1), read k->bar (index = 0), read k->foo (index = 1). But when you reorder the operations by etcd time (the index) and analyze them, they should be linearizable.
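
To illustrate that analysis, here is a small sketch (my own, not part of etcd or Jepsen) that reorders a single-key history by the response index and checks that each read returns the latest preceding write:

```go
package sketch

import "sort"

// op is one observed operation plus the etcd index carried in its response.
type op struct {
	kind  string // "write" or "read"
	value string
	index uint64
}

// consistentInIndexOrder reorders a single-key history by the etcd index
// (the logical clock) and checks that every read returns the value of the
// most recent preceding write in that order.
func consistentInIndexOrder(history []op) bool {
	sort.Slice(history, func(i, j int) bool { return history[i].index < history[j].index })
	var current string
	for _, o := range history {
		switch o.kind {
		case "write":
			current = o.value
		case "read":
			if o.value != current {
				return false
			}
		}
	}
	return true
}
```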

Again: etcd tries to ensure sequential consistency, which means each replica has the same command execution ordering. With consistent reads, we ensure strict consistency for each individual client in most cases (when a leader dies, the new leader might have the command but not have committed it yet).

xiang90 commented 10 years ago

@aphyr For the leader shifting issue, please tune your etcd cluster according to: https://github.com/coreos/etcd/blob/master/Documentation/tuning.md.

In your case, you need to loosen the leader election timeout to the order of seconds.

aphyr commented 10 years ago

etcd tries to ensure sequential consistency, which means each replica has the same command execution ordering.

etcd provides sequential consistency for all the write and read operations (in etcd we do not have a pure write operation: we only have write-and-read, or read)

I think you can actually claim linearizability so long as all operations are write-and-read, but if you allow a process to perform a nominally consistent read, etcd is neither linearizable nor sequentially consistent. Consider this sequence of events:

  1. Client 1 writes A
  2. A leader transition occurs; both leaders have state A
  3. Client 1 writes B
  4. New leader has B, old leader has A; both consider themselves valid targets for reads
  5. Client 2 reads A from the old leader
  6. Client 2 reads B from the new leader
  7. Client 3 reads B from the new leader
  8. Client 3 reads A from the old leader

In this history, Client 2 sees [A B], but Client 3 sees [B A]. Sequential consistency is violated.

You can recover sequential consistency by forcing all consistent reads to include an index, and only satisfying a read when the node sees a committed index greater than or equal to the requested index, but this does have the drawback that clients are required to implement a local index FSM in order to provide a sequentially consistent API. You'll have to communicate that constraint clearly to client authors and other etcd API users.
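
A minimal sketch of that client-side index tracking, assuming reads and writes return the etcd index; the server-side half would be a hypothetical minimum-index parameter that delays a read until the node's committed index catches up:

```go
package sketch

import (
	"errors"
	"sync"
)

var errStaleRead = errors.New("response is older than an index this client already saw")

// indexGate is the small piece of client-side state described above: remember
// the highest etcd index observed so far and reject (or retry) any read whose
// response index is lower.
type indexGate struct {
	mu   sync.Mutex
	last uint64
}

// Observe records the index carried by a write response.
func (g *indexGate) Observe(index uint64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if index > g.last {
		g.last = index
	}
}

// CheckRead accepts a read only if it is at least as new as anything the
// client has already seen, then advances the watermark.
func (g *indexGate) CheckRead(index uint64) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if index < g.last {
		return errStaleRead
	}
	g.last = index
	return nil
}
```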

There's a larger question of whether users really consider sequential consistency to be "consistent". Users may not understand that a "consistent read" may actually provide stale data. If you choose this model, I suggest that in the documentation, everywhere the term "consistent read" is used, you explain exactly what behaviors are allowable. I'll bet you dollars to donuts that you've got a significant (if not majority) fraction of the user base assuming that consistent reads are linearizable. I did, and I think about these things carefully, haha. :)

For example, http://the.randomengineer.com/2014/01/27/using-etcd-as-a-clusterized-immediately-consistent-key-value-storage/ claims etcd is "immediately consistent", and could be forgiven for thinking so, as etcd's read consistency documentation says:

If your application wants or needs the most up-to-date version of a key then it should ensure it reads from the current leader. By using the consistent=true flag in your GET requests, etcd will make sure you are talking to the current master.

... which strongly implies that consistent=true ensures one will read the most up-to-date value.
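
For reference, such a read looks roughly like this in Go (the endpoint, port, and key are illustrative); as this issue shows, the flag only routes the request to a node that currently believes it is the leader, it does not guarantee the most recent value:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// A "consistent" read against the v2 HTTP API described in the docs
	// quoted above. At the time of this issue, consistent=true only makes
	// etcd send the read to whichever node believes it is the leader.
	resp, err := http.Get("http://127.0.0.1:4001/v2/keys/foo?consistent=true")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```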

two concurrent write/read operations from different clients can also lead to a stale read in etcd.

This may actually be OK; as long as the operations are concurrent, returning either the old or new value satisfies linearizability. :)

But when you reorder the operations by etcd time (the index) and analyze them, they should be linearizable.

This is true, but if clients are supposed to reorder operations in etcd index order, you've got a.) some serious documentation work to do, and b.) a lot of misbehaving clients. ;-)

@aphyr For the leader shifting issue, please tune your etcd cluster according to: https://github.com/coreos/etcd/blob/master/Documentation/tuning.md. In your case, you need to loosen the leader election timeout to the order of seconds.

This is how Consul got their system to pass Jepsen too, but adjusting the timeouts won't make etcd correct--it just makes the problem harder to demonstrate. Timeouts can help optimize performance, but without a realtime environment, they cannot ensure correctness.

Anyway, none of this is to say that etcd is wrong or bad; there are, as you've observed, good reasons to offer "mostly consistent" reads. You just gotta document those decisions. One easy way to do that would be to change the parameter for this read behavior to mostly-consistent.

xiang90 commented 10 years ago

I think you can actually claim linearizability so long as all operations are write-and-read, but if you allow a process to perform a nominally consistent read, etcd is neither linearizable nor sequentially consistent. Consider this sequence of events

Sequential consistency means the replicas execute the commands in the same order. We do not execute reads, so there is no guarantee for them.

In etcd, we return the result of a write, so it is actually a write-and-read. That is why we can safely return an index.

This may actually be OK; as long as the operations are concurrent, returning either the old or new value satisfies linearizability. :)

I think this violates linearizability.

If you have a clear use case for us to execute reads in the state machine, we would definitely like to add it. :)

This is how Consul got their system to pass Jepsen too, but adjusting the timeouts won't make etcd correct--it just makes the problem harder to demonstrate. Timeouts can help optimize performance, but without a realtime environment, they cannot ensure correctness.

I am not saying I want to cheat. This explains:

Since leadership transitions occur rapidly in etcd, even a transient hiccup in network or host latency can cause this issue

xiang90 commented 10 years ago

@aphyr We would like to change the docs as you suggest. It makes things clearer. I do not want to mislead people either.

aphyr commented 10 years ago

Sequential consistency means the replicas execute the commands in the same order. We do not execute reads, so there is no guarantee for them.

Aha! I think we may be operating under different concepts of sequential consistency. May I suggest this INRIA paper on causal, sequential, and linearizable consistency? In particular,

In all three cases a read operation returns the last value assigned to the variable... With sequential consistency, the processes [have] to agree on a total order of their read/write operations...

... which is violated by the histories we've been talking about.

I am not saying I want to cheat. This explains:

Since leadership transitions occur rapidly in etcd, even a transient hiccup in network or host latency can cause this issue

Ah! I understand now. Yes, in production I'd definitely adjust the timeouts, though that comes with consequences too: longer windows of unavailability for consistent operations during cluster transitions. If etcd were to offer consistent reads, you could actually get away with lower timeouts without compromising safety. Might be a worthwhile tradeoff for users.

aphyr commented 10 years ago

If you have a clear use case for us to execute reads in the state machine, we would definitely like to add it. :)

I dunno if you have to execute the read in the state machine or not (this is getting into some raft internals territory I'm unfamiliar with), but I think the obvious case for providing linearizability for consistent=true reads is "doing what the docs say etcd does", haha. Linearizability is definitely a useful property, and one that I think users are eager to have! :)

xiang90 commented 10 years ago

@aphyr Thanks for reporting all these. Seems like we are on the same page now.

So:

  1. We will change our docs to make things clearer.
  2. We will discuss letting a special read go through Raft.
  3. We will get back to you soon about our decision.

Jepsen is awesome! Thanks for all your effort.

aphyr commented 10 years ago

Sounds like a plan! And thanks again for all your hard work on etcd; it's been a real pleasure working with the system and community alike. :)

glycerine commented 10 years ago

To this naive reader, this issue appears to be a correctness bug, and thus a clear showstopper for any use of etcd. Why has a correctness issue been left to languish since April?

bmizerany commented 10 years ago

@glycerine Can you elaborate?

glycerine commented 10 years ago

Again, I'm naively summarizing the above, which seems to provide the detail: a read may actually provide stale data. This is never acceptable.

bmizerany commented 10 years ago

I didn't mean to close this.

aphyr commented 10 years ago

To clarify, I think there are good reasons for a read to provide stale data sometimes; you just shouldn't call it "consistent" or say it returns the most recent value. I think Consul chose a good balance of behaviors here: http://www.consul.io/docs/internals/consensus.html

glycerine commented 10 years ago

I can deliver incorrect and inconsistent results extremely quickly, and without running etcd or any complicated consensus system.

If I'm bothering with a distributed consensus system in the first place, it is because I want consistent results. Always. It should be impossible to do anything other than choice [2] below, quoting from the consul docs.

The three read modes are:

[1]default - Raft makes use of leader leasing, providing a time window in which the leader assumes its role is stable. However, if a leader is partitioned from the remaining peers, a new leader may be elected while the old leader is holding the lease. This means there are 2 leader nodes. There is no risk of a split-brain since the old leader will be unable to commit new logs. However, if the old leader services any reads the values are potentially stale. The default consistency mode relies only on leader leasing, exposing clients to potentially stale values. We make this trade off because reads are fast, usually strongly consistent, and only stale in a hard to trigger situation. The time window of stale reads is also bounded, since the leader will step down due to the partition.

[2]consistent - This mode is strongly consistent without caveats. It requires that a leader verify with a quorum of peers that it is still leader. This introduces an additional round-trip to all server nodes. The trade off is always consistent reads, but increased latency due to an extra round trip.

[3]stale - This mode allows any server to service the read, regardless of if it is the leader. This means reads can be arbitrarily stale, but are generally within 50 milliseconds of the leader. The trade off is very fast and scalable reads but values will be stale. This mode allows reads without a leader, meaning a cluster that is unavailable will still be able to respond.

Rhetorically: Why would I want to run a system that lets me down at the exact point at which I need its guarantees the most? "We are correct, except during a partial system failure" is no correctness at all.

kelseyhightower commented 10 years ago

@glycerine Thank you for your feedback, and know that the CoreOS team is taking this matter very seriously. But as with all things, there is a process to solving this problem "right now", which is updating our docs and engaging in conversation with the community.

The next step for the community, which includes CoreOS, is to understand the tradeoffs that we've made in the current design and work together to reexamine those decisions and strongly consider changing them for future releases of etcd.

There is a lot of work going on in collaboration with the community to improve the quality and stability of etcd. We are currently in the late stages of research and design, which is being heavily influenced by feedback from you and other etcd users.

At the end of the day we all want etcd to be solid and live up to the promises we make. Thanks for your patience as we continue to work hard to make that happen.

ptomasroos commented 10 years ago

For me, with few reads and writes, I would definitely want to configure this tradeoff so that reads go through the state machine and are fully consistent. As raftd is not active and etcd is considered the closest to production use cases, I would really like to see this improved.

This is a "showstopper" for us for deploying our upcoming software release, since we assumed etcd wasn't returning stale reads (silly us, yes).

Anyway, +1 for fixing this and making reads consistent. Let the user decide via a global etcd configuration.

glycerine commented 10 years ago

Am I to understand that there is no workaround here? I figured someone would chime in with "oh, just change this default setting to XYZ, and then your reads will be strongly consistent, even though they will be slower."

Is the state of the system really that etcd lacks a strongly consistent mode at all? Nothing equivalent to mode [2] of Consul (above)?

xiang90 commented 10 years ago

@glycerine I provided a hint at a workaround in current etcd in my first reply to @aphyr, but I do not encourage people to use it. What is your use case for a consistent read?

As @kelseyhightower mentioned, we take our system seriously and we are not ready to make a decision right now.

glycerine commented 10 years ago

Hi Xiang,

In reality, presumption falls the other way. I always expect consistency. What possible use case is there for an inconsistent consensus protocol?

If I want inconsistent results, I can just skip running a consensus system. Much, much easier. Same result when it matters most.

Could you be more specific about your hint? I don't see any way forward in your above comments, for those of us who need to be able to read the last write reliably.

Thanks, Jason

chadharrington commented 10 years ago

I have to agree with @glycerine. I don't have any use for a consensus system which can return stale reads. Correctness is much more important than performance in my use cases (distributed consensus, locking, transaction monitoring, etc.) PLEASE implement linearizable reads.

Thank you for your great work on etcd and CoreOS. They are great products which we hope to use in production soon.

jefferai commented 10 years ago

Also in agreement with @glycerine. One of the main draws of etcd for me was its presumed use as a consistent consensus system. Otherwise I could use any RDBMS...but I don't want that. I'm happy to have lookups take additional time...I'm not expecting etcd to be blazing fast. What I am expecting is for it to be correct.

philips commented 10 years ago