etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Defining lower lease TTL than the default hardcoded min value #6025

Closed janosi closed 8 years ago

janosi commented 8 years ago

According to our observations, the minimum lease TTL value is hardcoded to 5s in etcd v3, see 1. We would like to have the possibility to define a lower value than this when claiming a lease. Would it be possible to have functionality where:

xiang90 commented 8 years ago

@janosi Setting the lease TTL low is expensive and dangerous. Whatever TTL you set is just a hint to the server; the server decides the actual TTL and might even extend it in some cases (e.g. when a new leader is elected). The guarantee is that the lease will expire as you expect when there are no server failures or sudden overload events. 5 seconds seems like a good lower bound to me. You might want to set it under 5 seconds for testing only, I guess.

CsatariGergely commented 8 years ago

@xiang90 can you please elaborate a bit on what the dangers of a low TTL setting are? In my understanding, the aim of the TTL is to enforce a refresh of the data, so the user of the data can take it for granted that the data is not older than its TTL. If the TTL is only a suggestion and can be changed by the server due to "internal" events of the server, then the TTL provides no guarantee about the validity of the data. We (I work together with @janosi) would like to use sub-5s TTLs in production.

xiang90 commented 8 years ago

@CsatariGergely The guarantee is that the lease will not expire before the announced TTL. The main use case of the TTL is to implement locking and application leases. They require this behavior, not the reverse (the lease expiring before its TTL). We cannot keep the exact TTL for a few reasons:

  1. A lease cannot be renewed when the majority of servers are down. We should not expire leases due to server-side errors, because of a few important use cases like locking: we should not unlock due to server errors.
  2. To guarantee that the lease expires by its TTL, we would have to guess how long the expiry process takes in a distributed environment and start it early. If there is any failure, it gets more complicated.

Moreover, TTL renewal takes resources: a low TTL means frequent renewals. A low TTL usually means you are trying to do something that a lease is not suitable for. What is your use case?

CsatariGergely commented 8 years ago

@xiang90 Thanks for the explanation. It seems to me we have two discussions now: a) What is the meaning of the TTL? Is it a guarantee that the data will be deleted after the TTL expires, or a guarantee that the data will not be deleted before the TTL expires? Is it the server's or the client's responsibility to act according to the TTL? b) Why is it good/bad to set a TTL lower than 5s?

We would like to use the TTL as a guarantee that the stored data is still valid: if the data is not refreshed within a certain time, it will be deleted. This is somewhat different from the usage you indicated with locking and application leases.

If a low TTL means only more frequent renewals and no other problem from the server's point of view, then it could be the client's decision whether it wants to spend the additional resources on the low TTL or not.

xiang90 commented 8 years ago

If the data is not refreshed within a certain time, it will be deleted.

What does "will" mean here, exactly? Within a bounded time? Under what assumptions? No server failures?

If a low TTL means only more frequent renewals and no other problem from the server's point of view

Remember that we need to refresh the TTL once there is a leader switch. The minimum TTL needs to be larger than the election timeout, which is usually 3 seconds or even 5 seconds.

xiang90 commented 8 years ago

@CsatariGergely @janosi Does that answer your questions? For more information, you might want to read the original Chubby paper at http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf. We made some minor changes to simplify the protocol a little bit, but the main idea remains the same.

mauiarches commented 8 years ago

@xiang90, a basic question: why is there no minimum TTL in etcd v2 (at least we could have a TTL lower than 5 seconds)? What changed so that etcd v3 now has a minimum TTL of 5 seconds? Is this because in etcd v3 keep-alives are only processed by the leader and no consensus is needed anymore?

Remember that we need to refresh the TTL once there is a leader switch. The minimum TTL needs to be larger than the election timeout, which is usually 3 seconds or even 5 seconds.

According to this link, electionTimeout can be configured. https://github.com/coreos/etcd/blob/master/Documentation/op-guide/configuration.md#--election-timeout

heyitsanthony commented 8 years ago

@mauiarches etcd3 can guarantee that leases won't expire on leader failover (cf. sections 2.8-2.9 of the Chubby paper); this can't be done with strict TTLs as in etcd2. Furthermore, the cluster has no way to respond to load by increasing a strict TTL the way it can with leases to get clients to back off. There is always going to be some delay with key expiry -- what sort of timing guarantees do you expect?

I agree the minimum lease time should probably be tied to the election timeout instead of hardcoded as it is now.

xiang90 commented 8 years ago

a basic question why in etcdv2 there is no minimum TTL

They work differently internally, and etcd2's TTL mechanism causes a few difficulties for the core functionality (leases, leader election, locking), since it does not refresh TTLs once a new leader is elected.

janosi commented 8 years ago

@xiang90 @heyitsanthony We would like to use etcd as the backend of our service discovery solution; we used etcd v2, actually. In order to shorten the time frame during which requests go to dead instances, we would like to use e.g. a 2s TTL for our service-related records. In the case of e.g. ~3k requests/s, every second counts. The TTL would be refreshed every 1s (as usual, the refresh period is half of the TTL). We were happy to see the lease concept coming: a single lease for multiple records, a single TTL for multiple records. Refresh-related traffic is lowered; the benefit is clear. At least, according to some etcd v3 release materials, that was our understanding. So, now that the old TTL is gone and leases are not for this purpose... what should we do? Following the migration guide at 1, I would not have imagined such a big impact on TTL usage, actually.
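The traffic saving behind "single lease for multiple records" can be made concrete with back-of-the-envelope arithmetic. The numbers below (3000 records, 2s TTL, refresh at TTL/2) are illustrative values taken from this use case, and `renewalsPerSecond` is just a hypothetical helper, not an etcd API:

```go
package main

import "fmt"

// renewalsPerSecond returns the keep-alive messages per second needed to
// maintain `records` entries when each renewal covers `perRenewal` records
// and renewals happen every ttl/2 seconds (the usual refresh period).
func renewalsPerSecond(records, perRenewal int, ttlSeconds float64) float64 {
	renewals := float64(records) / float64(perRenewal)
	return renewals / (ttlSeconds / 2)
}

func main() {
	// etcd v2 style: each record carries its own 2s TTL.
	fmt.Println(renewalsPerSecond(3000, 1, 2)) // 3000 renewals/s
	// etcd v3 style: one lease attached to all 3000 records.
	fmt.Println(renewalsPerSecond(3000, 3000, 2)) // 1 renewal/s
}
```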

xiang90 commented 8 years ago

@janosi With the old TTL mechanism, the TTL key might stay around for longer than 2 seconds when there is an election; no TTL key can be removed when there is no leader. You just did not notice that, I guess. The only thing you would notice now is that if there is a leader election, your lease will be renewed. If you are running etcd inside one DC with a good network connection, and your etcd nodes have good hardware and resources, you can set the election timeout to 1 second or so. Then you can get the 2-second lease back. Also, I really do not think an additional 2 seconds of detection latency in the failure case matters a lot, if at all.

Good enough?

heyitsanthony commented 8 years ago

@janosi As of #6085 you can get a 1s minimum TTL by passing --election-timeout=600 as an argument to etcd (the minimum TTL is computed as ceil((3/2)*electionTimeout)).
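That formula can be checked with a quick sketch. This just follows the ceil((3/2)*electionTimeout) rule as stated in the comment; the exact rounding inside etcd may differ, and `minLeaseTTLSeconds` is a hypothetical name, not a function from the etcd codebase:

```go
package main

import (
	"fmt"
	"math"
)

// minLeaseTTLSeconds computes the minimum lease TTL in seconds from the
// election timeout in milliseconds, per ceil((3/2)*electionTimeout).
func minLeaseTTLSeconds(electionTimeoutMs int) int64 {
	return int64(math.Ceil(1.5 * float64(electionTimeoutMs) / 1000))
}

func main() {
	fmt.Println(minLeaseTTLSeconds(600))  // --election-timeout=600 -> 1s
	fmt.Println(minLeaseTTLSeconds(1000)) // default 1000ms         -> 2s
}
```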

janosi commented 8 years ago

@xiang90 @heyitsanthony Excuse me for the long delay. I read the Chubby paper, and I am trying to map it to etcd functions. I have some open questions; would you mind if I asked them here? First of all, it is not clear to me from the paper how a client can get a notification about the disappearance of another client. Let's say client "A" of Chubby does Open() on a new node in the Chubby server. If I consider service discovery, it means that client "A" announces its arrival, so the clients of client "A"'s service can start sending requests to client "A". Let's say client "B" gets a notification about the arrival of client "A" via the event notification mechanism of Chubby, so client "B" can start using client "A". Then client "A" disappears in a non-graceful way. I could not find what mechanism propagates this event to client "B". There are some hints about "ephemeral nodes", but it is also mentioned that their usage is very low in real life, while the usage of Chubby as a name service is pretty heavy. I cannot find the link: how is the failure of a client propagated to others if ephemeral nodes are not used?

In Chubby's case the "session lease" is between the clients and the master, and the clients and the master use KeepAlive messages to refresh the lease's "TTL". But my understanding is that clients of etcd are not the same as clients of Chubby; at least I cannot see the similarity. I mean, the mechanism by which the etcd lease TTL is refreshed is different from how the Chubby session lease TTL is refreshed, and actually even the creation of an etcd lease is different from Chubby's. Is it the case that in the etcd core there is a Chubby implementation, and around that there is a wrapper layer that translates the etcd REST API to Chubby calls? And inside etcd you use e.g. the Chubby KeepAlive mechanism as described in the paper?

Excuse me for the questions, but I like understanding things in depth, so I can understand the whys of the design. Thank you!

xiang90 commented 8 years ago

@janosi Short answer is:

Chubby ties the session to each client, so it has a one-to-one mapping between a client and a session.

In etcd, we remove this limitation. A lease is a logical object: one client can maintain multiple leases, and a lease can be maintained by multiple clients.

I will answer the other Chubby-related questions later on.

xiang90 commented 8 years ago

There are some hints about "ephemeral nodes", but it is also mentioned that their usage is very low in real life, while the usage of Chubby as a name service is pretty heavy. I cannot find the link: how is the failure of a client propagated to others if ephemeral nodes are not used?

Watching an ephemeral node is the way to discover client failures. You can proxy all read requests through proxies, so watching might not be that heavy.
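The failure-propagation path described above can be sketched as a toy model. This is not etcd's (or Chubby's) implementation; `store`, `put`, `watch`, and `expire` are hypothetical names that just model the chain: a key attached to a lease disappears when the lease expires, and watchers on that key see a delete event.

```go
package main

import "fmt"

type event struct{ typ, key string }

// store is a toy key-value server: leases map to attached keys, and
// watchers receive events for keys they registered interest in.
type store struct {
	leases   map[int64][]string
	watchers map[string][]chan event
}

func newStore() *store {
	return &store{leases: map[int64][]string{}, watchers: map[string][]chan event{}}
}

// put attaches a key to a lease, like publishing an ephemeral node.
func (s *store) put(key string, leaseID int64) {
	s.leases[leaseID] = append(s.leases[leaseID], key)
}

// watch registers interest in events for a key.
func (s *store) watch(key string) chan event {
	ch := make(chan event, 1)
	s.watchers[key] = append(s.watchers[key], ch)
	return ch
}

// expire models the lease timing out because its owner stopped sending
// keep-alives (e.g. the client died non-gracefully): attached keys are
// deleted and their watchers are notified.
func (s *store) expire(leaseID int64) {
	for _, key := range s.leases[leaseID] {
		for _, ch := range s.watchers[key] {
			ch <- event{typ: "DELETE", key: key}
		}
	}
	delete(s.leases, leaseID)
}

func main() {
	s := newStore()
	s.put("/services/a", 42)     // client "A" announces itself under lease 42
	ch := s.watch("/services/a") // client "B" watches the announcement
	s.expire(42)                 // client "A" dies; lease 42 times out
	ev := <-ch
	fmt.Println(ev.typ, ev.key) // DELETE /services/a
}
```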

I mean, the mechanism by which the etcd lease TTL is refreshed is different from how the Chubby session lease TTL is refreshed, and actually even the creation of an etcd lease is different from Chubby's. Is it the case that in the etcd core there is a Chubby implementation, and around that there is a wrapper layer that translates the etcd REST API to Chubby calls? And inside etcd you use e.g. the Chubby KeepAlive mechanism as described in the paper?

The Chubby client has an internal mechanism to create, revoke, and keep alive a lease. As I mentioned, each Chubby client keeps one lease and exposes it as a concept called a session. etcd is more flexible than that: each client can have multiple leases with different keys attached, and the same lease might be maintained by different clients. If you want a Chubby-style session in etcd, you can use clientv3/concurrency/session.

But overall, etcd and Chubby are similar in the way they keep the "lease" alive.

I think we can close this issue for now.