Open ishan16696 opened 10 months ago
I discovered that merely setting the `--experimental-enable-lease-checkpoint` flag to `true` is not sufficient: the lease TTL can still be reset even after the lease has been checkpointed. For more details, please refer to issue https://github.com/etcd-io/etcd/issues/17132
To address this, it is also necessary to set the `--experimental-enable-lease-checkpoint-persist` flag to `true`. This should be done in conjunction with the flag mentioned in this issue, i.e. `--experimental-enable-lease-checkpoint`.
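As a minimal sketch of the combination described above, the two flags can be kept together so that one is never set without the other. The variable name `LEASE_CHECKPOINT_FLAGS` is hypothetical, and the rest of the etcd command line (data dir, peer URLs, etc.) is deliberately left out.

```shell
# Both flag names are taken from this issue; everything else about the
# etcd invocation is omitted on purpose.
# LEASE_CHECKPOINT_FLAGS is a hypothetical helper variable.
LEASE_CHECKPOINT_FLAGS="--experimental-enable-lease-checkpoint=true --experimental-enable-lease-checkpoint-persist=true"

# Dry run: print the command for review instead of starting etcd.
echo "etcd ${LEASE_CHECKPOINT_FLAGS}"
```

Printing the command first makes it easy to confirm that both flags are present before they are added to a real deployment.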
Interestingly, the `--experimental-enable-lease-checkpoint-persist` flag is not listed in the `etcd --help` output for etcd `v3.4.26` or `v3.5.9`.
~ > etcd --help | grep lease
--experimental-enable-lease-checkpoint 'false'
ExperimentalEnableLeaseCheckpoint enables primary lessor to persist lease remainingTTL to prevent indefinite auto-renewal of long lived leases.
As I mentioned, the flag `--experimental-enable-lease-checkpoint-persist` is missing from `etcd --help` in etcd versions `3.4.x` and `3.5.x`.
I have opened PRs to add this flag for the respective etcd versions: https://github.com/etcd-io/etcd/pull/17189 and https://github.com/etcd-io/etcd/pull/17190
Feature (What you would like to be added): It has been observed that a change of leadership in an etcd cluster (or a restart of etcd in a single-node cluster) resets/renews the TTL (time-to-live) of etcd leases, because `etcd` does not persist leases by default. If the etcd configuration sets the experimental flag `--experimental-enable-lease-checkpoint` to `true`, then the lessor, i.e. the etcd leader, persists the lease by writing a checkpoint to disk every `5mins`, so that a change of leadership (or a restart of etcd in a single-node cluster) does not reset/renew the lease TTL if `TTL > 5mins`. This prevents indefinite auto-renewal of a lease's TTL: https://github.com/etcd-io/etcd/issues/9888

But if we want to persist leases by setting `--experimental-enable-lease-checkpoint` to `true`, then before enabling it we should also analyse the write throughput, disk usage, etc. of persisting etcd's leases. This should not put extra load on our etcd, which already has an `8Gi` quota limit, and etcd is not a very write-optimised database.

Pre-requisite for persisting the leases by enabling the `--experimental-enable-lease-checkpoint` flag: the side effects of `--experimental-enable-lease-checkpoint` need to be understood. For example, using lease checkpoints adds a new raft log entry to the etcd cluster, which can cause a panic in the cluster if, for some reason, we want to downgrade it to an older etcd version: https://github.com/etcd-io/etcd/pull/10797

Finally, if every aspect has been cleared, we can proceed with enabling the `--experimental-enable-lease-checkpoint` flag in our etcds. Note: this concerns `etcd-events`, not our `etcd-main`.

Motivation (Why is this needed?): It has been observed in one of our live landscape clusters that events were generated with etcd leases with a TTL of `24hours`, but for some reason leadership changed within 24 hours. Each time leadership changed, the `etcd lease` TTL (time-to-live) values were reset/renewed by the new leader. The old leases were therefore not revoked, the total number of events accumulated, and etcd's performance degraded. In such a scenario we depend on restarts and leadership changes being infrequent; if leadership keeps changing within 24 hours, this leads to indefinite auto-renewal of lease TTLs, which can lead to an accumulation of events.

cc @istvanballok
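To make the 24-hour scenario concrete, here is a rough arithmetic sketch. It is a simplified model and not etcd's actual lessor code: it assumes that without checkpointing a new leader restarts the countdown from the full granted TTL, and that with checkpointing the countdown resumes from the last checkpoint. The function name `ttl_after_leader_change` is hypothetical, and the model assumes the leadership change happens before the lease would have expired.

```shell
# Simplified model (an assumption, not etcd's implementation):
# TTL in seconds that a new leader starts counting down from.
#   $1 = granted TTL in seconds
#   $2 = seconds elapsed since the grant when leadership changes
#   $3 = checkpoint interval in seconds, 0 = checkpointing disabled
ttl_after_leader_change() {
  ttl=$1; elapsed=$2; interval=$3
  if [ "$interval" -eq 0 ]; then
    # No checkpoint: the countdown restarts from the full TTL.
    echo "$ttl"
  else
    # With checkpoints: only the time since the last checkpoint is lost.
    checkpointed=$(( elapsed / interval * interval ))
    echo $(( ttl - checkpointed ))
  fi
}

# 24h lease (86400s), leadership changes after 23h07m (83220s).
ttl_after_leader_change 86400 83220 0     # prints 86400 (lease fully renewed)
ttl_after_leader_change 86400 83220 300   # prints 3300  (55 minutes left)
```

Under this model, a leadership change every few hours keeps a 24-hour lease alive indefinitely when checkpointing is off, while a 5-minute checkpoint interval bounds the extra lifetime to at most 5 minutes per leadership change.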
Approach/Hint to implement the solution (optional):