etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

defrag stops the process completely for noticeable duration #9222

Open · yudai opened this issue 6 years ago

yudai commented 6 years ago

What I observed

The defrag command blocks all requests handled by the process for a while when the process holds a lot of data and is receiving traffic. New requests and ongoing streaming requests such as Watch() can be canceled by their deadlines, and the cluster can become unstable.

Side effects

This "stop the world" behavior makes the operation of etcd clusters harder. We have the 8GB limitation of the DB size and need to control the DB size not to reach the border. However, clusters won't release DB space even after compaction until you run the defrag command. That means you cannot track this important metric without virtually stopping every etcd processe periodically.

(Btw, can we add a new metric for "actual DB usage"?)

CC/ @xiang90

gyuho commented 6 years ago

Yeah, we've been meaning to document this behavior.

Do you want to add some notes here https://github.com/coreos/etcd/blob/master/Documentation/op-guide/maintenance.md#defragmentation?

Thanks.

xiang90 commented 6 years ago

@gyuho

we probably want to change the behavior to non-stop-the-world.

gyuho commented 6 years ago

In the meantime,

(Btw, can we add a new metric for "actual DB usage"?)

This would be helpful. No objection.

xiang90 commented 6 years ago

@yudai @gyuho

hm... i am not sure if boltdb exposes those stats itself. need to double check.

yudai commented 6 years ago

@gyuho Thanks for the reference.

We hit an issue with this stop-the-world behavior on an older version of etcd, so I need to revalidate with the latest version. (I think we got some deadline exceeded errors.)

Currently, errors are supposed to be hidden by the internal retry mechanism, and clients should not get any errors from a process that is stopped for defrag. So if you run defrag on a single node at a time, you should not notice any problem. Is my understanding correct?

In that case, from an operational perspective, it would be enough if we could see the actual usage of the DB.

yudai commented 6 years ago

It seems my assumption above is not true.

I tested with etcd v3.1.14 servers (3-node cluster) and a test client app implemented with clientv3 from master (64713e5927). The test client app simply spawns 4 goroutines that issue Put() requests every 10 milliseconds. I triggered defrag on one node and it took a while, about 8 seconds (I'm using an HDD for this environment); some requests then simply returned context deadline exceeded errors. So it's not hidden by the retry mechanism.
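
A minimal sketch of such a test client (the endpoint, key names, and the go.etcd.io/etcd/client/v3 module path are assumptions):

```go
// Sketch of the test client described above: four goroutines issuing
// Put() every 10ms; errors (e.g. context deadline exceeded during a
// defrag) are logged.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for i := 0; i < 4; i++ {
		go func(id int) {
			for seq := 0; ; seq++ {
				ctx, cancel := context.WithTimeout(context.Background(), time.Second)
				_, err := cli.Put(ctx, fmt.Sprintf("/test/%d", id), fmt.Sprintf("val-%d", seq))
				cancel()
				if err != nil {
					log.Printf("worker %d: Put failed: %v", id, err)
				}
				time.Sleep(10 * time.Millisecond)
			}
		}(i)
	}
	select {} // run until interrupted
}
```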

User apps will most likely detect those errors and alert or start failure-handling procedures. So I assume it would be great if we could improve the retry mechanism for deadline exceeded errors from a node.

gyuho commented 6 years ago

The test client app simply spawns 4 goroutines that issue Put() requests every 10 milliseconds.

Put, as a mutable operation, won't be retried by our client balancer, in order to keep at-most-once semantics. But we could start by returning errors on in-progress defragmentation, so that users know requests are failing because of ongoing maintenance.

yudai commented 6 years ago

I see. It makes sense not to retry mutable operations.

A new option like WithRetryMutable() could be another solution. I think it's easier to add such an option to Put() calls than to triage errors in my code, which feels more error prone.

xiang90 commented 6 years ago

@yudai

as a first step, how about exposing the in-use/allocated stats of the database? if that ratio is low, a defrag is probably needed.

i checked bbolt, and it only exposes per-bucket info here: https://godoc.org/github.com/coreos/bbolt#Bucket.Stats. we need to aggregate all buckets to get a per-db view.
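
A rough sketch of that aggregation (the db path and the go.etcd.io/bbolt module path are assumptions; the file is opened read-only):

```go
// Sketch: aggregate per-bucket Stats() into a rough per-db "bytes in use"
// figure, which could back a metric like the one discussed here.
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("member/snap/db", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	inUse := 0
	err = db.View(func(tx *bolt.Tx) error {
		// Walk every top-level bucket and sum the bytes actually used
		// by its branch and leaf pages.
		return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
			s := b.Stats()
			inUse += s.BranchInuse + s.LeafInuse
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("approx bytes in use across all buckets: %d\n", inUse)
}
```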

do you want to help on exposing the metrics?

yudai commented 6 years ago

@xiang90 It would be a big help for operations. I'll take a look at the docs and see if I can add the metrics myself.

yudai commented 6 years ago

Created a PR.

I did some experiments and found that Stats.FreePageN is a good indicator of free space. You can calculate page-level actual usage using this value. PendingPageN is also part of the db's internal free list; however, those pages don't seem to be truly free (kept for rollback? you would always see 3 or 4 for this metric anyway), so I didn't include that value in the calculation.

Since boltdb allocates pages before the existing pages get full, and it also manages branch and leaf pages separately, byte-level actual usage didn't make much sense: the in-use/allocated ratio never reaches 1.
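
A sketch of that page-level calculation (the db path and the go.etcd.io/bbolt module path are assumptions): in-use bytes are estimated as the file size minus FreePageN times the page size, with PendingPageN ignored as described above:

```go
// Sketch: estimate page-level in-use bytes from FreePageN and the page size.
package main

import (
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	path := "member/snap/db" // placeholder path to the backend file
	db, err := bolt.Open(path, 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	fi, err := os.Stat(path)
	if err != nil {
		log.Fatal(err)
	}

	freeBytes := int64(db.Stats().FreePageN * db.Info().PageSize)
	total := fi.Size()
	inUse := total - freeBytes
	fmt.Printf("total=%dB free=%dB in-use=%dB (%.1f%%)\n",
		total, freeBytes, inUse, 100*float64(inUse)/float64(total))
}
```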

xiang90 commented 6 years ago

@yudai I can see two ways to solve the stop-the-world problem.

  1. automatic client migration before and after the defrag

Before a defrag starts, the etcd server needs to notify its clients and migrate them to other servers. After the defrag, the clients need to migrate back to the server for rebalancing purposes.

  2. make defrag concurrent internally

defrag is basically a database file rewrite. right now, to keep the old file and the one being rewritten consistent, we stop accepting writes into the old file. To make defrag concurrent, we need to either write to both the old and new files, or keep a copy of the new writes and flush it into the new file after the rewrite is finished.
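
Purely as a toy sketch of the second approach (none of this is etcd code; every name here is made up): buffer writes that arrive while the old file is being rewritten, then flush them into the new file once the rewrite finishes, so writers are only blocked briefly at the swap:

```go
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu        sync.Mutex
	current   map[string]string // stands in for the current db file
	rewriting bool
	buffer    map[string]string // writes accepted during the rewrite
}

func newStore() *store {
	return &store{current: map[string]string{}, buffer: map[string]string{}}
}

func (s *store) Put(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.rewriting {
		s.buffer[k] = v // keep a copy instead of blocking for the whole rewrite
		return
	}
	s.current[k] = v
}

func (s *store) Defrag() {
	s.mu.Lock()
	s.rewriting = true
	snapshot := make(map[string]string, len(s.current))
	for k, v := range s.current {
		snapshot[k] = v
	}
	s.mu.Unlock()

	newFile := rewriteCompacted(snapshot) // the long-running rewrite, done off-lock

	s.mu.Lock()
	for k, v := range s.buffer { // flush buffered writes into the new file
		newFile[k] = v
	}
	s.current, s.buffer = newFile, map[string]string{}
	s.rewriting = false
	s.mu.Unlock()
}

// rewriteCompacted stands in for copying live data into a fresh, packed file.
func rewriteCompacted(old map[string]string) map[string]string {
	out := make(map[string]string, len(old))
	for k, v := range old {
		out[k] = v
	}
	return out
}

func main() {
	s := newStore()
	s.Put("a", "1")
	s.Defrag()
	s.Put("b", "2")
	fmt.Println(s.current)
}
```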

xiang90 commented 6 years ago

/cc @jpbetz @wojtek-t

yudai commented 6 years ago

@xiang90

automatic client migration before and after the defrag

Do we have a control channel for server-initiated notifications now? If not, we might need to embed the notification in replies.

make defrag concurrent internally

This sounds like the more ideal but more complicated solution. What I have in mind is doing something similar to virtual machine live migration for defrag, just like you mentioned. Technically it should be feasible, but I can imagine some careful work is required. Or we could add something like a "forwarding mode" that makes the process redirect incoming requests to other processes, and activate it while running defrag: a process would work as a proxy while it's unresponsive due to defrag.

I rather like option 2, because client implementations wouldn't need to handle this situation themselves (I guess this kind of procedure is somewhat error prone to implement). I assume we don't want fat clients.

jpbetz commented 6 years ago

2 might not be all that hard to implement.

While fully defragmenting a bolt db file is easiest to implement by rewriting the entire file, there might be a 90%+ solution that just blocks writes (not reads) while performing a scan of the pages and applying some basic packing. Actually reducing the db file size would still require a subsequent stop-the-world operation to shrink the file, and a routine to recalculate the freelist as part of the shrink. But that pause should be almost imperceptible: no longer than the current stop-the-world pauses that already happen when expanding the file. That step might require some careful thought, but it seems doable. I would love to get to the point where bolt keeps statistics on the db file and triggers these kinds of "defragmentation" operations automatically as needed.

That said, having the ability that #1 provides (taking a member out of rotation for online maintenance) also sounds useful.

So I'm in favor of doing both, at least eventually :) I'd be happy to spend some time on #2 since I'm already spun up on that code.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

gyuho commented 4 years ago

Copying my comment from https://github.com/kubernetes/kubernetes/issues/93280

An important optimization in etcd is the separation of log replication (i.e., Raft) and the state machine (i.e., key-value storage), where a slow apply in the state machine only affects the local key-value storage without blocking replication itself. This presents a set of challenges for the defrag operation:

In the short term, we may reduce the upper bound from 5k to a lower number. Also, we can use the consistent index to apply pending entries and safely survive restarts. Right now, the consistent index is only used for snapshot comparison and for preventing re-applies.

Please correct me if I am wrong.

cenkalti commented 2 years ago

I have a question about the last bullet point of @gyuho's comment:

Even worse if the server with pending applies crashes: volatile states may be lost -- violates durability. Raft advances without waiting for state machine layer apply.

Are pending entries stored in memory, or do they get written to the raft log (on disk)?

chaochn47 commented 1 year ago

@cenkalti Pending entries are stored in memory, but they have also already been written to the WAL. Pending entries must be committed entries, which have been persisted on a majority of the nodes in the cluster.