etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
47.4k stars 9.73k forks source link

Improving client call options flexibility and replace retry interceptor #16216

Open CaojiamingAlan opened 1 year ago

CaojiamingAlan commented 1 year ago

What would you like to be added?

1. Expose more client call options

Most of the gRPC call options of clientV3 cannot be configured by users (while some of them can, like MaxCallSendMsgSize) , and results in issues like

We should provide users with more flexibility to make RPC calls.

2. Replace the customized retry interceptor with RetryPolicy

Old gRPC versions do not have retry options, and etcd client implemented retry through the retry interceptor. #12671 had replaced the balancer logic with the gRPC solution, and left a TODO here. Maybe we can finish this replacement now.

3. (Maybe) Set defaultWaitForReady to True

I'm not sure about this. This is motivated by #14631. Actually, by my experiments, not only NewSession() but common usages like Get() may hang: whenever the call is directed to the SIGSTOP-ed server (this always happens since it uses round-robin), it hangs indefinitely waiting for the server to be ready. Neither did I find anywhere configures a default timeout.

It seems the motivation to set WaitForReady to True is for

  1. preventing retry requests from overwhelming the network #8122
  2. minimizing client request error responses due to transient failures

(FYI WaitForReady is the opposite of FailFast, with the latter one being deprecated)

But if once we have the retry and WaitForReady configs exposed, we can change the default behavior to fail fast, and users may turn it on manually, or set a proper retry limit.

I do see caveats of clientV2 but I think we may try to solve it in V3.

Why is this needed?

To provide more flexibility and reduce code volume

dusk125 commented 8 months ago

+1 for this. Something we see in OpenShift is that the API server can fail to retry on slower platforms (non-default LEADER_ELECTION_TIMEOUT & HEARTBEAT_INTERVAL) with an "etcdserver: leader changed" error. The server tries to set the retries to avoid this, but has them overwritten by the etcd retry.

Because of this, I would be happy to take on this work (specifically number 2 if the work needs to be broken up).

serathius commented 8 months ago

cc @ahrtr @serathius

ahrtr commented 8 months ago

There are 3 different feature requests, let's breakdown them into three different tickets.

For 1, I vaguely recall that someone raised a PR to resolve it about 1.5 years ago, and ptabor proposed a generic solution to let users configure all gRPC options. But I couldn't find the PR anymore. Anyway, it seems a valid feature.

For 2, Note both grpc/resolver and grpc/balancer are still experimental features/packages. We should try not to make any related change. But It's open to investigate and discuss.

There is a related issue https://github.com/etcd-io/etcd/issues/15145. But it should be fine because etcd's usege of the grpc-go's resolver package is really very simple and basic cc @dfawley

For 3, please add the context into https://github.com/etcd-io/etcd/issues/14631, and let's continue the discussion there.