etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

concurrency.NewSession hangs after etcd server is killed with SIGSTOP(19) #14631

Open haojinming opened 1 year ago

haojinming commented 1 year ago

What happened?

concurrency.NewSession hangs after the etcd server is killed with SIGSTOP(19).

What did you expect to happen?

NewSession should return an error after the server is killed.

How can we reproduce it (as minimally and precisely as possible)?

  1. Start three or more etcd server nodes.
  2. Run the program below.
  3. Send SIGSTOP to the etcd leader, e.g. `kill -19 $(pidof etcd)` on the leader node.
```go
package main

import (
    "fmt"
    "time"

    "github.com/pingcap/log"
    clientv3 "go.etcd.io/etcd/client/v3"
    "go.etcd.io/etcd/client/v3/concurrency"
    "go.uber.org/zap"
)

func initEtcdClient() *clientv3.Client {
    var client *clientv3.Client
    var err error
    endpoints := []string{"172.16.5.32:2379", "172.16.5.32:2382", "172.16.5.32:2384"}
    client, err = clientv3.New(clientv3.Config{
        Endpoints:            endpoints,
        DialTimeout:          5 * time.Second,
        DialKeepAliveTimeout: 5 * time.Second,
    })
    if err != nil {
        fmt.Printf("create client fail:%v\\n", err)
        log.Panic("create client fail", zap.Error(err))
    }
    return client
}

func main() {

    number := 0
    client := initEtcdClient()
    for {
        log.Info("create session begin.", zap.Int("time", number))
        s, err := concurrency.NewSession(client)
        if err != nil {
            log.Panic("create client fail", zap.Error(err))
        }
        log.Info("create session finish.", zap.Int("time", number))
        s.Close()
        number++
        time.Sleep(time.Second)
    }
}
```

Anything else we need to know?

If the etcd client is re-created after `kill -19`, it does return an error. However, in our application the client is created once at startup and reused for the whole lifecycle of the application.

Etcd version (please run commands below)

go.etcd.io/etcd/client/v3 v3.5.5

Etcd configuration (command line flags or environment variables)


Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)


Relevant log output

No response

haojinming commented 1 year ago

If the etcd leader is killed with SIGKILL(9), the following error log is printed and NewSession continues working after a new leader is elected.

{"level":"warn","ts":"2022-10-27T10:55:32.940+0800","logger":"etcd-client","caller":"v3@v3.5.5/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000257340/172.16.5.32:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}

I think it should have the same behaviour when the leader is killed with SIGSTOP, shouldn't it?

zeminzhou commented 1 year ago

@ahrtr How about adding a configuration item that controls the timeout of gRPC calls when creating the etcd client?

MatteoGioioso commented 1 year ago

I might be completely wrong here, but I think the cause of this behavior is the option defaultWaitForReady = grpc.WaitForReady(true) in go.etcd.io/etcd/client/v3@v3.5.5/options.go. Since it is not accessible, I tried setting it to false by temporarily modifying the file, and in my case the client no longer hangs.
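
For anyone who wants to see the difference in isolation, here is a minimal, standalone gRPC sketch of the WaitForReady semantics. This is not etcd code: the unused port 127.0.0.1:59999 is an assumption, and the gRPC health client is used only as a convenient generated stub to issue RPCs with different call options.

```go
// Not etcd code: a standalone demo of gRPC's WaitForReady call option.
// Assumes nothing is listening on 127.0.0.1:59999.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// grpc.Dial is non-blocking, so it succeeds even though the port is closed.
	cc, err := grpc.Dial("127.0.0.1:59999",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer cc.Close()

	hc := healthpb.NewHealthClient(cc)

	// Fail-fast (the gRPC default, WaitForReady(false)): the RPC fails with
	// Unavailable as soon as the connection attempt fails.
	_, err = hc.Check(context.Background(), &healthpb.HealthCheckRequest{},
		grpc.WaitForReady(false))
	fmt.Println("WaitForReady(false):", err)

	// WaitForReady(true), which the etcd client hardcodes: the RPC is queued
	// until the connection becomes READY. Without a context deadline it would
	// block indefinitely, which is the hang described in this issue.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	_, err = hc.Check(ctx, &healthpb.HealthCheckRequest{}, grpc.WaitForReady(true))
	fmt.Println("WaitForReady(true):", err)
}
```

With fail-fast, the call errors out as soon as the connection attempt fails, while the WaitForReady call sits queued until the context deadline fires, which is essentially what happens to the LeaseGrant inside NewSession.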

zeminzhou commented 1 year ago

Thanks, I changed grpc.WaitForReady to false and it works. But I don't understand why grpc.WaitForReady is hardcoded to true.

MatteoGioioso commented 1 year ago

I have no idea. I found an old issue that proposed exposing some of those options, but it went stale.

Here is the issue: https://github.com/etcd-io/etcd/issues/13344

halegreen commented 1 year ago

When SIGSTOP(19) is sent to the etcd leader, won't a new leader be elected?

haojinming commented 1 year ago

When SIGSTOP(19) is sent to the etcd leader, won't a new leader be elected?

Yes, the main program hangs without any log output or activity.

huangjiao-heart commented 1 year ago

Thanks, I changed grpc.WaitForReady to false and it works. But I don't understand why grpc.WaitForReady is hardcoded to true.

@zeminzhou I also encountered the same problem. How can I change this value in our code?

MatteoGioioso commented 1 year ago

I also encountered the same problem. How can I change this value in our code?

@huangjiao-heart unfortunately you can't, as it is a private variable; you would have to change it in the package itself.

To be clearer: you literally have to open the file where the variable is hardcoded and change it manually, or fork the repository and change it there.

AngstyDuck commented 1 year ago

I'm still quite new to this, but adding a timeout to the LeaseGrant request seems to raise the appropriate error when the servers are killed. Would there be any negative repercussions if we did so?

fuweid commented 1 year ago

There are two cases:

  1. If the etcd server has already been paused by SIGSTOP when the RPC is issued, the client waits for the connection to become ready because of WaitForReady. This could be solved by exporting the option.

  2. If the etcd server is paused after the connection is ready, the client waits for the HTTP/2 response forever. This should be handled with the ctx option when you create a new session.

I'm not sure about the reason for pausing the process; it freezes the process and impacts all established connections. If you want to stop the server, you should use SIGTERM or SIGKILL.
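
To make the second point concrete, here is a rough sketch of bounding only the initial LeaseGrant with a context deadline, built on the existing client.Grant and concurrency.WithLease APIs. As far as I can tell from session.go, a context passed via concurrency.WithContext is also reused for the session's keepalives, so granting the lease up front keeps the timeout from affecting the rest of the session. The helper name newSessionBoundedGrant is hypothetical, not an etcd API.

```go
// Sketch only: newSessionBoundedGrant is a hypothetical helper, not an etcd API.
package example

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func newSessionBoundedGrant(client *clientv3.Client, ttlSeconds int64) (*concurrency.Session, error) {
	// Bound the LeaseGrant RPC: with the cluster paused or unreachable this
	// fails with context.DeadlineExceeded instead of waiting behind WaitForReady.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	grant, err := client.Grant(ctx, ttlSeconds)
	if err != nil {
		return nil, err
	}

	// Hand the pre-granted lease to the session; NewSession then skips its own
	// Grant call and only starts the keepalive loop.
	return concurrency.NewSession(client, concurrency.WithLease(grant.ID))
}
```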

AngstyDuck commented 1 year ago

If you want to stop the server, you should use SIGTERM or SIGKILL.

This bug can be replicated if the server is stopped by SIGTERM or SIGKILL as well.

The freeze occurs during the creation of a new session, when the client attempts to send a LeaseGrant request to the server. Because the gRPC option WaitForReady is set to true, the RPC is queued until the connection is ready, without returning any error.

I understand from the provided documentation that this option is set to minimise error responses caused by transient failures. In that case, I'd like to propose setting a timeout on the LeaseGrant gRPC request specifically during the creation of new sessions. I'm thinking of exposing the timeout duration as a configurable value (tentatively named SessionCreationTimeout) in the Config object in etcd/client/v3/config.go. Do let me know if this has any negative side effects; otherwise I'll prepare a PR for this over the next few days :+1:
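
For what it's worth, until something like SessionCreationTimeout exists, an application can approximate it by racing NewSession against a timer. This is only a sketch (the helper name newSessionWithDeadline is made up), and note that on timeout the abandoned goroutine keeps waiting on the queued LeaseGrant until the connection recovers or the client is closed.

```go
// Sketch only: newSessionWithDeadline is a hypothetical helper, not an etcd API.
package example

import (
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func newSessionWithDeadline(client *clientv3.Client, d time.Duration) (*concurrency.Session, error) {
	type result struct {
		s   *concurrency.Session
		err error
	}
	ch := make(chan result, 1)
	go func() {
		s, err := concurrency.NewSession(client)
		ch <- result{s, err}
	}()

	select {
	case r := <-ch:
		return r.s, r.err
	case <-time.After(d):
		// If the session does eventually get created, close it so its lease
		// is revoked rather than leaked.
		go func() {
			if r := <-ch; r.err == nil {
				r.s.Close()
			}
		}()
		return nil, fmt.Errorf("etcd session creation timed out after %v", d)
	}
}
```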

fuweid commented 1 year ago

@AngstyDuck Yes. gRPC has a background picker goroutine that selects an available connection for new transports. I think exposing the option is good for users who want to disable waitForReady.