etcd-cpp-apiv3

etcd-cpp-apiv3 is a C++ library for etcd's v3 client APIs (i.e., ETCDCTL_API=3).
BSD 3-Clause "New" or "Revised" License

Three-node etcd deployment: restarting one etcd node makes the client throw an exception #134

Open LiChengZiMu opened 2 years ago

LiChengZiMu commented 2 years ago

With a three-node etcd deployment, the client has TCP connections to all three etcd servers, so restarting a minority of the nodes should, in general, not surface an error to the application layer. gRPC provides the ChannelArguments::SetServiceConfigJSON interface to configure its retry-on-error behavior, as in the image below:

[image: screenshot of a gRPC service config with a retryPolicy]

I'd like to know whether there is a gRPC channel configuration that lets this client keep working reliably.

Thanks a lot.

sighingnow commented 2 years ago

How are you connecting? One client per etcd node?

A single client supports multiple endpoints; if one node fails it will automatically use the other two: https://github.com/etcd-cpp-apiv3/etcd-cpp-apiv3#multiple-endpoints
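
For reference, a minimal sketch of that multi-endpoint usage (the endpoint addresses and key are just placeholders taken from the test program further below):

#include "etcd/Client.hpp"

int main() {
  // One client configured with all three endpoints; requests fail over
  // between endpoints if one of them becomes unavailable.
  etcd::Client etcd(
      "http://127.0.0.1:2379,http://127.0.0.1:2389,http://127.0.0.1:2399");
  etcd.put("/test/key", "42").wait();
  return 0;
}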

sighingnow commented 2 years ago

You can also use SetServiceConfigJSON (https://grpc.github.io/grpc/cpp/classgrpc_1_1_channel_arguments.html#ae9399219c13808b45f3acad088fb0981) to achieve the effect shown in your image; the client's constructor accepts a ChannelArguments object.
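
For illustration, a minimal sketch of that suggestion, assuming the constructor overload that takes a grpc::ChannelArguments as mentioned above (the concrete retry-policy JSON the reporter tried is shown in the follow-up below):

#include "etcd/Client.hpp"

#include <grpc++/grpc++.h>

int main() {
  grpc::ChannelArguments grpc_args;
  // Enable gRPC's built-in retries; the actual retry policy is supplied
  // via a service config JSON (see the reporter's follow-up below).
  grpc_args.SetInt(GRPC_ARG_ENABLE_RETRIES, 1);
  grpc_args.SetServiceConfigJSON("{\"methodConfig\": []}");

  // Pass the channel arguments into the client's constructor.
  etcd::Client etcd(
      "http://127.0.0.1:2379,http://127.0.0.1:2389,http://127.0.0.1:2399",
      grpc_args);
  return 0;
}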

LiChengZiMu commented 2 years ago

You can also use SetServiceConfigJSON (https://grpc.github.io/grpc/cpp/classgrpc_1_1_channel_arguments.html#ae9399219c13808b45f3acad088fb0981) to achieve the effect shown in your image; the client's constructor accepts a ChannelArguments object.

1. Test code:

#include "etcd/Client.hpp"
#include "etcd/KeepAlive.hpp"
#include "etcd/Response.hpp"
#include "etcd/SyncClient.hpp"
#include "etcd/Value.hpp"
#include "etcd/Watcher.hpp"

#include <grpc++/grpc++.h>
#include <grpc++/security/credentials.h>

using grpc::Channel;

int main(int argc, char **argv) {
  std::string endpoints = "http://127.0.0.1:2379,http://127.0.0.1:2389,http://127.0.0.1:2399";
  etcd::Client etcd(endpoints);
  auto keepalive = etcd.leasekeepalive(5).get();
  auto lease_id = keepalive->Lease();

  std::cout << lease_id << std::endl;
  std::string value = std::string("192.168.1.6:1880") + argv[1];
  auto resp1 = etcd.campaign("/leader", lease_id, value).get();
  if (0 == resp1.error_code()) {
    std::cout << "became leader: " << resp1.index() << std::endl;
  } else {
    std::cout << "error code: " << resp1.error_code()
              << "error message: " << resp1.error_message() << std::endl;
    assert(false);
  }
  std::cout << "finish campaign" << std::endl;

  auto resp2 = etcd.leader("/leader").get();
  std::cout << resp2.value().as_string() << std::endl;
  std::cout << resp2.value().key() << std::endl;
  std::cout << "finish leader" << std::endl;

  while (true) {
    keepalive->Check();
  }

  return 0;
}

I deployed an etcd cluster with three instances on ports 2379, 2389, and 2399. When one etcd instance is restarted, the keepalive->Check() call in the test code above will sometimes throw an exception and terminate the program.

2. I tried setting the ChannelArguments as follows, but it had no effect:


std::string configJson =
    "{\"methodConfig\": [{"
    "\"name\": [{\"service\": \"etcdserverpb.Lease\", \"method\": \"LeaseKeepAlive\"}], "
    "\"retryPolicy\": {"
    "\"maxAttempts\": 5, "
    "\"initialBackoff\": \"0.1s\", "
    "\"maxBackoff\": \"1s\", "
    "\"backoffMultiplier\": 2.0, "
    "\"retryableStatusCodes\": [\"UNAVAILABLE\"]"
    "}}]}";
grpc_args.SetServiceConfigJSON(configJson);
grpc_args.SetInt(GRPC_ARG_ENABLE_RETRIES, 1);

sighingnow commented 2 years ago

I can reproduce the issue above.

Working on that.

sighingnow commented 2 years ago

KeepAlive fails to switch between subchannels, as the stream is stateful and cannot be replayed.

It could be fixed by opening a new stream every time the refresh happens.

sighingnow commented 2 years ago

It would be costly, but it provides tolerance for failures when multiple endpoints exist.

LiChengZiMu commented 2 years ago

KeepAlive fails to switch between subchannels, as the stream is stateful and cannot be replayed.

It could be fixed by opening a new stream every time the refresh happens.

OK, thank you very much.

sighingnow commented 2 years ago

See also: https://github.com/etcd-io/etcd/blob/main/client/v3/retry_interceptor.go

CoderSong2015 commented 2 years ago

https://github.com/etcd-cpp-apiv3/etcd-cpp-apiv3/pull/135/commits @sighingnow maybe we just need to renew the keepalive object to get a new stream when a keepalive exception occurs?

sighingnow commented 2 years ago

https://github.com/etcd-cpp-apiv3/etcd-cpp-apiv3/pull/135/commits @sighingnow maybe we just need to renew the keepalive object to get a new stream when a keepalive exception occurs?

Exactly. But doing so requires the client itself to be reconnectable, or at least to save the arguments required for reconnecting. I have a draft patch for that and will submit the pull request after finishing the testing of the retry logic for keep alive.

LazyPlanet commented 1 year ago

Hi, has this been resolved?

LazyPlanet commented 1 year ago

Any idea when this will be fixed?

sighingnow commented 1 year ago

I'll have to get to it next month.

LazyPlanet commented 1 year ago

I'll have to get to it next month.

You must have been busy lately; there hasn't been any news.

FAKERINHEART commented 4 months ago

The cause is simple: the keepalive is a bidirectional-streaming grpc_context. Once the connection fails, the grpc_context backing that keepalive fails with it. What you need to do is rebuild the KeepAlive object when an exception is received, so that a new grpc_context is created and requests can continue.
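
For illustration, a minimal sketch of that recovery pattern applied to the test program above (the lease id changes after rebuilding, so anything bound to the old lease, such as the campaign() call, would need to be re-registered as well; that part is omitted here):

#include "etcd/Client.hpp"
#include "etcd/KeepAlive.hpp"

#include <chrono>
#include <iostream>
#include <thread>

int main() {
  etcd::Client etcd(
      "http://127.0.0.1:2379,http://127.0.0.1:2389,http://127.0.0.1:2399");
  auto keepalive = etcd.leasekeepalive(5).get();

  while (true) {
    try {
      // Check() throws once the bidirectional stream behind the keepalive
      // (its grpc context) has failed, e.g. after an endpoint restart.
      keepalive->Check();
    } catch (const std::exception &e) {
      std::cerr << "keepalive broken, rebuilding: " << e.what() << std::endl;
      // Rebuild the KeepAlive object so that a fresh stream/context is used.
      keepalive = etcd.leasekeepalive(5).get();
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
  }
  return 0;
}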