etcd-cpp-apiv3

etcd-cpp-apiv3 is a C++ library for etcd's v3 client APIs (i.e., ETCDCTL_API=3).
BSD 3-Clause "New" or "Revised" License

Three-node etcd deployment: restarting one etcd node makes the client throw an exception #134

Open LiChengZiMu opened 2 years ago

LiChengZiMu commented 2 years ago

With a three-node etcd deployment, the client has TCP connections to all three etcd servers, so restarting a minority of the nodes should, in general, not surface an error to the application layer. gRPC provides the ChannelArguments::SetServiceConfigJSON interface to configure its retry-on-error behavior, as in the image below:

[image: screenshot of a gRPC service config with a retryPolicy]

I'd like to know whether there is a gRPC channel configuration that lets this client keep working reliably.

Thanks a lot.

sighingnow commented 2 years ago

How are you connecting? One client per etcd node?

A single client supports multiple endpoints; if one node fails it will automatically use the other two: https://github.com/etcd-cpp-apiv3/etcd-cpp-apiv3#multiple-endpoints
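
For reference, a minimal sketch of that multi-endpoint usage (the endpoint addresses and key are just placeholders taken from the test program further below):

#include "etcd/Client.hpp"

int main() {
  // One client configured with all three endpoints; requests fail over
  // between endpoints if one of them becomes unavailable.
  etcd::Client etcd(
      "http://127.0.0.1:2379,http://127.0.0.1:2389,http://127.0.0.1:2399");
  etcd.put("/test/key", "42").wait();
  return 0;
}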

sighingnow commented 2 years ago

You can also use SetServiceConfigJSON (https://grpc.github.io/grpc/cpp/classgrpc_1_1_channel_arguments.html#ae9399219c13808b45f3acad088fb0981) to achieve the effect shown in your image; the client's constructor accepts a ChannelArguments object.
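
For illustration, a minimal sketch of that suggestion, assuming the constructor overload that takes a grpc::ChannelArguments as mentioned above (the concrete retry-policy JSON the reporter tried is shown in the follow-up below):

#include "etcd/Client.hpp"

#include <grpc++/grpc++.h>

int main() {
  grpc::ChannelArguments grpc_args;
  // Enable gRPC's built-in retries; the actual retry policy is supplied
  // via a service config JSON (see the reporter's follow-up below).
  grpc_args.SetInt(GRPC_ARG_ENABLE_RETRIES, 1);
  grpc_args.SetServiceConfigJSON("{\"methodConfig\": []}");

  // Pass the channel arguments into the client's constructor.
  etcd::Client etcd(
      "http://127.0.0.1:2379,http://127.0.0.1:2389,http://127.0.0.1:2399",
      grpc_args);
  return 0;
}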

LiChengZiMu commented 2 years ago

You can also use SetServiceConfigJSON (https://grpc.github.io/grpc/cpp/classgrpc_1_1_channel_arguments.html#ae9399219c13808b45f3acad088fb0981) to achieve the effect shown in your image; the client's constructor accepts a ChannelArguments object.

1. Test code:

#include "etcd/Client.hpp"
#include "etcd/KeepAlive.hpp"
#include "etcd/Response.hpp"
#include "etcd/SyncClient.hpp"
#include "etcd/Value.hpp"
#include "etcd/Watcher.hpp"

#include <grpc++/grpc++.h>
#include <grpc++/security/credentials.h>

using grpc::Channel;

int main(int argc, char **argv) {
  std::string endpoints = "http://127.0.0.1:2379,http://127.0.0.1:2389,http://127.0.0.1:2399";
  etcd::Client etcd(endpoints);
  auto keepalive = etcd.leasekeepalive(5).get();
  auto lease_id = keepalive->Lease();

  std::cout << lease_id << std::endl;
  std::string value = std::string("192.168.1.6:1880") + argv[1];
  auto resp1 = etcd.campaign("/leader", lease_id, value).get();
  if (0 == resp1.error_code()) {
    std::cout << "became leader: " << resp1.index() << std::endl;
  } else {
    std::cout << "error code: " << resp1.error_code()
              << "error message: " << resp1.error_message() << std::endl;
    assert(false);
  }
  std::cout << "finish campaign" << std::endl;

  auto resp2 = etcd.leader("/leader").get();
  std::cout << resp2.value().as_string() << std::endl;
  std::cout << resp2.value().key() << std::endl;
  std::cout << "finish leader" << std::endl;

  while (true) {
    keepalive->Check();
  }

  return 0;
}

I deployed an etcd cluster with three instances on ports 2379, 2389, and 2399. When one etcd instance is restarted, the keepalive->Check() call in the test code above will sometimes throw an exception and terminate the program.

2. I tried setting the ChannelArguments as follows, but it had no effect:


std::string configJson =
    "{\"methodConfig\": [{"
    "\"name\": [{\"service\": \"etcdserverpb.Lease\", \"method\": \"LeaseKeepAlive\"}], "
    "\"retryPolicy\": {"
    "\"maxAttempts\": 5, "
    "\"initialBackoff\": \"0.1s\", "
    "\"maxBackoff\": \"1s\", "
    "\"backoffMultiplier\": 2.0, "
    "\"retryableStatusCodes\": [\"UNAVAILABLE\"]"
    "}}]}";
grpc_args.SetServiceConfigJSON(configJson);
grpc_args.SetInt(GRPC_ARG_ENABLE_RETRIES, 1);

sighingnow commented 2 years ago

I can reproduce the issue above.

Working on that.

sighingnow commented 2 years ago

KeepAlive fails to switch between subchannels, as the stream is stateful and cannot be replayed.

It could be fixed by opening a new stream every time the refresh happens.

sighingnow commented 2 years ago

It would be costly, but it provides tolerance for failures when multiple endpoints exist.

LiChengZiMu commented 2 years ago

KeepAlive fails to switch between subchannels, as the stream is stateful and cannot be replayed.

It could be fixed by opening a new stream every time the refresh happens.

OK, thank you very much.

sighingnow commented 2 years ago

See also: https://github.com/etcd-io/etcd/blob/main/client/v3/retry_interceptor.go

CoderSong2015 commented 2 years ago

https://github.com/etcd-cpp-apiv3/etcd-cpp-apiv3/pull/135/commits @sighingnow maybe we just need to renew the keepalive object to get a new stream when a keepalive exception occurs?

sighingnow commented 2 years ago

https://github.com/etcd-cpp-apiv3/etcd-cpp-apiv3/pull/135/commits @sighingnow maybe we just need to renew the keepalive object to get a new stream when a keepalive exception occurs?

Exactly. But doing so requires the client itself to be reconnectable, or at least to save the arguments required for reconnecting. I have a draft patch for that and will submit the pull request after finishing the testing of the retry logic for keep alive.

LazyPlanet commented 1 year ago

Hi, has this been resolved?

LazyPlanet commented 1 year ago

Any idea when this will be fixed?

sighingnow commented 1 year ago

I'll have to get to it next month.

LazyPlanet commented 1 year ago

I'll have to get to it next month.

You must have been busy lately; there hasn't been any news.

FAKERINHEART commented 4 months ago

The cause is simple: the keepalive is a bidirectional-streaming grpc_context. Once the connection fails, the grpc_context backing that keepalive fails with it. What you need to do is rebuild the KeepAlive object when an exception is received, so that a new grpc_context is created and requests can continue.
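
For illustration, a minimal sketch of that recovery pattern applied to the test program above (the lease id changes after rebuilding, so anything bound to the old lease, such as the campaign() call, would need to be re-registered as well; that part is omitted here):

#include "etcd/Client.hpp"
#include "etcd/KeepAlive.hpp"

#include <chrono>
#include <iostream>
#include <thread>

int main() {
  etcd::Client etcd(
      "http://127.0.0.1:2379,http://127.0.0.1:2389,http://127.0.0.1:2399");
  auto keepalive = etcd.leasekeepalive(5).get();

  while (true) {
    try {
      // Check() throws once the bidirectional stream behind the keepalive
      // (its grpc context) has failed, e.g. after an endpoint restart.
      keepalive->Check();
    } catch (const std::exception &e) {
      std::cerr << "keepalive broken, rebuilding: " << e.what() << std::endl;
      // Rebuild the KeepAlive object so that a fresh stream/context is used.
      keepalive = etcd.leasekeepalive(5).get();
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
  }
  return 0;
}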