registry/rpc: use simpleBalancer instead of ClientConn.State()

coreos / fleet

fleet ties together systemd and etcd into a distributed init system

Apache License 2.0

2.42k stars 302 forks source link

registry/rpc: use simpleBalancer instead of ClientConn.State() #1673

Closed dongsupark closed 7 years ago

dongsupark commented 8 years ago

As ClientConn.State() of gRPC has disappeared, we need to avoid using ClientConn.State(). Instead we should make use of gRPC rebalancer mechanism, just like etcdv3 is doing. To do that, introduce simpleBalancer, as a minimum structure to be used for grpc.Balancer.

Note that this change could be possibly invasive to the current behavior of fleet's gRPC. A careful review is needed.

/cc @hectorj2f Fixes https://github.com/coreos/fleet/issues/1672

hectorj2f commented 8 years ago

@dongsupark could add a bigger description about what the balancer does ? Also, this PR is a bit big. What do you think about a PR for the vendoring packages and another with the logic ? That would make the reviewing much easier and faster.

dongsupark commented 8 years ago

@hectorj2f Sure, I will split it into multiple PRs, and add more descriptions.

dongsupark commented 8 years ago

Force-pushed it by changing the following:

Fixed bugs in IsRegistryReady(). This function still needs to check r.Status() like before. It also needs to check if r.registryConn or r.balancer is nil.
Fixed a bug in rpcAcquireLeadership() by inserting reg.IsRegistryReady() to the condition for checking existing lease. This bug resulted in another failure in TestDynamicClusterNewMemberUnitMigration(), which got me locked up in a debugging nightmare. (Now it's solved)
Simplified balancer.go, removing unnecessary parts like getHost().
Added high-level descriptions on simpleBalancer.

And I created 2 PRs, https://github.com/coreos/fleet/pull/1676 (only vendor updates) and https://github.com/coreos/fleet/pull/1677 (relevant code), for easier reviews.

dongsupark commented 8 years ago

FYI. Running benchmarks with this PR included, I can observe performance regressions. Nomi benchmarks starting thousands of units over a multi-node cluster, unit starts take much longer time than the current master branch. I'm not sure why. (gRPC balancer issue? Bug in my patches?) Until we figured out its reason, let's not merge this PR.

dongsupark commented 7 years ago

Running benchmarks with this PR included, I can observe performance regressions. Nomi benchmarks starting thousands of units over a multi-node cluster, unit starts take much longer time than the current master branch.

I finally got some time to do nomi benchmarks again. This time, I cannot see the regression any more.

When tested with etcd v3, results of master branch and this branch are nearly the same. With etcd v2, this branch is even faster than master. That's actually an expected result, as it's already known that performance got improved with gRPC v1.0, compared to the previous versions.

What I saw in September was probably just because of mistake in testing. Tomorrow I'll run again benchmarks several times, and if it's still green, I'll merge this PR.

dongsupark commented 7 years ago

I can confirm that its performance is all right. With etcd2, unit start time was improved by 16%. With etcd3, it's nearly the same. Benchmark raw data is available on https://github.com/dongsupark/fleet-benchmarks. I'll merge this PR. Thanks!