Closed dongsupark closed 7 years ago
@dongsupark could add a bigger description about what the balancer does ? Also, this PR is a bit big. What do you think about a PR for the vendoring packages and another with the logic ? That would make the reviewing much easier and faster.
@hectorj2f Sure, I will split it into multiple PRs, and add more descriptions.
Force-pushed it by changing the following:
IsRegistryReady()
. This function still needs to check r.Status()
like before. It also needs to check if r.registryConn
or r.balancer
is nil.rpcAcquireLeadership()
by inserting reg.IsRegistryReady()
to the condition for checking existing lease. This bug resulted in another failure in TestDynamicClusterNewMemberUnitMigration()
, which got me locked up in a debugging nightmare. (Now it's solved)getHost()
.And I created 2 PRs, https://github.com/coreos/fleet/pull/1676 (only vendor updates) and https://github.com/coreos/fleet/pull/1677 (relevant code), for easier reviews.
FYI. Running benchmarks with this PR included, I can observe performance regressions. Nomi benchmarks starting thousands of units over a multi-node cluster, unit starts take much longer time than the current master branch. I'm not sure why. (gRPC balancer issue? Bug in my patches?) Until we figured out its reason, let's not merge this PR.
Running benchmarks with this PR included, I can observe performance regressions. Nomi benchmarks starting thousands of units over a multi-node cluster, unit starts take much longer time than the current master branch.
I finally got some time to do nomi benchmarks again. This time, I cannot see the regression any more.
When tested with etcd v3, results of master branch and this branch are nearly the same. With etcd v2, this branch is even faster than master. That's actually an expected result, as it's already known that performance got improved with gRPC v1.0, compared to the previous versions.
What I saw in September was probably just because of mistake in testing. Tomorrow I'll run again benchmarks several times, and if it's still green, I'll merge this PR.
I can confirm that its performance is all right. With etcd2, unit start time was improved by 16%. With etcd3, it's nearly the same. Benchmark raw data is available on https://github.com/dongsupark/fleet-benchmarks. I'll merge this PR. Thanks!
As
ClientConn.State()
of gRPC has disappeared, we need to avoid usingClientConn.State()
. Instead we should make use of gRPC rebalancer mechanism, just like etcdv3 is doing. To do that, introducesimpleBalancer
, as a minimum structure to be used forgrpc.Balancer
.Note that this change could be possibly invasive to the current behavior of fleet's gRPC. A careful review is needed.
/cc @hectorj2f Fixes https://github.com/coreos/fleet/issues/1672