Open chaochn47 opened 5 months ago
Reading trace logs is difficult when the timestamps are out of order. This was improved by https://github.com/etcd-io/etcd/pull/15239, but we could do better with https://github.com/etcd-io/etcd/pull/18108.
Please link the scalability test failure you were debugging; I'm on the SIG-scalability oncall rotation and I haven't seen any failures. Please also note that until this is proved to be a regression, it cannot be treated as a bug. See https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-performance for the Kubernetes 5k-node scalability testing results.
Performance improvements to slow watchers are discussed in https://github.com/etcd-io/etcd/issues/16839
Thanks for looking into it. The cluster setup is unique in that it does not segregate events from the main etcd cluster. The existing upstream GCE and AWS 5k-node scale tests are passing consistently now; I believe that once the flag --etcd-servers-overrides is removed, the tests will start to fail intermittently.
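For reference, that split is configured on the kube-apiserver side. A hedged sketch of the flag (the etcd-events endpoints below are placeholders; the per-resource override format is group/resource#servers with semicolon-separated URLs) looks something like:

```sh
# Placeholder endpoints; route core-group events to a dedicated etcd cluster.
kube-apiserver \
  --etcd-servers=https://etcd-main-0:2379,https://etcd-main-1:2379 \
  --etcd-servers-overrides='/events#https://etcd-events-0:2379;https://etcd-events-1:2379'
```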
I think it's worth diving into the root cause of the performance degradation, as it will still be helpful when etcd write QPS/throughput needs to be pushed higher in the future, for example by increasing the default work queue concurrent sync counts in kube-controller-manager, which would create more mutating requests to etcd, IIUC.
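Purely as an illustration of that example (the values here are made up and higher than the defaults), the knobs in question are the per-controller concurrency flags of kube-controller-manager:

```sh
# Illustrative values only; each extra worker means more concurrent mutating requests to etcd.
kube-controller-manager \
  --concurrent-deployment-syncs=20 \
  --concurrent-replicaset-syncs=20 \
  --concurrent-service-syncs=5
```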
Please also note that until this is proved to be a regression, it cannot be treated as a bug.
Agreed, +1. This is not a bug but rather a report on the circumstances under which etcd mutating requests can take more than 1s. I will come up with a benchmark test that simulates the traffic with the etcd test framework only.
Please link the scalability test failure you were debugging; I'm on the SIG-scalability oncall rotation and I haven't seen any failures. Please also note that until this is proved to be a regression, it cannot be treated as a bug. See https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-performance for the Kubernetes 5k-node scalability testing results.
@serathius Yes, these tests are run internally, not upstream, in a different setup mode ^^^, which is why you wouldn't be seeing them on testgrid. I was running into this performance issue from etcd on the upstream AWS/kOps load tests (IIRC) in the past, when I didn't split events out into a separate etcd cluster. After events were split, I didn't see it appearing on the AWS/kOps load tests, which are run here periodically. I believe we should be able to reproduce the performance bottleneck on the upstream tests as well if we don't split events, in case we want to.
The meta point here is to improve the performance/throughput so we can stretch a single cluster a bit further than what it can do today.
Thanks for looking into it. The cluster setup is unique in that it does not segregate events from the main etcd cluster. The existing upstream GCE and AWS 5k-node scale tests are passing consistently now; I believe that once the flag --etcd-servers-overrides is removed, the tests will start to fail intermittently.
I expect this is an issue of events not using the watch cache; however, I don't see a reason to invest in an area where K8s has an official mitigation. If you want to remove --etcd-servers-overrides, please start with a discussion in K8s.
If you want to remove --etcd-servers-overrides, please start with a discussion in K8s.
Just to clarify, removing --etcd-servers-overrides is not the intention.
We are trying to figure out the root cause of why etcd's handling of mutating requests becomes slower than 1s under high write QPS/throughput. It just happened to show up on events, but it could also happen on other resources / key prefixes.
Right, the reason is simple: the event resource has the watch cache disabled, meaning it is still vulnerable to https://github.com/kubernetes/kubernetes/issues/123448. If you are not sharding K8s events out, they pollute your other resources.
Right, the reason is simple: the event resource has the watch cache disabled, meaning it is still vulnerable to https://github.com/kubernetes/kubernetes/issues/123448. If you are not sharding K8s events out, they pollute your other resources.
Enabling the watch cache only protects against N direct etcd watches polluting other resources. The problem identified in this investigation is that there is only one direct etcd watch on events, and yet, with enough event write throughput, it still pollutes other resources. Hence the following statement was raised:
It just happened to show up on events, but it could also happen on other resources / key prefixes.
I think the debate could be settled with a reproduction, since that would make it easier for us to understand each other's arguments.
@serathius could you please take a look at the reproduction in https://github.com/etcd-io/etcd/pull/18121, since the work is based on your watch latency perf benchmark tool?
Edit: this is not a proper reproduction of the issue we have seen in the clusterloader2 test.
Thanks!!
/cc @hakuna-matatah @mengqiy
I don't think there is anything surprising about slow watchers impacting PUT: etcd notifies synchronized watchers as part of committing the transaction in the apply loop. I don't think we will change that anytime soon, as it's an assumption deeply ingrained in the code.
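To make that coupling concrete, here is a deliberately simplified Go sketch of the pattern (a toy model, not etcd's actual code): the write-commit path has to take the same lock that a long watcher-sync pass holds, so the PUT cannot complete until the sync finishes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// watchableStore is a toy stand-in for the shared-lock pattern described above:
// the write-commit path and the background watcher sync contend on one mutex.
type watchableStore struct {
	mu sync.Mutex
}

// syncWatchers simulates a long catch-up pass over unsynced watchers
// that holds the store lock for its whole duration.
func (s *watchableStore) syncWatchers() {
	s.mu.Lock()
	defer s.mu.Unlock()
	time.Sleep(500 * time.Millisecond) // pretend there is a large backlog of events
}

// commitWrite simulates the end of a write transaction, which must notify
// synced watchers under the same lock before the PUT can return.
func (s *watchableStore) commitWrite() time.Duration {
	start := time.Now()
	s.mu.Lock()
	// notify synced watchers here
	s.mu.Unlock()
	return time.Since(start)
}

func main() {
	s := &watchableStore{}
	go s.syncWatchers()               // one iteration of the background sync loop
	time.Sleep(10 * time.Millisecond) // let syncWatchers grab the lock first
	fmt.Printf("PUT blocked for %v waiting on the store lock\n", s.commitWrite())
}
```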
Bug report criteria
What happened?
While debugging k8s scale test failures, I found that applying mutating requests (write transactions) could be delayed by up to 1-2 seconds, which breaches the upstream SLO in the clusterloader2 SLO measurement.
cc @hakuna-matatah
The slow end transaction step was caused by the watchableStore.mutex lock being acquired by the syncWatchers process.

What did you expect to happen?
I would like syncWatchers to complete within 100ms and not hold the lock for too long.

How can we reproduce it (as minimally and precisely as possible)?
I can work on a new benchmark command to simulate it. As long as we push enough writes (QPS and throughput) to etcd and have a watch established on the key prefix, the reproduction can be achieved.
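A minimal sketch of that idea using clientv3 (the endpoint, key prefix, value size, and threshold below are assumptions) could look like the following; in practice the writes would need to come from many concurrent clients to reach the throughput where the sync pass gets long enough to matter.

```go
package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Assumed local endpoint; point this at the test cluster instead.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()
	prefix := "/registry/events/" // hypothetical prefix standing in for k8s events

	// One watcher on the prefix, mirroring the single direct etcd watch on events.
	go func() {
		for resp := range cli.Watch(ctx, prefix, clientv3.WithPrefix()) {
			_ = resp // drain notifications; a slow consumer here makes things worse
		}
	}()

	// Write load: large-ish values in a tight loop, flagging any PUT slower than 1s.
	value := strings.Repeat("x", 16*1024)
	for i := 0; ; i++ {
		start := time.Now()
		if _, err := cli.Put(ctx, fmt.Sprintf("%s%d", prefix, i%10000), value); err != nil {
			panic(err)
		}
		if d := time.Since(start); d > time.Second {
			fmt.Printf("slow PUT: %v\n", d)
		}
	}
}
```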
Anything else we need to know?
Exploring options
Options 1 and 2 helped cut the mutating request latency down to 0.2s across repeated runs of the same k8s scale test; option 3 did not.
Etcd version (please run commands below)
All supported etcd versions
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response