etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Slow watchers impact PUT latency #18109

Open chaochn47 opened 5 months ago

chaochn47 commented 5 months ago

Bug report criteria

What happened?

While debugging k8s scale test failures, I found that applying mutating requests (write transactions) could be delayed by 1-2 seconds, which breaches the upstream SLO measured by the clusterloader2 test.

cc @hakuna-matatah

Latency of processing mutating API calls for single objects for every (resource, verb) pair, measured as 99th percentile over last 5 minutes

In default Kubernetes installation, for every (resource, verb) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day <= 1s

{
    "level": "info",
    "ts": "2024-05-30T07:25:09.038516Z",
    "caller": "traceutil/trace.go:171",
    "msg": "trace[734230629] transaction",
    "detail": "{req_content:compare:<target:MOD key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" mod_revision:66816841 > success:<request_put:<key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" value_size:577 >> failure:<request_range:<key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" > >; read_only:false; response_revision:66831091; number_of_response:1; applied-index:66997644; }",
    "duration": "1.146904799s",
    "start": "2024-05-30T07:25:07.891574Z",
    "end": "2024-05-30T07:25:09.038479Z",
    "steps": [
        "trace[734230629] 'register request to wait'  (duration: 14.096µs)",
        "trace[734230629] 'leader propose request'  (duration: 4.609235ms)",
        "trace[734230629] 'follower receive request'  (duration: 3.222613ms)",
        "trace[734230629] 'apply start'  (duration: 8.69631ms)",
        "trace[734230629] 'compare'  (duration: 22.423µs)",
        "trace[734230629] 'check requests'  (duration: 2.474µs)",
        "trace[734230629] 'get key's previous created_revision and leaseID' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 12.374µs)",
        "trace[734230629] 'marshal mvccpb.KeyValue' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 1.589µs)",
        "trace[734230629] 'store kv pair into bolt db' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 8.567µs)",
        "trace[734230629] 'attach lease to kv pair' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 243ns)",
        "trace[734230629] 'applied transaction'  (duration: 1.466µs)",
        "trace[734230629] 'end transaction'  (duration: 1.129944403s)"
    ],
    "step_count": 12
}

The slow "end transaction" step was caused by the watchableStore mutex being acquired and held by the syncWatchers process:

{
    "level": "info",
    "ts": "2024-05-31T08:06:25.099259Z",
    "caller": "traceutil/trace.go:171",
    "msg": "trace[1699572871] transaction",
    "detail": "{req_content:compare:<target:MOD key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" mod_revision:117957991 > success:<request_put:<key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" value_size:565 >> failure:<request_range:<key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" > >; read_only:false; response_revision:117967637; number_of_response:1; applied-index:118404076; }",
    "duration": "1.291462327s",
    "start": "2024-05-31T08:06:23.807769Z",
    "end": "2024-05-31T08:06:25.099231Z",
    "steps": [
        "trace[1699572871] 'apply start'  (duration: 389.561483ms)",
        "trace[1699572871] 'after lock watchableStore mutex'  (duration: 897.589474ms)"
    ],
    "step_count": 2
}
{
    "level": "warn",
    "ts": "2024-05-31T08:06:23.086653Z",
    "caller": "mvcc/watchable_store.go:230",
    "msg": "slow sync watchers process",
    "took": "905.879353ms",
    "expected-duration-threshold": "100ms"
}
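To make the contention pattern concrete, here is a minimal, self-contained Go sketch (not etcd's actual code) of what the traces above suggest: a periodic sync pass holds the same mutex that the write path's "end transaction" step needs, so the write stalls for roughly as long as the sync pass runs. The 900ms sleep is an arbitrary stand-in for a slow syncWatchers run.

```go
// Minimal sketch of the contention pattern described above; names and
// durations are illustrative, not etcd's implementation.
package main

import (
	"fmt"
	"sync"
	"time"
)

type store struct {
	mu sync.Mutex
}

// syncWatchers stands in for a slow catch-up pass that holds the lock for a
// long time (here simulated with a fixed 900ms sleep).
func (s *store) syncWatchers() {
	s.mu.Lock()
	defer s.mu.Unlock()
	time.Sleep(900 * time.Millisecond) // simulate reading many events from the backend
}

// put stands in for the tail of a write transaction: the "end transaction"
// step cannot complete until the store mutex becomes available.
func (s *store) put() time.Duration {
	start := time.Now()
	s.mu.Lock()
	defer s.mu.Unlock()
	return time.Since(start) // time spent waiting for the lock
}

func main() {
	s := &store{}

	go s.syncWatchers()                // sync pass kicks off and grabs the lock
	time.Sleep(10 * time.Millisecond)  // give it time to acquire the mutex

	waited := s.put()
	fmt.Printf("end transaction waited %v for the watchableStore-style mutex\n", waited)
}
```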

What did you expect to happen?

I would expect syncWatchers to complete within 100ms and not hold the lock for too long.

How can we reproduce it (as minimally and precisely as possible)?

I can work on a new benchmark command to simulate this. As long as enough writes (QPS and throughput) are pushed to etcd and a watch is established on the same key prefix, the issue can be reproduced.
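For illustration only, a rough sketch of such a reproduction using the etcd Go client, assuming a local endpoint at 127.0.0.1:2379; the key prefix, value size, and request count are made-up parameters, not the benchmark that was ultimately proposed.

```go
// Sketch: one prefix watch plus sustained PUT load on the same prefix,
// tracking the worst observed PUT latency.
package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// One watcher on the prefix, mirroring the single direct etcd watch on events.
	wch := cli.Watch(ctx, "/bench/", clientv3.WithPrefix())
	go func() {
		for range wch { // drain events; a slow consumer here makes things worse
		}
	}()

	// Sustained write load on the same prefix.
	value := strings.Repeat("x", 600) // roughly the lease value size from the trace
	var worst time.Duration
	for i := 0; i < 10000; i++ {
		key := fmt.Sprintf("/bench/key-%d", i%500)
		start := time.Now()
		if _, err := cli.Put(ctx, key, value); err != nil {
			panic(err)
		}
		if d := time.Since(start); d > worst {
			worst = d
		}
	}
	fmt.Println("worst observed PUT latency:", worst)
}
```

A real benchmark would record a latency histogram and report p99 rather than only the worst sample; the max is used here just to keep the sketch short.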

Anything else we need to know?

Exploring options

  1. Increase the hardcoded chanBufLen (currently 128) so that more events fit in a watcher's send buffer and watchers do not become unsynced in the first place.
  2. Optimize syncWatchers latency for each run, for example by limiting the number of events read from the backend per pass.
  3. Lower the frequency of running syncWatchers so end-to-end latency has less jitter and the p99 value improves.
  4. Optimize the throughput of func (sws *serverWatchStream) sendLoop() strategically.

Options 1 and 2 helped cut mutating request latency down to 0.2s in repeated runs of the same k8s scale test; option 3 did not.
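To make options 1 and 2 concrete, here is a simplified sketch of the mechanism they target; the names chanBufLen and maxEventsPerSync mirror the discussion, but the code is illustrative and not etcd's implementation. Events are delivered to a watcher over a bounded buffer; when the buffer overflows the watcher is marked unsynced and must be caught up later by a sync pass that reads from the backend while holding the shared lock. A larger buffer (option 1) means fewer watchers fall behind, and capping how much one sync pass reads (option 2) bounds how long the lock is held.

```go
// Simplified sketch of the mechanism behind options 1 and 2.
package main

import "fmt"

const (
	chanBufLen       = 128 // option 1: a larger buffer keeps more watchers synced
	maxEventsPerSync = 256 // option 2: cap work done per sync pass under the lock
)

type event struct{ rev int64 }

type watcher struct {
	ch       chan event // bounded send buffer
	unsynced bool       // set when the buffer overflows
	nextRev  int64      // next revision this watcher still needs
}

// notify runs on the write path. It must not block, so if the buffer is full
// the watcher is marked unsynced and left for the sync pass to catch up.
func notify(w *watcher, ev event) {
	select {
	case w.ch <- ev:
		w.nextRev = ev.rev + 1
	default:
		w.unsynced = true
	}
}

// syncPass replays backlog events to an unsynced watcher, but reads at most
// maxEventsPerSync of them so the shared lock is not held for too long.
func syncPass(w *watcher, backlog []event) {
	n := len(backlog)
	if n > maxEventsPerSync {
		n = maxEventsPerSync // leave the rest for the next pass
	}
	for _, ev := range backlog[:n] {
		w.ch <- ev
		w.nextRev = ev.rev + 1
	}
	w.unsynced = n < len(backlog)
}

func main() {
	w := &watcher{ch: make(chan event, chanBufLen), nextRev: 1}
	for rev := int64(1); rev <= 200; rev++ {
		notify(w, event{rev: rev}) // buffer overflows after chanBufLen events
	}
	fmt.Println("unsynced after burst:", w.unsynced, "next needed revision:", w.nextRev)
}
```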

Etcd version (please run commands below)

All supported etcd versions

Etcd configuration (command line flags or environment variables)

# paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

```console
$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints= endpoint status -w table
# paste output here
```

Relevant log output

No response

chaochn47 commented 5 months ago

Reading trace logs is difficult when the timestamps are out of order. This was improved by https://github.com/etcd-io/etcd/pull/15239, but we could do better with https://github.com/etcd-io/etcd/pull/18108.

serathius commented 5 months ago

Please link the scalability test failure you were debugging; I'm on the SIG-scalability oncall rotation and I haven't seen any failures. Please also note that until this is proven to be a regression, it cannot be treated as a bug. See https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-performance for Kubernetes 5k-node scalability testing results.

Performance improvements to slow watchers are discussed in https://github.com/etcd-io/etcd/issues/16839

chaochn47 commented 5 months ago

Thanks for looking into it. Our cluster setup is unique in that it does not segregate events from the main etcd cluster, whereas the existing upstream GCE and AWS 5k-node scale tests are currently passing consistently. I believe that once the --etcd-servers-overrides flag is removed, those tests would start to fail intermittently.

I think it's worth diving into the root cause of the performance degradation, as it would still be helpful when etcd write QPS/throughput needs to be raised in the future, for example by increasing the default work-queue concurrent-sync counts in kube-controller-manager, which would create more mutating requests to etcd, IIUC.

Please also note that until this is proven to be a regression, it cannot be treated as a bug

Agreed, +1. This is not a bug report per se, but rather a report of the circumstances under which etcd mutating requests can take more than 1s. I will come up with a benchmark test that simulates the traffic using only the etcd test framework.

hakuna-matatah commented 5 months ago

Please link the scalability test failure you were debugging; I'm on the SIG-scalability oncall rotation and I haven't seen any failures. Please also note that until this is proven to be a regression, it cannot be treated as a bug. See https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-performance for Kubernetes 5k-node scalability testing results.

@serathius Yes, these tests are run internally in a different setup mode, not upstream, which is why you wouldn't see them on testgrid. IIRC, I ran into this etcd performance issue on the upstream AWS/KOPS load tests in the past, when events were not split into a separate etcd cluster. After events were split here, I didn't see it appearing on the AWS/KOPS load tests, which run here periodically. I believe we should be able to reproduce the performance bottleneck on upstream tests as well if we don't split events, in case we want to.

The meta point here is to improve performance/throughput so we can stretch a single cluster a bit further than what it can do today.

serathius commented 5 months ago

Thanks for looking into it. Our cluster setup is unique in that it does not segregate events from the main etcd cluster, whereas the existing upstream GCE and AWS 5k-node scale tests are currently passing consistently. I believe that once the --etcd-servers-overrides flag is removed, those tests would start to fail intermittently.

I expect this is an issue of events not using the watch cache; however, I don't see a reason to invest in an area where K8s has an official mitigation. If you want to remove --etcd-servers-overrides, please start from a discussion in K8s.

chaochn47 commented 5 months ago

If you want to remove --etcd-servers-overrides, please start from a discussion in K8s.

Just to clarify, removing --etcd-servers-overrides is not the intention.

We are trying to figure out the root cause of why etcd's handling of mutating requests becomes slower than 1s under high write QPS/throughput. It just happened to surface on events, and it could also happen on other resources / key prefixes.

serathius commented 5 months ago

Right, the reason is simple: the event resource has the watch cache disabled, meaning it is still vulnerable to https://github.com/kubernetes/kubernetes/issues/123448. If you are not sharding K8s events out, they pollute your other resources.

chaochn47 commented 5 months ago

Right, the reason is simple: the event resource has the watch cache disabled, meaning it is still vulnerable to https://github.com/kubernetes/kubernetes/issues/123448. If you are not sharding K8s events out, they pollute your other resources.

Enabling the watch cache only protects against N direct etcd watches polluting other resources. The problem identified in this investigation is that there is only 1 direct etcd watch on events, and with enough event write throughput, it still pollutes other resources.

Hence my earlier statement:

It just happened to surface on events, and it could also happen on other resources / key prefixes.

I think the debate could be settled with a reproduction, since that makes it easier for us to evaluate each other's arguments.

chaochn47 commented 5 months ago

@serathius could you please take a look at the reproduction in https://github.com/etcd-io/etcd/pull/18121, since the work is based on your watch latency perf benchmark tool?

Edit: this is not a proper reproduction of the issue we saw in the clusterloader2 test.

Thanks!!

/cc @hakuna-matatah @mengqiy

serathius commented 5 months ago

I don't think there is anything surprising about slow watchers impacting PUT latency: etcd notifies synchronized watchers as part of committing the transaction in the apply loop. I don't expect we will change this anytime soon, as it is an assumption deeply ingrained in the code.