Kong Ingress Controller cannot scale beyond a limit

krish7919 commented 2 years ago

Current Behavior

Scenario:

Create 1500 Secrets in a namespace with approximately 1MB data in each of the Secret.

This was done to reproduce an issue that we faced in production where we had thousands of Secrets, with a cumulative data size of 1.8 GB. This test we do simulates this real-world scenario with ~1.5GB in the cluster.

Expected Behavior

Kong restarts should work fine, both the proxy and the ingress controller should come up and stay in Running state.

Steps To Reproduce

Create Secrets totalling ~1.5 GB of data, preferably in the same namespace where Kong is running. (I have a go module that does this, let me know if that helps and I will try to make it available)
Restart Kong.
Observe that Kong goes into an infinite CrashLoopBackoff

Kong Ingress Controller version

2.1.1, but should exist in main too. Going through the code makes me think that this is embedded deep inside the sig.k8s.io module which doesn't handle pagination effectively. Kong is running in DB-less mode.

Kubernetes version

$ kubectl --context test version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.15", GitCommit:"58178e7f7aab455bc8de88d3bdd314b64141e7ee", GitTreeState:"clean", BuildDate:"2021-09-15T19:23:02Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.15-eks-9c63c4", GitCommit:"9c63c4037a56f9cad887ee76d55142abd4155179", GitTreeState:"clean", BuildDate:"2021-10-20T00:21:03Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

Anything else?

Discussion in https://discuss.konghq.com/t/kong-ingress-crontroller-in-dbless-mode-error-failed-to-wait-for-secret-caches-to-sync-timed-out-waiting-for-cache-to-be-synced/9923.

shaneutt commented 2 years ago

Thank you for your report.

Could you provide the events and logs from the ingress controller deployment and pods for the crash loop to show the cause? Is it oomkiller?

Also in version v2.0.0+ we added the --watch-namespaces option which allows you to select specific namespaces to watch for [Ingress and other relevant] resources rather than watching all namespaces. It sounded like most of these secrets were not Kong relevant secrets (e.g. Credentials for Consumers) so I wanted to check whether using this to filter out namespaces with bulk Secrets might have a mitigating effect for you while we otherwise look for a long term fix?

krish7919 commented 2 years ago

Logs are as in the discussion link:

time="2022-03-08T17:23:57Z" level=debug msg="no configuration change, skipping sync to kong" subsystem=proxy-cache-resolver
I0308 17:23:57.491252       1 trace.go:205] Trace[1752425436]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.23.0-alpha.1/tools/cache/reflector.go:167 (08-Mar-2022 17:22:57.489) (total time: 60001ms):
Trace[1752425436]: [1m0.001703338s] [1m0.001703338s] END
E0308 17:23:57.491272       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.23.0-alpha.1/tools/cache/reflector.go:167: Failed to watch *v1.Secret: failed to list *v1.Secret: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 197; INTERNAL_ERROR; received from peer
time="2022-03-08T17:24:00Z" level=debug msg="no configuration change, skipping sync to kong" subsystem=proxy-cache-resolver
...
...
...
time="2022-03-08T17:24:57Z" level=debug msg="no configuration change, skipping sync to kong" subsystem=proxy-cache-resolver
time="2022-03-08T17:24:57Z" level=error msg="Could not wait for Cache to sync" error="failed to wait for secret caches to sync: timed out waiting for cache to be synced" logger=controller.secret reconciler group= reconciler kind=Secret
time="2022-03-08T17:24:57Z" level=error msg="Could not wait for Cache to sync" error="failed to wait for kongplugin caches to sync: timed out waiting for cache to be synced" logger=controller.kongplugin reconciler group=configuration.konghq.com reconciler kind=KongPlugin
time="2022-03-08T17:24:57Z" level=error msg="Could not wait for Cache to sync" error="failed to wait for kongconsumer caches to sync: timed out waiting for cache to be synced" logger=controller.kongconsumer reconciler group=configuration.konghq.com reconciler kind=KongConsumer
time="2022-03-08T17:24:57Z" level=info msg="context done: shutting down the proxy update server" subsystem=proxy-cache-resolver
time="2022-03-08T17:24:57Z" level=error msg="context completed with error" error="context canceled" subsystem=proxy-cache-resolver
I0308 17:24:57.489210       1 trace.go:205] Trace[1371614229]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.23.0-alpha.1/tools/cache/reflector.go:167 (08-Mar-2022 17:23:58.818) (total time: 58670ms):
Trace[1371614229]: [58.670670788s] [58.670670788s] END
time="2022-03-08T17:24:57Z" level=error msg="Could not wait for Cache to sync" error="failed to wait for ingress caches to sync: timed out waiting for cache to be synced" logger=controller.ingress reconciler group=networking.k8s.io reconciler kind=Ingress
time="2022-03-08T17:24:57Z" level=error msg="Could not wait for Cache to sync" error="failed to wait for service caches to sync: timed out waiting for cache to be synced" logger=controller.service reconciler group= reconciler kind=Service
time="2022-03-08T17:24:57Z" level=error msg="Could not wait for Cache to sync" error="failed to wait for tcpingress caches to sync: timed out waiting for cache to be synced" logger=controller.tcpingress reconciler group=configuration.konghq.com reconciler kind=TCPIngress
time="2022-03-08T17:24:57Z" level=error msg="Could not wait for Cache to sync" error="failed to wait for endpoints caches to sync: timed out waiting for cache to be synced" logger=controller.endpoints reconciler group= reconciler kind=Endpoints
time="2022-03-08T17:24:57Z" level=error msg="Could not wait for Cache to sync" error="failed to wait for kongclusterplugin caches to sync: timed out waiting for cache to be synced" logger=controller.kongclusterplugin reconciler group=configuration.konghq.com reconciler kind=KongClusterPlugin
time="2022-03-08T17:24:57Z" level=error msg="error received after stop sequence was engaged" error="failed to wait for kongconsumer caches to sync: timed out waiting for cache to be synced"
time="2022-03-08T17:24:57Z" level=error msg="Could not wait for Cache to sync" error="failed to wait for udpingress caches to sync: timed out waiting for cache to be synced" logger=controller.udpingress reconciler group=configuration.konghq.com reconciler kind=UDPIngress
time="2022-03-08T17:24:57Z" level=error msg="error received after stop sequence was engaged" error="failed to wait for ingress caches to sync: timed out waiting for cache to be synced"
time="2022-03-08T17:24:57Z" level=error msg="Could not wait for Cache to sync" error="failed to wait for kongingress caches to sync: timed out waiting for cache to be synced" logger=controller.kongingress reconciler group=configuration.konghq.com reconciler kind=KongIngress
time="2022-03-08T17:24:57Z" level=info msg="stop status update channel."
time="2022-03-08T17:24:57Z" level=error msg="error received after stop sequence was engaged" error="failed to wait for service caches to sync: timed out waiting for cache to be synced"
time="2022-03-08T17:24:57Z" level=error msg="error received after stop sequence was engaged" error="failed to wait for tcpingress caches to sync: timed out waiting for cache to be synced"
time="2022-03-08T17:24:57Z" level=error msg="error received after stop sequence was engaged" error="failed to wait for endpoints caches to sync: timed out waiting for cache to be synced"
time="2022-03-08T17:24:57Z" level=error msg="error received after stop sequence was engaged" error="failed to wait for kongclusterplugin caches to sync: timed out waiting for cache to be synced"
time="2022-03-08T17:24:57Z" level=error msg="error received after stop sequence was engaged" error="failed to wait for udpingress caches to sync: timed out waiting for cache to be synced"
time="2022-03-08T17:24:57Z" level=error msg="error received after stop sequence was engaged" error="failed to wait for kongingress caches to sync: timed out waiting for cache to be synced"
time="2022-03-08T17:24:57Z" level=error msg="error received after stop sequence was engaged" error="failed to wait for kongplugin caches to sync: timed out waiting for cache to be synced"
time="2022-03-08T17:24:57Z" level=error msg="error received after stop sequence was engaged" error="context canceled"
time="2022-03-08T17:24:57Z" level=error msg="error received after stop sequence was engaged" error="context canceled"
Error: failed to wait for secret caches to sync: timed out waiting for cache to be synced
Error: failed to wait for secret caches to sync: timed out waiting for cache to be synced

It is not an OOM, although if I keep adding Secrets, it sometimes goes to OOM.

I have already evaluated --watch-namespaces option and it is unfortunately not an option for us as we need to watch all namespaces in the cluster.

krish7919 commented 2 years ago

Might be related: https://github.com/helm/helm/pull/10715, which we hit in the same cluster with huge amounts of Secrets, and that was solved by using proper pagination.

shaneutt commented 2 years ago

Understood, thank you for all these great details on the issue. :bow:

krish7919 commented 2 years ago

@shaneutt Can we get a quick triage from the Kong team? I am already working on a fix, but have a hard time understanding the code flow, esp in client-go wrt SharedIndexInformer.

shaneutt commented 2 years ago

We may not be able to provide fully synchronous help at this moment, but if you have specific questions about the code please feel free to ask them here, OR we can continue in the Kubernetes Slack thread where we originally started talking about this.

shaneutt commented 2 years ago

@krish7919 and I discussed this at length in Slack: the problem does appear to occur due to the default configurations and behaviors of controller-runtime, and ultimately it culminates in this error:

https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/internal/controller/controller.go#L209

This appears to occur only during the manager.Client startup phase where the cache is initially populated (a running KIC which suddenly has Secrets applied to the cluster does not fail).

The next steps here are to determine if the cache.Start which triggers this can be configured to use ListOptions such as .Limit and .Continue to reduce the number of resources it tries to .List in a single request during this initialization. If it doesn't have these as tunables, then a PR against controller-runtime may need to be made to make it possible (potentially relevant: https://github.com/kubernetes-sigs/controller-runtime/issues/532). If it does have them, then a PR against the KIC to provide different defaults than what controller-runtime provides and/or flags to tune this via the manager may be warranted.

@krish7919 has said he's going to continue investigating this for the time being. @krish7919 let us know where your investigation takes you and if you need any further help from us :vulcan_salute:.

rainest commented 2 years ago

tl;dr I think you may need to add a new feature to client-go to control the page size of the controllers created by SharedInformers. What was the change mentioned here?

The default pagination in client-go to use the pager module is set to 500, I have reduced it to 100 and recompiled and tested. The results are the same and there is no improvement.

I'm not sure if that's the same thing I found or if it'd override it further upstream and have the same effect. You should be able to hard-code WatchListPageSize: <whatever> into the SharedInformer Run() to see if it works instead.

As far as the code, we don't manage the client-go instance that interacts with the Kubernetes API directly. We configure a controller-runtime Controller and let it handle the client operations/feed us the result. That's created in controller setup (from a code generator).

That Watch() adds a Kind source, which runs its Start() method when the controller manager starts. It waits for its cache to sync by calling the related client-go function. This just loops over a bunch of provided sync functions to confirm they're returning true. These functions are SharedInformer HasSynced() functions of each of the Kind sources.

That top-level HasSynced(), as I follow it, calls down through a few other levels of other HasSynced() functions before ultimately reaching the controller FIFO queue HasSynced(), though I didn't review that enough to explain the condition accurately.

Anyway, ignoring what that actually does, I noticed there's a WatchListPageSize option for cache.Controllers which controls the ListAndWatch() pager page size, but controller-runtime never calls cache.New(). It does call the informers Run() method though, and that creates a new controller. Sadly, the controller config is built from informer parameters in Run(), and informers don't provide a way to populate WatchListPageSize.

Afterword:

Increasing the Controller CacheSyncTimeout may help--not quite sure, it may just allow it to be broken longer.

Are the large Secrets content KIC is indeed supposed to ingest? I'd expect not, since individual credentials shouldn't be 1MB each, and I'm unsure how many realistically-sized credentials you'd need to hit that threshold. Our credentials implementation has a design flaw, where we use a kongCredType key to identify Secrets we should ingest, but should have instead used labels, as we can filter watches on those. However, changing to a label-based system is a breaking change, and we haven't planned a migration to that--this is more indication that we probably should.

krish7919 commented 2 years ago

As @shaneutt had requested in the Slack chat, here is a stack trace at the time the error happens:

E0401 08:43:07.341118       1 runtime.go:78] Observed a panic: &fmt.wrapError{msg:"stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 191; INTERNAL_ERROR; received from peer", err:http2.StreamError{StreamID:0xbf, Code:0x2, Cause:(*errors.errorString)(0xc00011fa60)}} (stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 191; INTERNAL_ERROR; received from peer)
goroutine 563 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1659e00, 0xc0004d6080})
    /workspace/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x2540be400})
    /workspace/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1659e00, 0xc0004d6080})
    /usr/local/go/src/runtime/panic.go:1038 +0x215
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch.func1(0xc000a0e000, 0xc000c99200, 0xc00031af60, 0xc000c82b00)
    /workspace/vendor/k8s.io/client-go/tools/cache/reflector.go:331 +0x94e
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc000a0e000, 0xc00031af60)
    /workspace/vendor/k8s.io/client-go/tools/cache/reflector.go:374 +0x345
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
    /workspace/vendor/k8s.io/client-go/tools/cache/reflector.go:223 +0x26
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f188d06a3b0)
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0006d4190, {0x1a4aa40, 0xc000350ff0}, 0x1, 0xc00031af60)
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc000a0e000, 0xc00031af60)
    /workspace/vendor/k8s.io/client-go/tools/cache/reflector.go:222 +0x237
k8s.io/apimachinery/pkg/util/wait.(*Group).StartWithChannel.func1()
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:56 +0x22
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x88
panic: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 191; INTERNAL_ERROR; received from peer [recovered]
    panic: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 191; INTERNAL_ERROR; received from peer

Unfortunately, this is affecting our production and I cannot afford to spend more time on this right now. I will try to mitigate the risks by trying to refactor our code to avoid large number of Secrets (pull from S3 or soemthing as such). Anyway, it would be a long time to get an MR out, and then to get it merged in either way.

I will get back to this later if time permits.

shaneutt commented 2 years ago

Thanks for your reporting and digging @krish7919, I understand at this point you need to focus on mitigations and can't keep the focus on investigation and patching. We'll organize this as part of our regular work for maintainers going forward, but if you feel so inclined to take it on again before then just let us know. :vulcan_salute:

krish7919 commented 2 years ago

Why is this a severity/medium and no assignees? :)

Doesn't it affect the horizontal scalability claimed for Kong and KIC, which is a core feature for any customer/team looking at deploying and using Kong?

shaneutt commented 2 years ago

severity/high is most commonly used for "stop the world and get this fixed" type things. You could also think of "severity" in this context as "impact". Given that you are the only one to have reported this problem in the last year and you're using a workaround/mitigation for it and the additional context is that --watch-namespace restriction and KubeAPI configuration tuning are potentially valid mitigations as well: the impact is not the same as say, a bug that would break production for every user in every configuration and there's no known mitigations. I'm sorry of the "medium" language seems off-putting, we would be open to re-assessing the language we use for our work organization if it's not friendly or clear enough.

As for assignment nobody is working on this right at this moment but we will be pulling it into a future sprint to be worked on by maintainers and will endeavor to do so quickly. If a non-maintainer contributor wants to take it instead in the meantime we welcome that. If you're a Kong customer we would recommend you report this issue to our support team for their awareness and aid.

mflendrich commented 2 years ago

Acceptance criteria for a fix

[ ] Ensure that pagination happens in the initial List() call that controller-runtime performs on startup of KIC
[ ] Load-test KIC with 1.5GB of secrets to validate the fix

shaneutt commented 2 years ago

@pmalek the updates from maintainers in https://github.com/kubernetes-sigs/controller-runtime/issues/532 so far seem to suggest that we've stumbled upon a more fundamental issue with controller-runtime that wont be fixed by adding pagination in the way we thought initially.

I'm starting to think we might want to take a slightly different approach with this one: The fact that the initial listing can be so large that it trips the default api server limits does not appear at all specific to Kong, and in fact seems like it should be a problem for all controller-runtime based implementations as a whole.

Perhaps we should try and create a reproduction environment: an example controller from scratch with kubebuilder or operator-sdk which would also perform this initial listing, create a kind cluster, deploy a huge number of objects to the cluster, and then deploy our example operator to the cluster to reproduce this issue outside of Kong context.

If we can provide a clear list of reproduction steps that doesn't require anything Kong we should be able to report this as a new and separate issue upstream and bring it up for consideration at the next community sync to get people's heads thinking on it. Let me know your thoughts?

krish7919 commented 2 years ago

Perhaps we should try and create a reproduction environment: an example controller from scratch with kubebuilder or operator-sdk which would also perform this initial listing, create a kind cluster, deploy a huge number of objects to the cluster, and then deploy our example operator to the cluster to reproduce this issue outside of Kong context.

I am already stealing time at work and on my way to:

Create a KinD cluster with custom configs: set apiserver request timeout to 10s
Deploy our custom services + Kong on them
Have a tool that can create multiple Secrets in parallel to trigger this issue

As mentioned before, I can open source the tool to create Secrets in parallel, but (as of now) cannot open source the KinD setup scripts as they are too detailed and try to replicate our cloud deployments to a large extent. But I can strip this down later if needed, and open source a subset of work. Wrt not testing with Kong, I can create a new operator, but would try to avoid the work and just try to repro things with Kong and try to see if we can do something.

shaneutt commented 2 years ago

Perhaps we should try and create a reproduction environment: an example controller from scratch with kubebuilder or operator-sdk which would also perform this initial listing, create a kind cluster, deploy a huge number of objects to the cluster, and then deploy our example operator to the cluster to reproduce this issue outside of Kong context.

I am already stealing time at work and on my way to:
1. Create a KinD cluster with custom configs: set apiserver request timeout to 10s

2. Deploy our custom services + Kong on them

3. Have a tool that can create multiple Secrets in parallel to trigger this issue
As mentioned before, I can open source the tool to create Secrets in parallel, but (as of now) cannot open source the KinD setup scripts as they are too detailed and try to replicate our cloud deployments to a large extent. But I can strip this down later if needed, and open source a subset of work. Wrt not testing with Kong, I can create a new operator, but would try to avoid the work and just try to repro things with Kong and try to see if we can do something.

Awesome, thanks for working on that. As for the setup being replications of your cloud deployments: based on everything we know so far in this issue I don't think this would be necessary, it seems like the problem just happens due to large numbers of secrets with large payloads, that should be replicate-able in a generic environment setup don't you think?

pmalek commented 2 years ago

@krish7919 PTAL at my comment https://github.com/kubernetes-sigs/controller-runtime/issues/532#issuecomment-1141132416

I believe there might be a bigger issue related to how a big number of secrets (Only? Or is this related to ConfigMaps as well?) with relatively big payloads (1MB in my experiments) are causing high CPU and memory usage on apiserver which (in turn) slows down the communication with the cluster.

krish7919 commented 2 years ago

I do not think it is a CPU or a memory issue, it is purely a timeout issue and the inefficient listing of Secrets. As I mentioned above, I can reproduce this issue at will now with a KinD cluster running locally with a request timeout set to 5s and by creating ~25-40 Secrets each with 1MB of data.

I am still trying to understand what the other issues are saying and why a direct etcd listing would be worse or why pagination wouldn't work. I am still thinking that it should be solved using pagination (as I solved a similar issue in Helm as mentioned in https://github.com/Kong/kubernetes-ingress-controller/issues/2382#issuecomment-1084899198 above)

Maybe my knowledge is limited here as to what Kong Ingress Controller does special here, and why it (or client-go) cannot use pagination.

pmalek commented 2 years ago

Unfortunately my knowledge is still a bit limited w.r.t this issue as well.

I'm not going to work actively on this issue for now but PTAL at my comment in the other issue where I've compared using page list and non-page list without too much difference.

shaneutt commented 2 years ago

Let us know if you're still planning on digging in here @krish7919 and any findings you uncover.

krish7919 commented 2 years ago

I am blocked on this as I cannot figure out the flow on the controller-runtime side of things as the comment above suggested. Pagination does not seem to work either.

krish7919 commented 2 years ago

Here's a summary of my analysis so far. As mentioned before, I had a hunch that this is a pagination issue and adding pagination on the client side will fix the issue. This doesn't seem to be the case.

The workflow that Kubernetes, from my understanding, is as follows:

The client sends a request for the secrets like so GET /api/v1/secrets?limit=500&resourceVersion=0.
However, for pagination to work, the apiserver reads all the secrets from etcd and then stores them in memory.
This full dump of Secret data is then paginated to the client.

What is actually happening here:

The client (here the Kong Ingress Controller) sends a request for the secrets like so GET /api/v1/secrets?limit=500&resourceVersion=0.
The apiserver cannot read the Secrets from etcd due to the scale issue. Cluster has ~1.5GB of Secret data. And KIC needs to read this at startup to keep its local cache store in sync with data in etcd.
Even though the client can specify pagination (with say a limit of 20), it is the apiserver which times out as it cannot read the data from etcd in the default timeout period; 60s by default in our production Kubernetes cluster (using AWS EKS), and 5s in my local tests using Kind.
This causes the KIC to go in a crash loop and never recover.

The stack trace from kube-apiserver is as follows:

2022-03-08 17:37:27.000,"E0308 17:37:27.074039      10 runtime.go:78] Observed a panic: &errors.errorString{s:""killing connection/stream because serving request timed out and response had been started""} (killing connection/stream because serving request timed out and response had been started)
goroutine 872445776 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x3f96260, 0xc000698a10)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc012ef9c98, 0x1, 0x1)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
panic(0x3f96260, 0xc000698a10)
    /usr/local/go/src/runtime/panic.go:969 +0x1b9
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc0b8155120, 0xc01fc2fe00)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:257 +0x1bc
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc01c7910a0, 0x5171da0, 0xc167a1e000, 0xc09bd44300)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:141 +0x2f3
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1(0x5171da0, 0xc167a1e000, 0xc09bd44200)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/waitgroup.go:59 +0x137
net/http.HandlerFunc.ServeHTTP(0xc01c7955f0, 0x5171da0, 0xc167a1e000, 0xc09bd44200)
    /usr/local/go/src/net/http/server.go:2042 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1(0x5171da0, 0xc167a1e000, 0xc09bd44100)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/requestinfo.go:39 +0x269
net/http.HandlerFunc.ServeHTTP(0xc01c795620, 0x5171da0, 0xc167a1e000, 0xc09bd44100)
    /usr/local/go/src/net/http/server.go:2042 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAuditAnnotations.func1(0x5171da0, 0xc167a1e000, 0xc09bd44000)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/audit_annotations.go:37 +0x142
net/http.HandlerFunc.ServeHTTP(0xc01c7910c0, 0x5171da0, 0xc167a1e000, 0xc09bd44000)
    /usr/local/go/src/net/http/server.go:2042 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithWarningRecorder.func1(0x5171da0, 0xc167a1e000, 0xc03bb7df00)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/warning.go:35 +0x1a7
net/http.HandlerFunc.ServeHTTP(0xc01c7910e0, 0x5171da0, 0xc167a1e000, 0xc03bb7df00)
    /usr/local/go/src/net/http/server.go:2042 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithCacheControl.func1(0x5171da0, 0xc167a1e000, 0xc03bb7df00)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/cachecontrol.go:31 +0xa8
net/http.HandlerFunc.ServeHTTP(0xc01c791100, 0x5171da0, 0xc167a1e000, 0xc03bb7df00)
    /usr/local/go/src/net/http/server.go:2042 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.WithLogging.func1(0x5164460, 0xc027936fd0, 0xc03bb7de00)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:91 +0x2f2
net/http.HandlerFunc.ServeHTTP(0xc01c791120, 0x5164460, 0xc027936fd0, 0xc03bb7de00)
    /usr/local/go/src/net/http/server.go:2042 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1(0x5164460, 0xc027936fd0, 0xc03bb7de00)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/wrap.go:51 +0xe6
net/http.HandlerFunc.ServeHTTP(0xc01c791140, 0x5164460, 0xc027936fd0, 0xc03bb7de00)
    /usr/local/go/src/net/http/server.go:2042 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc01c795650, 0x5164460, 0xc027936fd0, 0xc03bb7de00)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/handler.go:189 +0x51
net/http.serverHandler.ServeHTTP(0xc0087190a0, 0x5164460, 0xc027936fd0, 0xc03bb7de00)
    /usr/local/go/src/net/http/server.go:2843 +0xa3
net/http.initALPNRequest.ServeHTTP(0x5177620, 0xc05cc33500, 0xc0f0096380, 0xc0087190a0, 0x5164460, 0xc027936fd0, 0xc03bb7de00)
    /usr/local/go/src/net/http/server.go:3415 +0x8d
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).runHandler(0xc0a0e46c00, 0xc027936fd0, 0xc03bb7de00, 0xc0b8154fa0)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:2152 +0x8b
created by k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).processHeaders
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:1882 +0x505"
2022-03-08 17:37:27.000,"E0308 17:37:27.074113      10 wrap.go:39] apiserver panic'd on GET /api/v1/secrets?limit=500&resourceVersion=0
http2: panic serving 10.11.12.150:47184: killing connection/stream because serving request timed out and response had been started
goroutine 872445776 [running]:
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).runHandler.func1(0xc027936fd0, 0xc012ef9f8e, 0xc0a0e46c00)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:2145 +0x16f
panic(0x3f96260, 0xc000698a10)
    /usr/local/go/src/runtime/panic.go:969 +0x1b9
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc012ef9c98, 0x1, 0x1)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x10c
panic(0x3f96260, 0xc000698a10)
    /usr/local/go/src/runtime/panic.go:969 +0x1b9
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc0b8155120, 0xc01fc2fe00)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:257 +0x1bc
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc01c7910a0, 0x5171da0, 0xc167a1e000, 0xc09bd44300)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:141 +0x2f3
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1(0x5171da0, 0xc167a1e000, 0xc09bd44200)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/waitgroup.go:59 +0x137
net/http.HandlerFunc.ServeHTTP(0xc01c7955f0, 0x5171da0, 0xc167a1e000, 0xc09bd44200)
    /usr/local/go/src/net/http/server.go:2042 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1(0x5171da0, 0xc167a1e000, 0xc09bd44100)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/requestinfo.go:39 +0x269
net/http.HandlerFunc.ServeHTTP(0xc01c795620, 0x5171da0, 0xc167a1e000, 0xc09bd44100)
    /usr/local/go/src/net/http/server.go:2042 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAuditAnnotations.func1(0x5171da0, 0xc167a1e000, 0xc09bd44000)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/audit_annotations.go:37 +0x142
net/http.HandlerFunc.ServeHTTP(0xc01c7910c0, 0x5171da0, 0xc167a1e000, 0xc09bd44000)
    /usr/local/go/src/net/http/server.go:2042 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithWarningRecorder.func1(0x5171da0, 0xc167a1e000, 0xc03bb7df00)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/warning.go:35 +0x1a7
net/http.HandlerFunc.ServeHTTP(0xc01c7910e0, 0x5171da0, 0xc167a1e000, 0xc03bb7df00)
    /usr/local/go/src/net/http/server.go:2042 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithCacheControl.func1(0x5171da0, 0xc167a1e000, 0xc03bb7df00)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/endpoints/filters/cachecontrol.go:31 +0xa8
net/http.HandlerFunc.ServeHTTP(0xc01c791100, 0x5171da0, 0xc167a1e000, 0xc03bb7df00)
    /usr/local/go/src/net/http/server.go:2042 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog.WithLogging.func1(0x5164460, 0xc027936fd0, 0xc03bb7de00)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:91 +0x2f2
net/http.HandlerFunc.ServeHTTP(0xc01c791120, 0x5164460, 0xc027936fd0, 0xc03bb7de00)
    /usr/local/go/src/net/http/server.go:2042 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1(0x5164460, 0xc027936fd0, 0xc03bb7de00)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/wrap.go:51 +0xe6
net/http.HandlerFunc.ServeHTTP(0xc01c791140, 0x5164460, 0xc027936fd0, 0xc03bb7de00)
    /usr/local/go/src/net/http/server.go:2042 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc01c795650, 0x5164460, 0xc027936fd0, 0xc03bb7de00)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/handler.go:189 +0x51
net/http.serverHandler.ServeHTTP(0xc0087190a0, 0x5164460, 0xc027936fd0, 0xc03bb7de00)
    /usr/local/go/src/net/http/server.go:2843 +0xa3
net/http.initALPNRequest.ServeHTTP(0x5177620, 0xc05cc33500, 0xc0f0096380, 0xc0087190a0, 0x5164460, 0xc027936fd0, 0xc03bb7de00)
    /usr/local/go/src/net/http/server.go:3415 +0x8d
k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).runHandler(0xc0a0e46c00, 0xc027936fd0, 0xc03bb7de00, 0xc0b8154fa0)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:2152 +0x8b
created by k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*serverConn).processHeaders
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/server.go:1882 +0x505"
2022-03-08 17:37:27.000,E0308 17:37:27.074450      10 writers.go:107] apiserver was unable to write a JSON response: http: Handler timeout
2022-03-08 17:37:27.000,"E0308 17:37:27.074469      10 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:""http: Handler timeout""}"
2022-03-08 17:37:27.000,E0308 17:37:27.075645      10 writers.go:120] apiserver was unable to write a fallback JSON response: http: Handler timeout
2022-03-08 17:37:27.000,"I0308 17:37:27.076749      10 trace.go:205] Trace[1088740983]: ""List"" url:/api/v1/secrets,user-agent:manager/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.11.12.150 (08-Mar-2022 17:36:27.074) (total time: 60002ms):
Trace[1088740983]: ---""Writing http response done"" count:9384 59993ms (17:37:00.076)
Trace[1088740983]: [1m0.002636616s] [1m0.002636616s] END"

I also looked at the pagination proposal for Kubernetes clients and apiserver here and the documentation here.

It specifically states that:

Some clients such as controllers, receiving a 410 error, may instead wish to perform a full LIST without chunking.

    Controllers with full caches
        Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results

But no matter whether we page or not at the client side, this will cause issues reading from etcd. The KEP also provide an Alternatives section at the end stating:

Alternatives

Compression from the apiserver and between the apiserver and etcd can reduce total network bandwidth, but cannot reduce the peak CPU and memory used inside the client, apiserver, or etcd processes.
...

So is this a kube-apiserver CPU and/or memory issue we are hitting here?

I am going to try to raise the limits in our various EKS clusters and perform a scale test again to see if it improves the situation in any way.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

pmalek commented 2 years ago

@krish7919 Thanks for your analysis.

it is the apiserver which times out as it cannot read the data from etcd in the default timeout period; 60s by default in our production Kubernetes cluster (using AWS EKS), and 5s in my local tests using Kind

Just for completeness, do you mind mentioning how are you setting this in your Kind cluster? Is this via runtimeConfig?

Commenting on your findings so far: it seems that for now we're in a position in which we can't really do anything in the controller code which would improve the experience in use cases like yours, unless there's a proposal upstream in k8s that would allow server side listing with pagination (between apiserver and etcd).

At this point, we'll be looking forward for your further results and we'll leave this open in case there are any developments that would allow us to improve the experience.

krish7919 commented 2 years ago

Yes, this is my Kind config for reference:

            kind_config = 
# Refer https://github.com/kind-ci/examples/blob/master/gitlab/kind-config.yaml
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
networking:
  apiServerAddress: "0.0.0.0"

# Add an extra `kube-apiserver` argument to set a custom request timeout
kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        request-timeout: {}s

# Add to the apiServer certSANs the name of the docker (dind) service in order
# to be able to reach the cluster through it
kubeadmConfigPatchesJSON6902:
  - group: kubeadm.k8s.io
    version: v1beta2
    kind: ClusterConfiguration
    patch: |
      - op: add
        path: /apiServer/certSANs/-
        value: host.docker.internal

nodes:
  - role: control-plane
    extraMounts:
      - containerPath: /var/lib/kubelet/config.json
        hostPath: {}
    extraPortMappings:
      # For postgres
      - containerPort: 32000
        hostPort: 5432
        listenAddress: 127.0.0.1
        protocol: TCP
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "{}-{}/node-type=general-worker"
    extraMounts:
      - containerPath: /var/lib/kubelet/config.json
        hostPath: {}
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "{}-{}/node-type=general-worker"
    extraMounts:
      - containerPath: /var/lib/kubelet/config.json
        hostPath: {}
""".format(
                apiserver_request_timeout_secs,
                ecr_auth_file.name,
                platform_name,
                env_name,
                ecr_auth_file.name,
                platform_name,
                env_name,
                ecr_auth_file.name,
            )

krish7919 commented 2 years ago

Commenting on your findings so far: it seems that for now we're in a position in which we can't really do anything in the controller code which would improve the experience in use cases like yours, unless there's a proposal upstream in k8s that would allow server side listing with pagination (between apiserver and etcd).

At this point, we'll be looking forward for your further results and we'll leave this open in case there are any developments that would allow us to improve the experience.

I would not say that you shouldn't do anything; being a company that ships a controller on Kubernetes, you can have the means/clout to drive this with the Kubernetes community OR put in resources to fix this from your end. :)

prateekgogia commented 2 years ago

When a client request uses this requestURI limit=500&resourceVersion=0 request is served from the watch cache in API server and not from etcd. API server will ignore the limit=500 and return all the objects without any pagination. Since, this response payload is too big, API server tries to compress the response (gzip) to the client. This is where API server adds the most delay in large response sizes.

One workaround, clients can try doing is to disableCompression from the client side, this way API server will not gzip the response back to the client and request should complete within the desired amount of time. Please note that this might increase the network bandwidth consumption for the client side as well as the API server side as large amount of data will be transmitted on the wire now.

krish7919 commented 2 years ago

Thanks for the pointer @prateekgogia (from the AWS EKS team) !

I can comfirm disabling compression after this issue is encountered helps fix this issue. This is my changeset:

--- a/internal/manager/run.go
+++ b/internal/manager/run.go
@@ -55,6 +55,8 @@ func Run(ctx context.Context, c *Config, diagnostic util.ConfigDumpDiagnostic) e

        setupLog.Info("getting the kubernetes client configuration")
        kubeconfig, err := c.GetKubeconfig()
+       // krish: disable compression here
+       kubeconfig.DisableCompression = true
        if err != nil {
                return fmt.Errorf("get kubeconfig from file %q: %w", c.KubeconfigPath, err)
        }
@@ -84,7 +86,7 @@ func Run(ctx context.Context, c *Config, diagnostic util.ConfigDumpDiagnostic) e
        if err != nil {
                return fmt.Errorf("unable to setup controller options: %w", err)
        }

This also touches on the scale aspect here, as:

if we had more CPU to compress this faster, then it would be better.
if we can optimise the compression ~~routing~~ routine in the API server, this would be better.

We reproduced this in our live AWS cluster (please note that this wasn't done in Kind/locally). KIC started crashing. We then updated KIC image to one built with the above changeset and the pods stabilised. We had close to ~2.1GB Secrets data at this point in time.

EDIT:

At this point, the least KIC can do is to provide a flag to disable compression of responses from the api-server so that such scale issues can be successfully mitigated by customers who hit them.

pmalek commented 2 years ago

That's pretty good news @krish7919 🎉

Given your findings I'd like to propose the following:

We'd like to follow this up with an implementation of a runtime/feature flag (yet to decide which one) which will either:
- disable compression on the whole client (as you've proposed in your diff)
- disable it only for the initial list and the recreate the client with the already established defaults (compression enabled)
This will require us to perform some tests w.r.t the impact this setting might have on KIC's runtime performance as a whole (not only looking at the initial list which we were discussing here)
If possible we'd like to propose that you continue doing your testing and let us know if you encounter any other issues

mflendrich commented 2 years ago

@pmalek could you please create a feature request issue for the runtime/feature flag, reference this issue in that FR issue, and close this? thank you :bow:

pmalek commented 2 years ago

@mflendrich Added an issue for implementation of the flag: #2675

pmalek commented 2 years ago

This is now going to be tracked in #2675 in order to implement a feature flag for disabling compression.

Kong / kubernetes-ingress-controller