Closed: njbennett closed this issue 4 years ago
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/175423091
The labels on this github issue will be updated when the story is started.
Hi @njbennett, that's a huge cluster :) Maybe it would be worth cross-pairing on this with one of us?
Sure, where/how should we co-ordinate that? I'm in PST so I assume we don't have many overlapping working hours, and I'll need a bit of heads up time to make sure the test cluster is in the appropriate state.
I think the best way to coordinate is the #eirini-dev channel in Slack.
Hi @njbennett, we've just had a look at this.
We've seen that memory usage of eirini-controller, task-reporter and event-reporter grows linearly with the number of pods in the cluster. It also drops when pods are deleted, so we don't think we are leaking any memory. The memory usage comes from the controller-runtime cache.
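For context, controller-runtime keeps every watched object in an in-process informer cache, so a controller that watches Pods holds every visible Pod in memory. The sketch below is illustrative only, not eirini-controller's actual code, and assumes a recent controller-runtime API:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
	// The manager owns a shared informer cache: every object type a
	// controller watches is listed and watched, and each object stays in
	// memory for the lifetime of the process.
	mgr, err := manager.New(ctrl.GetConfigOrDie(), manager.Options{})
	if err != nil {
		panic(err)
	}

	// A controller that watches Pods (directly, or indirectly via Owns)
	// makes the cache hold every Pod it can see, which is why memory
	// grows roughly linearly with the number of pods in the cluster.
	err = ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		Complete(reconcile.Func(func(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
			return reconcile.Result{}, nil
		}))
	if err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```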
We have made resource requests and limits configurable in the eirini helm values file. The defaults are fine for a small CF deployment, but they will need to be increased for larger deployments. For example, with 200 two-instance catnip apps deployed, eirini-controller memory usage is around 67MB.
Would bumping these resource limits as part of the cf-for-k8s deployment work for you (i.e. setting helm values for eirini, or applying YTT overlays)?
@njbennett we will close this issue for now. Feel free to re-open the issue in case you still encounter this problem.
Description
On the CAPI team we've been running scale tests with the goal of validating that cf-for-k8s can run up to 2000 app instances. When large numbers of apps are running on the cluster, the Eirini components with queues are repeatedly OOMKilled and enter CrashLoopBackOff.
Steps to reproduce
What was expected to happen
Either the cluster would work... or it would fail in a clear way
What actually happened
Push succeeds and apps appear to be running as StatefulSets. (Routing to pushed apps appears not to be working, but we don't think that's related to Eirini; we're still tracking down the cause and will report back.)
However, further deploys with kapp cannot complete successfully, because Eirini components with queues are repeatedly OOMKilled and then enter CrashLoopBackOff status. It's a little tricky to tell this has happened from e.g. k9s, because the pods spend most of their time in CrashLoopBackOff and only briefly show OOMKilled status, so you need to be inspecting them directly for a minute or two, or logging your cluster events somewhere. Additionally, there are minimal logs from the affected components.
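One way to catch the short-lived OOMKilled state programmatically is to look at each container's last termination reason rather than its current phase. This is a rough client-go sketch, not something from the Eirini repo; the "cf-system" namespace is an assumption about where the components run in a cf-for-k8s deployment:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// "cf-system" is an assumed namespace for the Eirini components.
	pods, err := clientset.CoreV1().Pods("cf-system").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			// OOMKilled only shows up briefly as the current state, but it
			// is preserved in LastTerminationState between restarts.
			t := cs.LastTerminationState.Terminated
			if t != nil && t.Reason == "OOMKilled" {
				fmt.Printf("%s/%s restarted %d times, last exit reason: %s\n",
					pod.Name, cs.Name, cs.RestartCount, t.Reason)
			}
		}
	}
}
```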
Suggested fix (optional)
Provide guidance for increasing memory limits for clusters running 2000 app instances.
Ideally though, the queueing components would at least emit warnings if their queues were getting unreasonably long.
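A rough sketch of the kind of warning we mean, using client-go's workqueue package: a periodic check on queue depth that logs once it passes a threshold. This is illustrative only, not Eirini's actual queue code; the helper name and threshold are made up.

```go
package main

import (
	"log"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// watchQueueDepth is a hypothetical helper that periodically logs a warning
// when a workqueue's backlog exceeds a threshold.
func watchQueueDepth(name string, q workqueue.Interface, threshold int, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for range ticker.C {
		if depth := q.Len(); depth > threshold {
			log.Printf("warning: %s queue depth is %d (threshold %d); the component may be falling behind or running low on memory",
				name, depth, threshold)
		}
	}
}

func main() {
	q := workqueue.NewNamed("example")
	go watchQueueDepth("example", q, 1000, 30*time.Second)

	// In a real component, informer event handlers would Add() items here
	// and worker goroutines would Get()/Done() them.
	select {}
}
```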
Additional information (optional)
We used the following script to generate load