cloudfoundry / eirini

Pluggable container orchestration for Cloud Foundry, and a Kubernetes backend
Apache License 2.0

On large clusters, eirini-controller, eirini-events, and eirini-task-reporter get OOMKilled #118

Closed njbennett closed 4 years ago

njbennett commented 4 years ago

Description

On the CAPI team we've been running scale tests with the goal of validating that cf-for-k8s can run up to 2000 app instances. When large numbers of apps are running on the cluster, eirini-controller, eirini-events, and eirini-task-reporter get OOMKilled.

Steps to reproduce

  1. deploy cf-for-k8s with the following parameters
    • GKE cluster with 100+ nodes
    • Eirini components scaled up to 10 replicas
  2. deploy 2000 apps (10 per space)

What was expected to happen

Either the cluster would work... or it would fail in a clear way

What actually happened

Push succeeds, and apps appear to be running as StatefulSets. (Routing to pushed apps appears not to be working, but we don't think that's related to Eirini; we're still tracking down the cause and will report back.)

However, further deploys with kapp cannot complete successfully, because the Eirini components with queues are repeatedly OOMKilled and then enter CrashLoopBackOff status. It's a little tricky to tell this has happened from e.g. k9s, because the pods spend most of their time in CrashLoopBackOff and only briefly show OOMKilled status, so you need to inspect them directly for a minute or two, or log your cluster events somewhere.

Additionally, there are minimal logs from the affected components.
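
For anyone hitting the same thing, a quick way to spot past OOM kills without watching k9s is to read each container's last termination reason from the pod status. This is only a sketch; the `eirini-core` namespace and the `list_oomkilled` helper name are our assumptions, not part of eirini:

```shell
# Sketch: list pods whose containers were last terminated with OOMKilled.
# The "eirini-core" namespace is an assumption; adjust for your deployment.
list_oomkilled() {
  # reads "<pod>\t<last termination reason>" lines on stdin,
  # keeps only the OOM-killed ones
  grep -w 'OOMKilled' || true
}

kubectl get pods -n eirini-core \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
  | list_oomkilled
```

Unlike watching live pod phases, `lastState.terminated.reason` persists across restarts, so the OOM kill is visible even while the pod sits in CrashLoopBackOff.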

Suggested fix (optional)

Provide guidance for increasing memory limits on clusters running 2000 app instances (AIs).

Ideally though, the queueing components would at least emit warnings if their queues were getting unreasonably long.
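
As a concrete illustration of the first suggestion, one way to raise a component's limit in place is `kubectl set resources`. The deployment name and `eirini-core` namespace here are assumptions, and the durable fix would be setting the corresponding values at deploy time rather than patching live:

```shell
# Hypothetical sketch: build the command that raises an eirini component's
# memory limit in place. Deployment name and "eirini-core" namespace are
# assumptions; prefer setting the deploy-time values for a durable fix.
bump_memory_limit_cmd() {
  local deployment="$1" limit_mb="$2"
  echo "kubectl -n eirini-core set resources deployment/${deployment} --limits=memory=${limit_mb}Mi"
}

# print the command rather than running it, so it can be reviewed first
bump_memory_limit_cmd eirini-controller 670
```

Note that changing resources edits the pod template, so the deployment rolls its pods; expect a brief restart of the affected component.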

Additional information (optional)

We used the following script to generate load

#!/usr/bin/env bash

set -euo pipefail

: "${NUMBER_OF_APPS:?}"

# don't change this without also changing scale_suite_test.go
# must be power of 10 (1, 100, 1000, etc)
APPS_PER_SPACE=10
CF_API="https://api.scale-testing.k8s.capi.land"

function login() {
    cf api --skip-ssl-validation "$CF_API"
    CF_USERNAME=admin CF_PASSWORD=$(yq -r '.cf_admin_password' "./cf-values.yml") cf auth
}

function prepare_cf_foundation() {
    cf enable-feature-flag diego_docker
    cf update-quota default -r 3000 -m 3000G
}

function deploy_apps() {
    org_name_prefix="scale-tests"
    space_name_prefix="scale-tests"

    # we subtract 1 here because `seq` is inclusive on both sides
    number_of_org_spaces="$((NUMBER_OF_APPS / APPS_PER_SPACE - 1))"
    number_of_apps_per_org_space="$((APPS_PER_SPACE - 1))"

    for n in $(seq 0 ${number_of_org_spaces})
    do
      org_name="${org_name_prefix}-${n}"
      space_name="${space_name_prefix}-${n}"
      cf create-org "${org_name}"
      cf create-space -o "${org_name}" "${space_name}"
      cf target -o "${org_name}" -s "${space_name}"

      for i in $(seq 0 ${number_of_apps_per_org_space})
      do
        name="bin-$((n * APPS_PER_SPACE + i))"
        echo "$name"
        cf push "$name" -m 128M -k 256M -i 2 -p ~/workspace/cf-acceptance-tests/assets/catnip -b paketo-buildpacks/go &
        # give CF time to start the push; it sometimes targets the next
        # org/space if we don't give it enough time
        sleep 5
      done
      wait
    done
}

function main() {
    curl -vvv --retry 300 -k "$CF_API"

    login
    prepare_cf_foundation
    deploy_apps
}

main

cf-gitbot commented 4 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/175423091

The labels on this github issue will be updated when the story is started.

herrjulz commented 4 years ago

Hi @njbennett, that's a huge cluster :) Maybe it would be worthwhile to cross-team pair on this with one of us?

njbennett commented 4 years ago

Sure, where/how should we coordinate that? I'm in PST, so I assume we don't have many overlapping working hours, and I'll need a bit of heads-up to make sure the test cluster is in the appropriate state.

herrjulz commented 4 years ago

I think the best way to coordinate is the #eirini-dev Slack channel.

kieron-dev commented 4 years ago

Hi @njbennett, we've just had a look at this.

We've seen that memory usage of eirini-controller, task-reporter and event-reporter grows linearly with the number of pods in the cluster. It also drops when pods are deleted, so we don't think we are leaking any memory. The memory usage comes from the controller-runtime cache.

We have resource requests and limits configurable in the eirini helm values file. The defaults are fine for a small CF deployment, but they will need to be increased for larger deployments. For example, with 200 two-instance catnip apps deployed, eirini-controller memory usage is around 67MB.
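
Since usage grows roughly linearly with pod count, the 67MB-at-400-pods data point above can be extrapolated to size limits for larger clusters. A rough sketch, assuming linearity holds at scale (the helper name is ours, and you should add headroom on top of the result):

```shell
# Rough linear extrapolation from the ~67MB observed at 400 pods
# (200 apps x 2 instances). Not an official sizing formula; leave
# extra headroom before using the result as a memory limit.
estimate_memory_mb() {
  local pods="$1"
  echo $(( pods * 67 / 400 ))
}

estimate_memory_mb 4000  # 2000 apps x 2 instances -> 670
```

For the 2000-app scale test (4000 pods) this suggests limits in the high hundreds of MB per component, well above the small-deployment defaults.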

Would bumping these resource limits as part of the cf-for-k8s deployment work for you (i.e. setting helm values for eirini, or applying YTT overlays)?

herrjulz commented 4 years ago

@njbennett we will close this issue for now. Feel free to re-open the issue in case you still encounter this problem.