cloudfoundry / cf-for-k8s

The open source deployment manifest for Cloud Foundry on Kubernetes
Apache License 2.0

Listing apps takes a very long time at scale #606

Closed: braunsonm closed this issue 3 years ago

braunsonm commented 3 years ago

Describe the bug

At larger scales, cf apps can take 5+ minutes, though it does eventually return. I'm not doing anything crazy: only about 25 apps in a space, but at times the command takes an extremely long time to return.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy a bunch of apps
  2. Run cf apps a few times and you should notice at least one of them will hang for a long time.
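
For example, a minimal reproduction loop might look like the sketch below (a sketch only; assumes you're already logged in and targeting an org/space, and the app name and source path are placeholders):

# push a batch of trivial apps, then time the listing a few times
for i in $(seq -w 1 25); do
  cf push "test-app-$i" -p path/to/app    # placeholder source path
done
for run in 1 2 3; do
  time cf apps
done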

Expected behavior

The apps listing should be quick every time, not just sometimes.

Additional context

cf-for-k8s SHA

v1.1.0

Cluster information

AKS

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/176398421

The labels on this github issue will be updated when the story is started.

jamespollard8 commented 3 years ago

Oh woah, this is fascinating and not yet part of our standard testing.

The first thing I'm interested in checking out is whether I'm able to easily reproduce this using a different IaaS.

Thanks @braunsonm for the report. How much does the run time tend to vary on that cf app command? In total, how many apps and app instances are currently running on your CF?

braunsonm commented 3 years ago

Hey James! It usually varies by a minute or so depending on when you run it.

Here are a couple of outputs I logged with time. Seems I was lucky here to get under 3 minutes! Sometimes it's more than 5.

cf apps  0.19s user 0.09s system 0% cpu 2:21.51 total
cf apps  0.23s user 0.09s system 0% cpu 2:12.40 total
cf apps  0.14s user 0.11s system 0% cpu 2:09.30 total
cf apps  0.14s user 0.10s system 0% cpu 2:19.94 total

We have ~33 apps deployed right now, slowly increasing day by day until we hit about 50. That's why I said in the ticket that I don't think we're doing anything crazy here: we aren't at extremely large scales, but the command used to be near-instant until we started adding more and more apps.

Our cluster CPU usage is at about 11% and memory at about 50%

jamespollard8 commented 3 years ago

First repro attempt: I'm not seeing this behavior on my GKE cluster.

$ time cf apps
Getting apps in org o / space s as admin...

name        requested state   processes   routes
catnip-00   started           web:2/2     catnip-00.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-01   started           web:2/2     catnip-01.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-02   started           web:2/2     catnip-02.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-03   started           web:2/2     catnip-03.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-04   started           web:2/2     catnip-04.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-05   started           web:2/2     catnip-05.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-06   started           web:2/2     catnip-06.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-07   started           web:2/2     catnip-07.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-08   started           web:2/2     catnip-08.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-09   started           web:2/2     catnip-09.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-10   started           web:2/2     catnip-10.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-11   started           web:2/2     catnip-11.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-12   started           web:2/2     catnip-12.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-13   started           web:2/2     catnip-13.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-14   started           web:2/2     catnip-14.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-15   started           web:2/2     catnip-15.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-16   started           web:2/2     catnip-16.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-17   started           web:2/2     catnip-17.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-18   started           web:2/2     catnip-18.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-19   started           web:2/2     catnip-19.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-20   started           web:2/2     catnip-20.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-21   started           web:2/2     catnip-21.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-22   started           web:2/2     catnip-22.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-23   started           web:2/2     catnip-23.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-24   started           web:2/2     catnip-24.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-25   started           web:2/2     catnip-25.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-26   started           web:2/2     catnip-26.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-27   started           web:2/2     catnip-27.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-28   started           web:2/2     catnip-28.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-29   started           web:0/2     catnip-29.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-30   started           web:2/2     catnip-30.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-31   started           web:2/2     catnip-31.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-32   started           web:2/2     catnip-32.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-33   started           web:2/2     catnip-33.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-34   started           web:2/2     catnip-34.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-35   started           web:2/2     catnip-35.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-36   started           web:2/2     catnip-36.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-37   started           web:2/2     catnip-37.apps.adventurous-scorpion.k8s-dev.relint.rocks
catnip-38   started           web:0/2     catnip-38.apps.adventurous-scorpion.k8s-dev.relint.rocks

real    0m24.207s

I stressed the system a bit by running cf apps ten times while constantly pushing 2 apps in parallel. Even then, I saw all 10 attempts finish within 25-35 seconds.
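
Roughly this kind of loop, for reference (a sketch only; app names and source paths are placeholders):

# keep the API busy with two parallel pushes
cf push stress-a -p path/to/app &
cf push stress-b -p path/to/app &
# time ten listings while the pushes run
for i in $(seq 1 10); do time cf apps; done
wait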

Next up I'll try reproducing this on an AKS cluster.

jamespollard8 commented 3 years ago

Next up I'll try reproducing this on an AKS cluster.

That said, I likely won't be able to prioritize this right away, especially since my first test went well on the GKE cluster with several catnip apps.

jamespollard8 commented 3 years ago

Okay, I pushed 40 catnips with 2 instances each onto an AKS cluster. I'm seeing a consistent cf apps runtime of ~23 seconds.

I wonder if the long-lived nature of your cluster is at play here. When you inspect the cf-system pods currently running, do you see any with particularly high memory usage?
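
Something like this should surface it (assumes metrics-server is installed on the cluster; cf-system is the default namespace for these components):

kubectl top pods -n cf-system --sort-by=memory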

For reference, here's what I'm seeing on my new AKS cluster with 40 catnips: [screenshot]

braunsonm commented 3 years ago

Not seeing anything unusual. This is during the call to cf apps

[screenshot]

jamespollard8 commented 3 years ago

The load on the cf-api-server does look a tiny bit high. I'd recommend trying the following, in order (see the sketch after this list):

  1. Delete the cf-api-server pod and let its deployment bring it back, then see how running cf apps looks. (I'd be surprised if this made a noticeable difference.)

  2. If the command is still very slow, scale the number of cf-api-server replicas to 2. (For reference, when SAP did their load testing of cf-for-k8s, they started with 5 replicas. That's definitely overkill for y'all, but still instructive. They also filed some fascinating details here: https://github.com/cloudfoundry/capi-k8s-release/issues/67)
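
A rough sketch of both steps (the deployment name and label selector are assumptions; check what your install actually uses):

# 1) delete the pod and let its deployment recreate it
kubectl -n cf-system delete pod -l app=cf-api-server
# 2) if it's still slow, scale the API server to two replicas
kubectl -n cf-system scale deployment/cf-api-server --replicas=2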

braunsonm commented 3 years ago

  1. After deleting the pod, I'm seeing even longer query times:

cf apps  0.10s user 0.04s system 0% cpu 3:26.82 total

  2. With an additional pod I'm still seeing very slow performance, actually a bit worse than expected. I also horizontally scaled our cluster up to 6 nodes to see whether a node with barely any requests would perform differently, and it didn't help.

cf apps  0.12s user 0.04s system 0% cpu 8:57.16 total
cf apps  0.12s user 0.02s system 0% cpu 4:21.89 total

Almost 9 minutes is pretty high. Other CLI operations seem fine (set-env, listing spaces, targeting, etc.).

If it means anything, we do not see any performance issues when navigating our CF deployments with Stratos; only the CLI seems to have this issue. Also, since filing this ticket we're now at ~50 apps with more on the way.

Appreciate you trying to get to the root of this!

tcdowney commented 3 years ago

We've previously done some investigation around this in https://github.com/cloudfoundry/capi-k8s-release/issues/17. Newer versions of the cf7 CLI make better use of API calls, which speeds things up a bit. It's possible @jamespollard8 has the newer version?

If you're still seeing issues, one thing to look into is Kubernetes API / Eirini performance. I've never investigated beyond that hunch, but those components are in the critical path for this request since they provide the count of actually running instances.

I don't expect this to be taxing on the cf-api-server, but it may be bottlenecked by those components.
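
One quick way to see where the time goes is to capture the CLI's API trace (CF_TRACE is a standard cf CLI setting; the log path here is arbitrary):

cf version
CF_TRACE=/tmp/cf-apps-trace.log cf apps
# each request/response is logged with a timestamp, so the slow endpoints stand out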

braunsonm commented 3 years ago

Thanks @tcdowney

These tests use cf version 7.2.0+be4a5ce2b.2020-12-10.

We're seeing what I think are normal API server latencies. Here's a metric from scraping the Kubernetes API (hard to see, but about 0.015):

[screenshot]

These are in seconds of latency, using the following query:

histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) by (resource, subresource, verb, scope, le))

Eirini doesn't seem to expose Prometheus metrics, but it is part of the service mesh, so I dug into its latency via Istio. The majority of requests to it seem to come from cf-api-clock:

[screenshot]

rate(istio_request_duration_milliseconds_bucket{destination_service_name="eirini"}[15m])

tcdowney commented 3 years ago

After discussing this on Slack, a promising theory is that this PR would improve things: https://github.com/cloudfoundry/cloud_controller_ng/pull/1934

For apps with non-running processes (typically stopped or crashed apps), the API waits to hear back from Eirini/Kubernetes for the status of those processes and times out.
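
If that theory holds, the affected apps should be the ones showing zero running instances in the listing. A rough way to pick them out (the awk field/row offsets assume the default cf7 apps output layout shown earlier in this thread):

cf apps | awk 'NR > 3 && $3 ~ /:0\// { print $1 }'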

Birdrock commented 3 years ago

This issue should be resolved in a tentative v2.1.0 release. v2.0.0 is in RC and contains several fixes. This enhancement requires some additional care, so in the interest of getting fixes out the door, we can land it in a following release.

jspawar commented 3 years ago

@braunsonm

We just merged and bumped changes to the cloud controller which we believe should address this for you. In our limited measurements, cf apps went from taking around 9-10 minutes to around 40 seconds, so it seems like a notable performance improvement.

More details on the measurements we made and how: https://www.pivotaltracker.com/story/show/176398421/comments/221918995

We're working on cutting a release to include these changes soon.

braunsonm commented 3 years ago

Thanks @jspawar! I enjoyed your comments on the story.

jamespollard8 commented 3 years ago

Included in v2.1.0