cloudfoundry / diego-release

BOSH Release for Diego
Apache License 2.0
201 stars 213 forks source link

[BBS] Use scheduling info instead of the whole desiredLRP #883

Closed klapkov closed 6 months ago

klapkov commented 8 months ago

Use scheduling info instead of the whole desiredLRP

Summary

There are two places where I see we can make some optimisations: When crashing a ActualLRP: https://github.com/cloudfoundry/bbs/blob/main/controllers/actual_lrp_lifecycle_controller.go#L207

In requestAuction() which is called when we evacuate a Claimed or Running actualLRP: https://github.com/cloudfoundry/bbs/blob/main/controllers/evacuation_controller.go#L387

In both cases we fetch a whole desiredLRP and use only the scheduling info. This could be avoided by directly fetching the scheduling info only. This way we in cases where the RunInfo of an desiredLRP is really large in size, we will not waist resources and time fetching in when we do not use it. The change is small, but I can see some improvements to be gained here.

The only point against the change I see is that sadly the DesiredLRPSchedulingInfo returns an array. so we need to check if it is empty and use the first element if not. This maybe introduce a little bit of confusion and the code is not as clean as before. Anyhow, if anyone has ideas on how to better do this, please leave a comment. Will try to collect some performance data in the following days to share as well.

Diego repo

https://github.com/cloudfoundry/bbs

Describe alternatives you've considered (optional)

Proposal PR: https://github.com/cloudfoundry/bbs/pull/79

winkingturtle-vmw commented 6 months ago

@klapkov Can you please quantify the improvements? I'd like to see a before/after analysis to better understand what we are gaining with this change.

dsabeti commented 6 months ago

Hey @klapkov. I reviewed the PR with @ebroberson , and besides a few style changes (e.g. test descriptions, variable names), we think the PR looks good. We'd definitely like to understand the performance improvements as @winkingturtle-vmw mentioned. Once we see some data about that, we'll be happy to merge the PR.

klapkov commented 6 months ago

Hello @dsabeti and @winkingturtle-vmw,

Sorry for the late response. I did some tests to see what would we gain with this change.

What I did was deploy 150 applications - each with 3 instances and around 2000 app security groups. This means that app of those applications have a very large run_info. Then I tracked how much time it takes to fetch a desiredLRP in the evacuation controller like this: https://github.com/klapkov/bbs/commit/b42b9f6e2bfaf948b8cc488c146ef15cb5115840

I did this for both the current way with desiredLRPByProcessGuid and then with the new endpoint. I triggered a evacuation and observed the logs.

Here are the results from 500 evacuating lrp's:

Using desiredLRPByProcessGuid:

  | Feb 28, 2024 @ 23:38:38.156 | bbs | {"timestamp":"2024-02-28T23:38:38.156950563Z","level":"info","source":"bbs","message":"bbs.request.evacuate-claimed-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61034.1","time":1.312609736}}

  | Feb 28, 2024 @ 23:38:38.052 | bbs | {"timestamp":"2024-02-28T23:38:38.052007918Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61096.1","time":1.395905079}}

  | Feb 28, 2024 @ 23:38:38.051 | bbs | {"timestamp":"2024-02-28T23:38:38.051072004Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61093.1","time":0.969127116}}

  | Feb 28, 2024 @ 23:38:38.042 | bbs | {"timestamp":"2024-02-28T23:38:38.042025336Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61073.1","time":1.556825014}}

  | Feb 28, 2024 @ 23:38:38.022 | bbs | {"timestamp":"2024-02-28T23:38:38.022147510Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61042.1","time":2.840410677}}

  | Feb 28, 2024 @ 23:38:37.910 | bbs | {"timestamp":"2024-02-28T23:38:37.910684925Z","level":"info","source":"bbs","message":"bbs.request.evacuate-claimed-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"60977.1","time":2.8429662799999997}}

  | Feb 28, 2024 @ 23:38:37.879 | bbs | {"timestamp":"2024-02-28T23:38:37.879589211Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61063.1","time":1.235602573}}

  | Feb 28, 2024 @ 23:38:37.879 | bbs | {"timestamp":"2024-02-28T23:38:37.879539399Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"60968.1","time":1.688915266}}

  | Feb 28, 2024 @ 23:38:37.847 | bbs | {"timestamp":"2024-02-28T23:38:37.847163637Z","level":"info","source":"bbs","message":"bbs.request.evacuate-claimed-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61068.1","time":2.727919954}}

  | Feb 28, 2024 @ 23:38:37.846 | bbs | {"timestamp":"2024-02-28T23:38:37.846022114Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"60984.1","time":1.281320477}}

  | Feb 28, 2024 @ 23:38:37.846 | bbs | {"timestamp":"2024-02-28T23:38:37.846547979Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"60980.1","time":2.731387767}}

  | Feb 28, 2024 @ 23:38:37.846 | bbs | {"timestamp":"2024-02-28T23:38:37.846022090Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61019.1","time":2.120372531}}

  | Feb 28, 2024 @ 23:38:37.820 | bbs | {"timestamp":"2024-02-28T23:38:37.820867847Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61059.1","time":1.179043058}}

  | Feb 28, 2024 @ 23:38:37.816 | bbs | {"timestamp":"2024-02-28T23:38:37.816351230Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"60960.1","time":2.585209338}}

  | Feb 28, 2024 @ 23:38:37.816 | bbs | {"timestamp":"2024-02-28T23:38:37.816698606Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evaccontroller","data":{"session":"61001.1","time":1.173293566}}

The average time for 500 LRP's was: 1.48s per LRP

And not here is what I saw using the new endpoint:

  | Feb 29, 2024 @ 09:57:36.834 | bbs | {"timestamp":"2024-02-29T09:57:36.834070858Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69788.1","time":0.053956742}}

  | Feb 29, 2024 @ 09:57:36.832 | bbs | {"timestamp":"2024-02-29T09:57:36.832609309Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69840.1","time":0.05343809}}

  | Feb 29, 2024 @ 09:57:36.831 | bbs | {"timestamp":"2024-02-29T09:57:36.831256636Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69811.1","time":0.048165686}}

  | Feb 29, 2024 @ 09:57:36.831 | bbs | {"timestamp":"2024-02-29T09:57:36.831095012Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69803.1","time":0.063661647}}

  | Feb 29, 2024 @ 09:57:36.829 | bbs | {"timestamp":"2024-02-29T09:57:36.829210118Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69819.1","time":0.063480223}}

  | Feb 29, 2024 @ 09:57:36.828 | bbs | {"timestamp":"2024-02-29T09:57:36.828220971Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69781.1","time":0.055027243}}

  | Feb 29, 2024 @ 09:57:36.828 | bbs | {"timestamp":"2024-02-29T09:57:36.828688531Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69793.1","time":0.055143658}}

  | Feb 29, 2024 @ 09:57:36.827 | bbs | {"timestamp":"2024-02-29T09:57:36.827719274Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69818.1","time":0.061275584}}

  | Feb 29, 2024 @ 09:57:36.826 | bbs | {"timestamp":"2024-02-29T09:57:36.826195576Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69785.1","time":0.056747965}}

  | Feb 29, 2024 @ 09:57:36.826 | bbs | {"timestamp":"2024-02-29T09:57:36.826666573Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69762.1","time":0.051233733}}

  | Feb 29, 2024 @ 09:57:36.825 | bbs | {"timestamp":"2024-02-29T09:57:36.825402685Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69801.1","time":0.048784051}}

  | Feb 29, 2024 @ 09:57:36.824 | bbs | {"timestamp":"2024-02-29T09:57:36.824379962Z","level":"info","source":"bbs","message":"bbs.request.evacuate-running-actual-lrp.elapsed-time-for-evacontroller","data":{"session":"69776.1","time":0.046138081}}

The average time was 0.058s.

So we can see that there is some time we can save here. I also imagine that the load on BBS and the diegoDB will be decreased as well, since we would not need to fetch the large run_info in such cases.

Please let me know if you need anything more. We think this change is worth it.

Regards, Konstantin Lapkov

ebroberson commented 6 months ago

This looks great! I made some small changes for styling and have merged the PR.

klapkov commented 6 months ago

Thanks @ebroberson !