fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Sustained High CPU when scaling loadtest #22367

Open mostlikelee opened 1 month ago

mostlikelee commented 1 month ago

Goal

User story
As an engineer running Fleet loadtests
I want to create a stable large loadtest environment in a short amount of time
so that I can receive loadtest feedback quickly and at lower cost.

Context

Using the existing script to scale up 100K osquery-perf agents over ~30 minutes resulted in sustained high Fleet CPU and DB reader CPU. As a workaround, I updated the script to scale up over 2 hours, which succeeded. The original cadence worked in the past, so there may be a performance regression when adding too many agents to Fleet too quickly. I imagine this is not affecting large customers because they are not adding agents this quickly, but it has a high impact on test velocity.
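For illustration, a minimal sketch of the staggered ramp-up shape described above, in Go. `enrollBatch` is a hypothetical stand-in for however the loadtest script actually starts a batch of osquery-perf agents, and the batch size and ramp period are illustrative, not the real script's values:

```go
package main

import (
	"fmt"
	"time"
)

// enrollBatch is a hypothetical placeholder for whatever the loadtest script
// does to start a batch of simulated osquery-perf hosts (e.g. launching a
// container). It is not part of Fleet or the actual script.
func enrollBatch(batch, size int) {
	fmt.Printf("starting batch %d with %d simulated hosts\n", batch, size)
}

func main() {
	const (
		totalHosts = 100_000
		batchSize  = 5_000          // illustrative batch size
		rampPeriod = 2 * time.Hour  // workaround cadence; ~30 min triggered the issue
	)

	batches := totalHosts / batchSize
	interval := rampPeriod / time.Duration(batches)

	for i := 0; i < batches; i++ {
		enrollBatch(i, batchSize)
		if i < batches-1 {
			time.Sleep(interval) // spread enrollment load on Fleet and the DB reader
		}
	}
}
```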

Changes

Product

Engineering

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Manual testing steps

  1. Step 1
  2. Step 2
  3. Step 3

Testing notes

Confirmation

  1. [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. [ ] QA (@____): Added comment to user story confirming successful completion of QA.
lukeheath commented 1 month ago

@mostlikelee Thanks for filing this. I am prioritizing this to the drafting board and assigning it to @sharon-fdm for estimation.

> so there may be a performance regression when adding too many agents too quickly to Fleet

I'm seeing a series of issues since 4.56.0 reporting new performance problems (#22291, #22122) that seem related. It's important we dig in and figure out why this is happening before it results in a production issue.

RachelElysia commented 1 month ago

@lucasmrod: When an agent enrolls in Fleet, Fleet sends every query/policy to that host, so if we're enrolling 100k hosts quickly, we get a lot of DB writes. We should optimize this or look for a workaround in the loadtest environment.
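One rough way to confirm this write amplification during a fast ramp would be to poll MySQL's global status counters while hosts enroll. A minimal sketch, assuming direct access to the loadtest MySQL instance and the standard go-sql-driver; the DSN and sampling interval are placeholders:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// readCounter reads a single MySQL global status counter such as Com_insert.
func readCounter(db *sql.DB, name string) (int64, error) {
	var n string
	var v int64
	// SHOW statements don't take ordinary placeholders reliably, so the
	// counter name is interpolated directly (trusted input here).
	q := fmt.Sprintf("SHOW GLOBAL STATUS LIKE '%s'", name)
	if err := db.QueryRow(q).Scan(&n, &v); err != nil {
		return 0, err
	}
	return v, nil
}

func main() {
	// Placeholder DSN for the loadtest MySQL instance.
	db, err := sql.Open("mysql", "fleet:insecure@tcp(127.0.0.1:3306)/fleet")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	prev := map[string]int64{}
	for {
		for _, c := range []string{"Com_insert", "Com_update", "Com_replace"} {
			v, err := readCounter(db, c)
			if err != nil {
				log.Fatal(err)
			}
			if p, ok := prev[c]; ok {
				fmt.Printf("%s: +%d in last 30s\n", c, v-p)
			}
			prev[c] = v
		}
		time.Sleep(30 * time.Second)
	}
}
```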

> The original cadence worked in the past, so there may be a performance regression
>
> -- Tim

> I'm seeing a series of issues coming through since 4.56.0
>
> -- Luke

@iansltx: This looks like it may be a regression. We should go back to 4.55 and confirm.

@xpkoala: We currently spin up 8 containers to enroll hosts incrementally and avoid crashing, but at one point in time we didn't need to.

Next steps:

  1. Reproduce: Spin up 100k hosts with non-incremental host enrollment on 4.58 and 4.55 to compare (@xpkoala volunteering; see the enrollment-rate sketch below)
  2. (Timebox) Engineering: discovery of root issue(s)
  3. Solutions

Estimation for steps 1 and 2 only: 5 points. Estimate the solution separately when we can.
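For the reproduce step, a minimal sketch of an enrollment-rate watcher that could be run against both the 4.55 and 4.58 environments so the ramps can be compared on equal terms. It assumes the `GET /api/v1/fleet/hosts/count` endpoint is reachable with an API token; the server URL and token below are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// countResponse mirrors the JSON shape returned by GET /api/v1/fleet/hosts/count.
type countResponse struct {
	Count int `json:"count"`
}

// hostCount fetches the current enrolled host count from the Fleet server.
func hostCount(serverURL, token string) (int, error) {
	req, err := http.NewRequest("GET", serverURL+"/api/v1/fleet/hosts/count", nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var cr countResponse
	if err := json.NewDecoder(resp.Body).Decode(&cr); err != nil {
		return 0, err
	}
	return cr.Count, nil
}

func main() {
	const (
		serverURL = "https://loadtest.example.com" // placeholder Fleet server URL
		token     = "REPLACE_ME"                   // placeholder API token
	)

	prev := 0
	for {
		n, err := hostCount(serverURL, token)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s enrolled=%d delta=%d\n", time.Now().Format(time.RFC3339), n, n-prev)
		prev = n
		time.Sleep(time.Minute)
	}
}
```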