fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Sustained High CPU when scaling loadtest #22367

Open mostlikelee opened 1 month ago

mostlikelee commented 1 month ago

Goal

User story
As an engineer running Fleet loadtests
I want to create a stable large loadtest environment in a short amount of time
so that I can receive loadtest feedback quickly and at lower cost.

Context

Using the existing script to scale up 100K osquery-perf agents over ~30 minutes resulted in sustained high Fleet CPU and DB reader CPU. As a workaround, I updated the script to scale up over 2 hours, which succeeded. The original cadence worked in the past, so there may be a performance regression when adding too many agents to Fleet too quickly. I imagine this is not affecting large customers because they are not adding agents this quickly, but it has a high impact on test velocity.
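For illustration, a minimal sketch of the staggered ramp-up shape described above, in Go. `enrollBatch` is a hypothetical stand-in for however the loadtest script actually starts a batch of osquery-perf agents, and the batch size and ramp period are illustrative, not the real script's values:

```go
package main

import (
	"fmt"
	"time"
)

// enrollBatch is a hypothetical placeholder for whatever the loadtest script
// does to start a batch of simulated osquery-perf hosts (e.g. launching a
// container). It is not part of Fleet or the actual script.
func enrollBatch(batch, size int) {
	fmt.Printf("starting batch %d with %d simulated hosts\n", batch, size)
}

func main() {
	const (
		totalHosts = 100_000
		batchSize  = 5_000          // illustrative batch size
		rampPeriod = 2 * time.Hour  // workaround cadence; ~30 min triggered the issue
	)

	batches := totalHosts / batchSize
	interval := rampPeriod / time.Duration(batches)

	for i := 0; i < batches; i++ {
		enrollBatch(i, batchSize)
		if i < batches-1 {
			time.Sleep(interval) // spread enrollment load on Fleet and the DB reader
		}
	}
}
```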

Changes

Product

Engineering

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Manual testing steps

  1. Step 1
  2. Step 2
  3. Step 3

Testing notes

Confirmation

  1. [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. [ ] QA (@____): Added comment to user story confirming successful completion of QA.
lukeheath commented 1 month ago

@mostlikelee Thanks for filing this. I am prioritizing this to the drafting board and assigning it to @sharon-fdm for estimation.

> so there may be a performance regression when adding too many agents too quickly to Fleet

I'm seeing a series of issues since 4.56.0 reporting new performance problems (#22291, #22122) that seem related. It's important we dig in and figure out why this is happening before it results in a production issue.

RachelElysia commented 1 month ago

@lucasmrod: When an agent enrolls in Fleet, Fleet sends every query/policy to that host, so if we're enrolling 100k hosts quickly, we get a lot of DB writes. We should optimize this or look for a workaround in the loadtest environment.
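One rough way to confirm this write amplification during a fast ramp would be to poll MySQL's global status counters while hosts enroll. A minimal sketch, assuming direct access to the loadtest MySQL instance and the standard go-sql-driver; the DSN and sampling interval are placeholders:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// readCounter reads a single MySQL global status counter such as Com_insert.
func readCounter(db *sql.DB, name string) (int64, error) {
	var n string
	var v int64
	// SHOW statements don't take ordinary placeholders reliably, so the
	// counter name is interpolated directly (trusted input here).
	q := fmt.Sprintf("SHOW GLOBAL STATUS LIKE '%s'", name)
	if err := db.QueryRow(q).Scan(&n, &v); err != nil {
		return 0, err
	}
	return v, nil
}

func main() {
	// Placeholder DSN for the loadtest MySQL instance.
	db, err := sql.Open("mysql", "fleet:insecure@tcp(127.0.0.1:3306)/fleet")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	prev := map[string]int64{}
	for {
		for _, c := range []string{"Com_insert", "Com_update", "Com_replace"} {
			v, err := readCounter(db, c)
			if err != nil {
				log.Fatal(err)
			}
			if p, ok := prev[c]; ok {
				fmt.Printf("%s: +%d in last 30s\n", c, v-p)
			}
			prev[c] = v
		}
		time.Sleep(30 * time.Second)
	}
}
```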

> The original cadence worked in the past, so there may be a performance regression
>
> -- Tim

> I'm seeing a series of issues coming through since 4.56.0
>
> -- Luke

@iansltx: This looks like it may be a regression. We should go back to 4.55 and confirm.

@xpkoala: We currently spin up 8 containers to enroll hosts incrementally and avoid crashing, but at one point in time we didn't need to.

Next steps:

  1. Reproduce: Spin up 100k hosts with non-incremental host enrollment on 4.58 and 4.55 to compare (@xpkoala volunteering; see the enrollment-rate sketch below)
  2. (Timebox) Engineering: discovery of root issue(s)
  3. Solutions

Estimation for steps 1 and 2 only: 5 points. Estimate the solution separately when we can.
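For the reproduce step, a minimal sketch of an enrollment-rate watcher that could be run against both the 4.55 and 4.58 environments so the ramps can be compared on equal terms. It assumes the `GET /api/v1/fleet/hosts/count` endpoint is reachable with an API token; the server URL and token below are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// countResponse mirrors the JSON shape returned by GET /api/v1/fleet/hosts/count.
type countResponse struct {
	Count int `json:"count"`
}

// hostCount fetches the current enrolled host count from the Fleet server.
func hostCount(serverURL, token string) (int, error) {
	req, err := http.NewRequest("GET", serverURL+"/api/v1/fleet/hosts/count", nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var cr countResponse
	if err := json.NewDecoder(resp.Body).Decode(&cr); err != nil {
		return 0, err
	}
	return cr.Count, nil
}

func main() {
	const (
		serverURL = "https://loadtest.example.com" // placeholder Fleet server URL
		token     = "REPLACE_ME"                   // placeholder API token
	)

	prev := 0
	for {
		n, err := hostCount(serverURL, token)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s enrolled=%d delta=%d\n", time.Now().Format(time.RFC3339), n, n-prev)
		prev = n
		time.Sleep(time.Minute)
	}
}
```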