Status: Open. mostlikelee opened this issue 1 month ago.
@mostlikelee Thanks for filing this. I am prioritizing this to the drafting board and assigning it to @sharon-fdm for estimation.
"...so there may be a performance regression when adding too many agents too quickly to Fleet"

I'm seeing a series of issues come through since 4.56.0 with reports of new performance problems (#22291, #22122) that seem related. It's important we dig in and figure out why this is happening before it results in a production issue.
@lucasmrod: When agents enroll in Fleet, it sends every query/policy to those hosts, so if we're enrolling 100k hosts quickly, we generate a lot of DB writes. We should optimize this or look for a workaround in the load test environment.
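As a minimal sketch of what a load-test-side workaround could look like (not Fleet's actual code: enrollAgent, agentCount, and the 14/sec rate are illustrative assumptions), a token-bucket limiter can pace enrollments instead of firing them all at once:

```go
// Hypothetical throttle for the load-test enrollment loop: cap the rate at
// which simulated agents enroll so the initial query/policy distribution
// does not overwhelm the DB writer.
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// enrollAgent stands in for the real enrollment call made by osquery-perf.
func enrollAgent(i int) {
	fmt.Printf("enrolling agent %d at %s\n", i, time.Now().Format(time.RFC3339))
}

func main() {
	const agentCount = 100_000
	// ~14 enrollments/sec spreads 100k agents over roughly 2 hours,
	// the slowed cadence that worked in practice.
	limiter := rate.NewLimiter(rate.Limit(14), 1)
	ctx := context.Background()
	for i := 0; i < agentCount; i++ {
		if err := limiter.Wait(ctx); err != nil {
			return
		}
		enrollAgent(i)
	}
}
```

Pacing enrollments this way keeps the per-host query/policy writes arriving at a steady rate rather than in one large burst, which appears to be what the DB writer struggles with.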
"The original cadence worked in the past, so there may be a performance regression"
-- Tim

"I'm seeing a series of issues coming through since 4.56.0"
-- Luke
@iansltx: This looks like it may be a regression. We should go back to 4.55 and confirm.
@xpkoala: We currently spin up 8 containers to bring hosts up incrementally and avoid crashing, though at one point we didn't need to.
Next steps:
Estimation for steps 1 and 2 only: 5 points. Estimate the solution separately when we can.
Goal
Context
Using the existing script to scale up 100K osquery-perf agents over ~30 minutes resulted in sustained high Fleet CPU and DB reader CPU. As a workaround, I updated the script to scale up over 2 hours, which succeeded. The original cadence worked in the past, so there may be a performance regression when adding too many agents too quickly to Fleet. I suspect this is not affecting large customers, because they are not adding agents this quickly, but it has a high impact on test velocity.
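For rough numbers based on the figures above: 100,000 agents over ~30 minutes is about 100,000 / 1,800 s ≈ 55 enrollments per second, while spreading them over 2 hours drops that to 100,000 / 7,200 s ≈ 14 per second, roughly a 4x reduction in enrollment (and therefore DB write) pressure during ramp-up.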
Changes
Product
Engineering
QA
Risk assessment
Manual testing steps
Testing notes
Confirmation