Load test for Windows MDM check-in

mna commented 10 months ago

Goal

User story
As Fleet the organization,
we want to test how Fleet performs w/ 20,000 hosts using Windows MDM features
so that customers with large Windows fleets can successfully use Windows MDM features.

Changes

Product

[x] Documentation changes: Google doc that describes how the Fleet server performs w/ ~~20,000~~ 10,000 hosts using Windows MDM features.
- We think CPU/memory/request throughput and the MySQL DB throughput are most likely to be affected. Redis is unlikely to be affected.
- If the Fleet server doesn't perform up to our standards, then add recommendation on the changes we should make.

Engineering

[x] Update osquery-perf to support simulating Windows MDM hosts and their check-ins at regular intervals. For this load test, it doesn't need to understand and handle Fleet's response (e.g. to install custom profiles and such); it only needs to make valid MDM check-in requests. It should be possible to reuse the mdmtest package used for integration tests. https://github.com/fleetdm/fleet/blob/2e497c2277b76379faeca1b57cdf3daf4ef1240a/pkg/mdm/mdmtest/windows.go#L83-L87
[x] Load testing: Perform load test w/ 20,000 simulated Windows hosts
- Found an issue with a missing DB index on mdm_windows_enrollments.mdm_device_id: at 5K hosts on the "up to 25K" reference architecture, the MySQL reader tops out at 100% CPU and Fleet becomes unusable.
[x] Fix: add missing DB index to mdm_windows_enrollments.mdm_device_id and run another load test

ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

Context

For https://github.com/fleetdm/fleet/issues/15408, we scheduled Windows MDM hosts to check in every minute with the Fleet server to initiate an MDM session.

Most of the time, those check-ins will be a single API request with nothing more to do, so from the host's perspective the load is minimal and shouldn't be a concern. At scale, however, frequent check-ins from thousands of enrolled Windows MDM hosts could be a concern for the Fleet server.

QA

Risk assessment

Requires load testing: TODO
Risk level: Low / High TODO
Risk description: TODO

Manual testing steps

Step 1
Step 2
Step 3

Testing notes

Confirmation

[ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
[ ] QA (@____): Added comment to user story confirming successful completion of QA.

mna commented 10 months ago

@noahtalerman Will assign to you for prioritization in the next sprint as discussed on Slack.

I haven't tagged it as a bug even though it is a follow-up to a ticket that was a bug (https://github.com/fleetdm/fleet/issues/15408), nor as a sub-task because the other ticket was not a story. Feel free to rearrange the labels if you feel it is wrongly categorized.

noahtalerman commented 9 months ago

@georgekarrv heads up, we moved this story to "Settled"