fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Load test for Windows MDM check-in #16120

Open mna opened 9 months ago

mna commented 9 months ago

Goal

User story
As Fleet the organization,
we want to test how Fleet performs w/ 20,000 hosts using Windows MDM features
so that customers with large Windows fleets can successfully use Windows MDM features.

Changes

Product

Engineering

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

Context

For https://github.com/fleetdm/fleet/issues/15408, we scheduled Windows MDM hosts to check in with the Fleet server every minute to initiate an MDM session.

Most of the time, those check-ins will be a single API request with nothing more to do, so from the host's perspective, the load is minimal and shouldn't be a concern, but at scale, those frequent check-ins with thousands of enrolled Windows MDM hosts could be a concern for the Fleet server.
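For a rough sense of scale, the arithmetic behind this concern can be sketched as below. It assumes one request per check-in and check-ins spread evenly across the interval; both are simplifications, and the function name is illustrative only, not part of Fleet.

```python
def checkin_requests_per_second(hosts: int, interval_seconds: float,
                                requests_per_checkin: int = 1) -> float:
    """Average request rate if check-ins are spread evenly over the interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return hosts * requests_per_checkin / interval_seconds


# One check-in per minute, as scheduled for Windows MDM hosts.
for hosts in (1_000, 5_000, 20_000):
    rate = checkin_requests_per_second(hosts, interval_seconds=60)
    print(f"{hosts:>6} hosts -> {rate:.1f} check-in requests/s")
```

At the 20,000-host target this averages to roughly 333 requests per second before any follow-up MDM traffic, which real hosts may add.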

QA

Risk assessment

Manual testing steps


Testing notes

Confirmation

  1. [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. [ ] QA (@____): Added comment to user story confirming successful completion of QA.

mna commented 9 months ago

@noahtalerman Will assign to you for prioritization in the next sprint as discussed on Slack.

I haven't tagged it as a bug even though it is a follow-up to a ticket that was a bug (https://github.com/fleetdm/fleet/issues/15408), nor as a sub-task because the other ticket was not a story. Feel free to rearrange the labels if you feel it is wrongly categorized.

noahtalerman commented 8 months ago

@georgekarrv heads up, we moved this story to "Settled"

georgekarrv commented 8 months ago

Hey team! Please add your planning poker estimate with Zenhub @ghernandez345 @gillespi314 @mna @roperzh

mna commented 8 months ago

Note to self: set -mdm_check_in_interval=1m for tests to be realistic (same check-in frequency as normal for Windows MDM).

mna commented 7 months ago

Bumping the estimate to 8 since I found a missing DB index on mdm_windows_enrollments.mdm_device_id that's causing Fleet to become unusable at 5K hosts enrolled in Windows MDM on the reference architecture in load testing.
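To illustrate why a missing index on a lookup column hurts at scale, here is a minimal sketch using SQLite's in-memory database as a stand-in for MySQL. Only the table and column names come from the comment above; the other columns, the index name, and the query are assumptions for illustration, and this is not Fleet's actual migration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema: only the table name and mdm_device_id
# column are taken from the issue; everything else is assumed.
conn.execute(
    "CREATE TABLE mdm_windows_enrollments ("
    " id INTEGER PRIMARY KEY,"
    " mdm_device_id TEXT NOT NULL,"
    " host_uuid TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO mdm_windows_enrollments (mdm_device_id, host_uuid) VALUES (?, ?)",
    ((f"device-{i}", f"host-{i}") for i in range(5000)),
)

LOOKUP = "SELECT id FROM mdm_windows_enrollments WHERE mdm_device_id = ?"


def query_plan(db: sqlite3.Connection, sql: str, params: tuple) -> str:
    """Return SQLite's query plan details for a statement as one string."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text


# Without an index, each lookup scans the whole table.
plan_before = query_plan(conn, LOOKUP, ("device-42",))

# Hypothetical index name; with it, the lookup becomes an index search.
conn.execute(
    "CREATE INDEX idx_mdm_device_id ON mdm_windows_enrollments (mdm_device_id)"
)
plan_after = query_plan(conn, LOOKUP, ("device-42",))

print("before:", plan_before)
print("after: ", plan_after)
```

With every enrolled host checking in each minute, a per-check-in full table scan grows with the number of enrollments, which is consistent with the database becoming the bottleneck as host counts rise.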

mna commented 7 months ago

@noahtalerman Google doc of the load tests and results (for now only viewable for Fleeties): https://docs.google.com/document/d/1uIuXIdo8AP7JzoNvHRCBmHAw1OKBDM4Xj8CuRhsJA4Q/edit?usp=sharing (will edit the ticket's description to link it there too).

tl; dr;

Keeping in mind of course that those are osquery-perf-simulated hosts and do not exactly reflect real hosts' behavior and load.

mna commented 7 months ago

There was a bug in osquery-perf related to scheduled queries, might have played a bit of a role in the results, need to understand better before confirming: https://github.com/fleetdm/fleet/pull/17576/files#r1522096735

mna commented 7 months ago

@noahtalerman I did a "quick" sanity check (as quick as can be for a load test - the ramp up to 15K hosts with Windows MDM enabled) to see if the newly-discovered old osquery-perf bug had a significant impact on the load test results and conclusions (deploying this branch).

Seeing how the MySQL reader is still peaking at 100% CPU after 15K hosts, it's safe to say that the results were not significantly impacted and still stand. See also the MySQL reader's metrics and dimensions, which are very similar to the ones seen in the load test at 15K hosts.

Screenshot from 2024-03-13 09-46-31

So this concludes work on this ticket, results are valid.

noahtalerman commented 7 months ago

> it's safe to say that the results were not significantly impacted and still stand.

@mna thanks! The Google doc looks great.

Screenshot 2024-03-13 at 12 11 27 PM

Screenshot 2024-03-13 at 12 12 48 PM

Yeah, any load time over 5 seconds is too slow.

Screenshot 2024-03-13 at 12 13 41 PM

Screenshot 2024-03-13 at 12 14 23 PM

I pulled out these highlights.

Sounds like the next step is to update our reference architecture. I think we want one reference architecture that supports Fleet w/ and w/o Windows MDM (instead of having two options).

I filed a g-customer-success request for this here: https://github.com/fleetdm/confidential/issues/5765

As for this issue (#16120), I agree we can call it done.

noahtalerman commented 7 months ago

cc @rfairburn ^^