Open mna opened 9 months ago
@noahtalerman Will assign to you for prioritization in the next sprint as discussed on Slack.
I haven't tagged it as a bug even though it is a follow-up to a ticket that was a bug (https://github.com/fleetdm/fleet/issues/15408), nor as a sub-task because the other ticket was not a story. Feel free to rearrange the labels if you feel it is wrongly categorized.
@georgekarrv heads up, we moved this story to "Settled"
Hey team! Please add your planning poker estimate with Zenhub @ghernandez345 @gillespi314 @mna @roperzh
Note to self: set -mdm_check_in_interval=1m for tests to be realistic (same check-in frequency as normal for Windows MDM).
Bumping the estimate to 8 since I found a missing DB index on mdm_windows_enrollments.mdm_device_id that's causing Fleet to become unusable at 5K hosts enrolled in Windows MDM on the reference architecture in load testing.
@noahtalerman Google doc of the load tests and results (for now only viewable for Fleeties): https://docs.google.com/document/d/1uIuXIdo8AP7JzoNvHRCBmHAw1OKBDM4Xj8CuRhsJA4Q/edit?usp=sharing (will edit the ticket's description to link it there too).
tl; dr;
Keeping in mind of course that those are osquery-perf-simulated hosts and do not exactly reflect real hosts' behavior and load.
There was a bug in osquery-perf related to scheduled queries that might have played a bit of a role in the results. I need to understand it better before confirming: https://github.com/fleetdm/fleet/pull/17576/files#r1522096735
@noahtalerman I did a "quick" sanity check (as quick as can be for a load test - the ramp up to 15K hosts with Windows MDM enabled) to see if the newly-discovered old osquery-perf bug had a significant impact in the load test results and conclusions (deploying this branch).
Seeing how the MySQL reader is still peaking at 100% CPU after 15K hosts, it's safe to say that the results were not significantly impacted and still stand. See also the MySQL reader's metrics and dimensions, which are very similar to the ones seen in the load test at 15K hosts.
So this concludes work on this ticket, results are valid.
it's safe to say that the results were not significantly impacted and still stand.
@mna thanks! The Google doc looks great.
Yeah, any load time over 5 seconds is too slow.
I pulled out these highlights.
Sounds like the next step is to update our reference architecture. I think we want one reference architecture that supports Fleet w/ and w/o Windows MDM (instead of having two options).
I filed a g-customer-success request for this here: https://github.com/fleetdm/confidential/issues/5765
As for this issue (#16120), I agree we can call it done.
cc @rfairburn ^^
Goal
Changes
Product
~~20,000~~ 10,000 hosts using Windows MDM features.
Engineering
[x] Update osquery-perf to support simulating Windows MDM and check-ins at regular intervals (for this load test, it doesn't need to understand and handle Fleet's response, e.g. to install custom profiles and such, just to make valid MDM check-in requests). It should be possible to reuse the mdmtest package used for integration tests. https://github.com/fleetdm/fleet/blob/2e497c2277b76379faeca1b57cdf3daf4ef1240a/pkg/mdm/mdmtest/windows.go#L83-L87
[x] Load testing: Perform load test w/ 20,000 simulated Windows hosts
    Missing DB index on mdm_windows_enrollments.mdm_device_id: at 5K hosts on the "up to 25K" reference architecture, the MySQL reader is topping out at 100% CPU and Fleet becomes unusable.
[x] Fix: add missing DB index to mdm_windows_enrollments.mdm_device_id and run another load test
Context
For https://github.com/fleetdm/fleet/issues/15408, we scheduled Windows MDM hosts to check in every minute with the Fleet server to initiate an MDM session.
Most of the time, those check-ins will be a single API request with nothing more to do, so from the host's perspective the load is minimal and shouldn't be a concern. At scale, however, those frequent check-ins from thousands of enrolled Windows MDM hosts could be a concern for the Fleet server.
QA
Risk assessment
Manual testing steps
Testing notes
Confirmation