Load test sending MDM profiles to 2,500 macOS hosts

zhumo commented 1 year ago

UPDATE: Results of the load test are in this public Google doc here (noahtalerman 2023-09-29).

Goal

User story
As an IT admin,
I want to be confident that my Fleet instance will be able to handle sending configuration profiles to 2,500 Mac hosts
so that I can roll out Fleet's MDM features with confidence that the migration process will be smooth.

Requirements

Load test enrolling 2,500 Mac hosts with Mac MDM features turned on in Fleet.
- UPDATE: This was done in #11531
Load test turning on MDM features for 2,500 Mac hosts.
- UPDATE: This was done in #11531
Load test adding a profile to a team, removing a profile, and changing a profile. We want to make sure all hosts get the right MDM command.
- UPDATE: This was done in #11531
For each of the above, answer the following: Do these events consume too many resources? Do they cause the Fleet database to get stuck?
Test whether Apple servers rate limit these requests. Note that we'll be sending Apple requests with an invalid host serial or UUID.
- Use an Apple ID (to generate APNs cert) that isn't used for any other Fleet deployment. Docs on how to generate APNs cert are here.
- Fleet (not osquery-perf) has a flag, FLEET_DEV_MDM_APPLE_DISABLE_PUSH, that turns off sending push notifications to Apple servers. By default, Fleet sends them.
- In a previous load test using osquery-perf, requests were coming back from Apple as a success.
Make sure Apple doesn't just always return a success for "simulated" (osquery-perf) hosts.
Create a public Google doc with the load test results

NOTE: It's expensive (literally) to test whether 2,500 devices actually get the push notification from APNs. In this pass, we think it's ok to rely on Apple servers actually sending the notification. This test will make sure Fleet is able to tell Apple servers to send the right commands.

Changes

This story doesn't include any changes to the Fleet product.

QA

Risk assessment

[x] Requires load testing TODO

Risk level: Low / High TODO

Risk description: TODO

Automated:

Fleet: Cover / Will not cover
QAWolf: Cover / Will not cover

Manual testing steps

Ensure macOS MDM is turned on and configured for the test environment
Enroll 2,500 macOS hosts into the test environment and assign them to the same team
Install the MDM enrollment profile on all hosts and verify their MDM status is shown correctly in the Fleet test environment (assuming these are all manual enrollments, the status should be On (manual))
In Controls > macOS settings > Custom settings, upload any valid .mobileconfig file to the team to which the test hosts are assigned
Validate that the profiles are deployed successfully via both the Fleet UI and via live query
Validate that no APNs rate limit appears to have been exceeded
Investigate any errors to determine if they were related to the load test or can be isolated

Testing notes

Confirmation

[ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
[ ] QA (@____): Added comment to user story confirming successful completion of QA.

lukeheath commented 1 year ago

@zhumo We already load-tested 2,500 hosts with MDM in the last sprint in preparation for customer migrations with osquery-perf. Is this a duplicate, or do we need to run another load test with different parameters?

zhumo commented 1 year ago

Hey @lukeheath, It looks like the original intent of this issue was obscured through various changes. I've updated it. My understanding of this is that we previously tested for delivering profiles at scale. But I learned that in other deployments, it was the rapid onboarding of hosts that made the MDM instance fall over. So I think this issue was to try to account for that.

@dherder are there certain parameters you might recommend for testing?

dherder commented 1 year ago

When considering what to load test in an enrollment scenario, lots of things can fall apart:

- Insufficiently provisioned Wi-Fi access points can saturate your network. This is only realistic to test for customers enrolling devices at a single location (not everyone is remotely dispersed).
- APNs servers might be down or throttling. Can we explicitly tell the user that APNs is OK?
- ABM might be down or refuse to send the DEP profile. Again, can we notify the end user that the service is up and running?
- Finally, a large number of enrollments at a single time might cause some unforeseen result in the Fleet server. If that happens, how does the end user who is enrolling devices, or the Fleet admin, get visibility for troubleshooting?

Here's a thread where this same thing happened: https://community.jamf.com/t5/jamf-pro/enrollments-not-completing-today/m-p/230571

lukeheath commented 1 year ago

Thanks for the info @dherder

@lucasmrod When you ran the previous MDM load tests, would it have load-tested enrollment or only sending profiles? Do you know if sending these profiles to osquery-perf hosts triggered APN calls?

lucasmrod commented 1 year ago

When you ran the previous MDM load tests, would it have load-tested enrollment or only sending profiles?

Yes. It did load-test MDM enrollment, which consists of SCEP enroll + Authenticate message + TokenUpdate message.

Do you know if sending these profiles to osquery-perf hosts triggered APN calls?

No. I explicitly ran Fleet with FLEET_DEV_MDM_APPLE_DISABLE_PUSH=1 because we are using simulated MDM devices with fake UUIDs and fake APNS Tokens. (I didn't want Apple to denylist or flag our account because of so many APNS requests with invalid tokens.)

zhumo commented 1 year ago

Hi @noahtalerman, this story did not make it into the current sprint, so I'm de-prioritizing it. Please bring it back to FF if necessary.

noahtalerman commented 1 year ago

Hey @roperzh I updated this story's description based on our conversation. When you get the chance, can you please review the description?

Please let me know if I'm missing anything. We want to bring this to estimation tomorrow.

cc @georgekarrv

roperzh commented 1 year ago

@noahtalerman @georgekarrv the issue description looks great 👏

noahtalerman commented 1 year ago

Alright! @georgekarrv I moved this story over to the designed column and assigned you.

georgekarrv commented 1 year ago

Hey team! Please add your planning poker estimate with Zenhub @ghernandez345 @gillespi314 @marcosd4h @roperzh

lukeheath commented 1 year ago

@sabrinabuckets Would you please populate the QA section of this story before it comes into a release? Thank you!

noahtalerman commented 1 year ago

Hey @zhumo this didn't make it into the upcoming sprint. I added this back to FF because we should get to this at the start of next sprint to give us time to react to any issues.

georgekarrv commented 1 year ago

Previous related load test without APNs interaction: https://github.com/fleetdm/confidential/issues/2644

noahtalerman commented 1 year ago

Noah and Martin: If we did this, we would get some level of confidence about rate limiting with bogus claimer. We've already tested load on Fleet server with 2,500 hosts.

Noah and Martin: If/when we do a test with 2,500 real Macs, we won't learn anything new about the load on the Fleet server for manual enrollment and installing profiles. We would get a higher level of confidence about rate limiting by Apple. We'd also learn about the Fleet server's scalability for DEP enrollment (which needs real serials).

Martin: Fleet server delivering the DEP profile could be an issue. Have a hunch that this will be fine at 2,500 hosts.
Martin: If we were to watch a customer's instance, we could watch for slowdowns in HTTP requests (a sign the Fleet server is busy), memory usage of the Fleet instances, and whether load on the Fleet DB stays reasonable.

mna commented 1 year ago

@noahtalerman @georgekarrv I wasn't able to run that load test today as I've run into a Terraform/Docker issue (details in this thread: https://fleetdm.slack.com/archives/C019WG4GH0A/p1694615198703139) but the good news is that it's now fixed and I should be able to get through this test next Monday.

EDIT: fix is confirmed (I managed to set up the loadtest env), so it's looking good to get results on Monday.

mna commented 1 year ago

@noahtalerman @georgekarrv and anyone else interested in the results of this load test:

https://docs.google.com/document/d/1gVT995Bcaotd9TAIWy6izwPp7E54rkrEPmHyNlqVoXQ/edit?usp=sharing

noahtalerman commented 1 year ago

Mo: Maybe we test if 2,500 requests for profiles happen at once. This is because Apple will go tell the device to check in with Fleet to get a new profile (device => Fleet).

2,500 hosts are all online when you add a profile and they ask Fleet for the profile at the same time. What happens? Do the hosts get the right profiles?

This might have already been tested here: https://github.com/fleetdm/fleet/issues/11531
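The shape of that check can be sketched in Python. This is only an illustration of the test idea, not how the Fleet load test was run: `check_in` is a hypothetical in-process stand-in for a host asking the server for its profile, and the team assignment and profile names are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: which profile each team should receive.
TEAM_PROFILES = {"Ducks": "new-team-profile", "No team": "default-profile"}


def assigned_team(host_id: int) -> str:
    # Illustrative assignment: even host IDs belong to the "Ducks" team.
    return "Ducks" if host_id % 2 == 0 else "No team"


def check_in(host_id: int) -> tuple[int, str]:
    """Stand-in for one host asking the server for its profile."""
    return host_id, TEAM_PROFILES[assigned_team(host_id)]


def run_burst(num_hosts: int = 2500, workers: int = 64) -> dict[int, str]:
    """Send every check-in through a thread pool and collect the answers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(check_in, range(num_hosts)))


def mismatches(results: dict[int, str]) -> list[int]:
    """Return host IDs whose delivered profile differs from the expected one."""
    return [h for h, p in results.items() if p != TEAM_PROFILES[assigned_team(h)]]


if __name__ == "__main__":
    results = run_burst()
    print(len(results), "hosts checked in;", len(mismatches(results)), "mismatches")
```

In a real run, `check_in` would be replaced by an actual request to the server under test, and `mismatches` would compare against the profiles configured for each team.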

noahtalerman commented 1 year ago

Thanks @mna! I added a couple questions in the doc.

Overall, it looks like Fleet can handle the Fleet => Apple communication for 2,500 hosts.

Also, Apple servers didn't rate limit Fleet's requests (with osquery-perf).

I'm still curious about the hosts => Fleet communication (question I left in the doc).

Regarding DEP enrollment, I think we don't need to test to see if Fleet can handle delivering the DEP profile to 2,500 hosts all at once. This isn't a realistic problem. The DEP profile delivery will be spaced out because that's the nature of how organizations migrate.

Regarding doing another test with real Macs, I don't think we need to prioritize this. My understanding is that the untested pieces (do the profiles actually get delivered) are mostly in Apple's control (not ours).

I think we should publish these results. We can clarify the above.

What do you think?

mna commented 1 year ago

@noahtalerman replied in the doc. Here are the .mobileconfig files I used for the UI failure, and the second one is the one I also used in fleetctl and got the error (though it was still applied and saved). Note that I had to rename them to .txt to be able to upload them to github, but that extension was not there when I tried to upload them in Fleet, of course.

new-team-profile.mobileconfig.txt new-team-profile2.mobileconfig.txt

mna commented 1 year ago

@noahtalerman

I think we should publish these results. We can clarify the above.

Do you mean to make the Google Doc public, or to post it somewhere else? I can make it publicly accessible if you're ok with that, no problem.

noahtalerman commented 1 year ago

Here are the .mobileconfig files I used for the UI failure, and the second one is the one I also used in fleetctl and got the error (though it was still applied and saved). Note that I had to rename them to .txt to be able to upload them to github, but that extension was not there when I tried to upload them in Fleet, of course.

@sabrinabuckets when you wrap up testing for the release, can you please help us sanity check if there's a bug in Fleet? Linking to the configuration profiles in a comment here.

noahtalerman commented 1 year ago

@mna yes I think we should make it public. Before we make both public, can you please make sure we address all comments in the docs?

I tried to pull your load test Google doc into a summary in a separate doc here: https://docs.google.com/document/d/1Fqkb-dA3_bv7sNmR54MvYC5SlDmpxk6s4zS7JFo-q74/edit

My thinking is we send users (via somewhere in docs) to this^ document for a high level summary.

You'll see that I link to your document if users want a deep dive into how we conducted the load test.

Please let me know if you have any thoughts/feedback on my summary doc.

After this, I think we can call this issue done.

sabrinabuckets commented 1 year ago

@noahtalerman or @mna I'm unclear on what is being referenced in Noah's request to check if there's a bug in Fleet. What is the bug in question here?

mna commented 1 year ago

@noahtalerman I accepted your formatting changes and edits in the doc, and reviewed your summary doc (left one comment). I'll make my doc public so you're not stuck waiting for me to do it later this week. (publicly viewable link: https://docs.google.com/document/d/1gVT995Bcaotd9TAIWy6izwPp7E54rkrEPmHyNlqVoXQ/edit?usp=sharing)

@sabrinabuckets The bug is something I encountered while running the load test. It's been edited out of the latest version of the Google doc, but this is what I documented:

As part of "Add a custom profile to the Ducks team":
- Via the Web UI, it always failed with “Couldn’t upload. The file should be a .mobileconfig file” (even though the file is new-team-profile.mobileconfig and is indeed a valid plist mobileconfig file; this was in a Chrome browser on a Fedora laptop)
- Via fleetctl, got this output even though the profile did get applied:
```
$ ./build/fleetctl apply -context loadtest -f ~/Documents/FleetDM/11997-mdm-load-test/tmconfig-profiles.yml 
Error: applying custom settings for team "Ducks": POST /api/latest/fleet/mdm/apple/profiles/batch received status 422 Validation Failed: Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
```

The profiles in question are attached in this comment (I tried the first one in the UI only and it failed, and the second one in the UI and the CLI, and it failed in both though the CLI did apply the profile despite the error): https://github.com/fleetdm/fleet/issues/11997#issuecomment-1725439311
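The MySQL message in that output (error 1213, SQLSTATE 40001) itself suggests restarting the transaction. A minimal Python sketch of that general pattern follows. It is not Fleet's implementation and does not explain why the profile was applied despite the error: `DeadlockError` and `run_with_retry` are hypothetical names, and the database call is replaced by a plain callable.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for MySQL error 1213 (SQLSTATE 40001)."""


def run_with_retry(txn, max_attempts=3, base_delay=0.05, sleep=time.sleep):
    """Run txn(); on DeadlockError, back off and run the whole thing again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff with jitter so competing writers desynchronize.
            sleep(base_delay * 2 ** (attempt - 1) * (1 + random.random()))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_batch_apply():
        # Simulated transaction that deadlocks twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise DeadlockError("Deadlock found when trying to get lock")
        return "applied"

    outcome = run_with_retry(flaky_batch_apply, sleep=lambda _: None)
    print(outcome, "after", calls["n"], "attempts")
```

The callable passed in should contain the entire transaction, since a deadlock victim is rolled back and has to start over from the beginning.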

sabrinabuckets commented 1 year ago

@mna I can reproduce that error with both of your test profiles & some of my own, on both Linux and Windows. However, on macOS I am able to upload your .mobileconfig files without issue.

mna commented 1 year ago

@sabrinabuckets Waah yeah this is definitely a bug then. Very weird that those validations differ based on platform!

sabrinabuckets commented 1 year ago

@mna very strange. Do you need me to file the bug or do any additional validations?

mna commented 1 year ago

@sabrinabuckets yeah if you can create the ticket that'd be great, let me know if you want me to add any details, but at this point I think you have more than me. Did you also run into the fleetctl issues, or just the web UI one?

sabrinabuckets commented 1 year ago

@mna Just the UI. My Fleet server is running on my MacBook, so I didn't have fleetctl configured elsewhere. I'll get the bug filed and then figure out how to test that.

noahtalerman commented 1 year ago

I'll make my doc public so you're not stuck waiting for me to do it later this week. (publicly viewable link: https://docs.google.com/document/d/1gVT995Bcaotd9TAIWy6izwPp7E54rkrEPmHyNlqVoXQ/edit?usp=sharing)

@mna can you please make this editable by everyone @fleetdm.com? Thanks :)

mna commented 1 year ago

@noahtalerman I think this should be done now, let me know if it works (wasn't super obvious how to have different sets of access for public and fleetdm.com!)

noahtalerman commented 1 year ago

@mna it works! Thanks

noahtalerman commented 1 year ago

The issue description was updated to link to the public Google doc with results from the load test.

The customer was notified in Slack here (internal).

ireedy commented 1 year ago

C&C: confirmed!

fleet-release commented 1 year ago

Load test, sure and swift, Mac hosts thrive with profiles, Confidence uplift.