Closed zhumo closed 1 year ago
@zhumo We already load-tested 2,500 hosts with MDM in the last sprint in preparation for customer migrations with osquery-perf. Is this a duplicate, or do we need to run another load test with different parameters?
Hey @lukeheath, It looks like the original intent of this issue was obscured through various changes. I've updated it. My understanding of this is that we previously tested for delivering profiles at scale. But I learned that in other deployments, it was the rapid onboarding of hosts that made the MDM instance fall over. So I think this issue was to try to account for that.
@dherder are there certain parameters you might recommend for testing?
When considering what to load test in an enrollment scenario, lots of things can fall apart:
Here's a thread where this same thing happened: https://community.jamf.com/t5/jamf-pro/enrollments-not-completing-today/m-p/230571
Thanks for the info @dherder
@lucasmrod When you ran the previous MDM load tests, would it have load-tested enrollment or only sending profiles? Do you know if sending these profiles to osquery-perf hosts triggered APN calls?
When you ran the previous MDM load tests, would it have load-tested enrollment or only sending profiles?
Yes. It did load-test MDM enrollment, which consists of SCEP enroll + Authenticate message + TokenUpdate message.
Do you know if sending these profiles to osquery-perf hosts triggered APN calls?
No. I explicitly ran Fleet with FLEET_DEV_MDM_APPLE_DISABLE_PUSH=1 because we are using simulated MDM devices with fake UUIDs and fake APNS Tokens. (I didn't want Apple to denylist or flag our account because of so many APNS requests with invalid tokens.)
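For readers unfamiliar with the sequence described above, here is a minimal sketch of the two check-in messages a simulated device would send after the SCEP step. This is not the osquery-perf code: the function name, the topic value, and the random token are assumptions for illustration, and the SCEP enrollment itself is omitted.

```python
import os
import plistlib
import uuid


def simulated_checkin_messages(topic):
    """Build Authenticate and TokenUpdate check-in plists for one fake device.

    Hypothetical helper: the UDID and token are random, which is exactly why
    push notifications would need to be disabled when using devices like this.
    """
    udid = str(uuid.uuid4()).upper()  # fake device identifier
    authenticate = {
        "MessageType": "Authenticate",
        "Topic": topic,
        "UDID": udid,
    }
    token_update = {
        "MessageType": "TokenUpdate",
        "Topic": topic,
        "UDID": udid,
        "Token": os.urandom(32),  # fake push token, not issued by Apple
        "PushMagic": str(uuid.uuid4()),
    }
    # Messages are returned in the order a device would send them.
    return [plistlib.dumps(authenticate), plistlib.dumps(token_update)]
```

Because the token is random bytes, any push request built from it would be invalid on Apple's side, which matches the reason given for disabling push in that run.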
Hi @noahtalerman , this story did not make it into the current sprint, so I'm de-prioritizing it. Please bring it back to FF if necessary.
Hey @roperzh I updated this story's description based on our conversation. When you get the chance, can you please review the description?
Please let me know if I'm missing anything. We want to bring this to estimation tomorrow.
cc @georgekarrv
@noahtalerman @georgekarrv the issue description looks great 👏
Alright! @georgekarrv I moved this story over to the designed column and assigned you.
Hey team! Please add your planning poker estimate with Zenhub @ghernandez345 @gillespi314 @marcosd4h @roperzh
@sabrinabuckets Would you please populate the QA section of this story before it comes into a release? Thank you!
Hey @zhumo this didn't make it into the upcoming sprint. I added this back to FF because we should get to this at the start of next sprint to give us time to react to any issues.
previously related load test w/o APNS interaction https://github.com/fleetdm/confidential/issues/2644
Noah and Martin: If we did this, we would get some level of confidence about rate limiting with bogus claimer. We've already tested load on Fleet server with 2,500 hosts.
Noah and Martin: If/when we do a test with 2,500 real Macs, we won't learn anything new about the load on the Fleet server for manual enrollment and installing profiles. We would get a higher level of confidence for the rate limiting. We'd also learn about Fleet server's scalability for DEP enrollment (need real serials) and rate limiting by Apple.
@noahtalerman @georgekarrv I wasn't able to run that load test today as I've run into a Terraform/Docker issue (details in this thread: https://fleetdm.slack.com/archives/C019WG4GH0A/p1694615198703139) but the good news is that it's now fixed and I should be able to get through this test next Monday.
EDIT: fix is confirmed (I managed to set up the loadtest env), so it's looking good to get results on Monday.
@noahtalerman @georgekarrv and anyone else interested in the results of this load test:
https://docs.google.com/document/d/1gVT995Bcaotd9TAIWy6izwPp7E54rkrEPmHyNlqVoXQ/edit?usp=sharing
Mo: Maybe we test if 2,500 requests for profiles happen at once. This is because Apple will go tell the device to check in with Fleet to get a new profile (device => Fleet).
2,500 hosts are all online when you add a profile, and they all ask Fleet for the profile at the same time. What happens? Do the hosts get the right profiles?
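One way to frame that question as a check, sketched under assumptions: `fetch` is a hypothetical callable standing in for "host asks Fleet for its profile", and `expected` is a hypothetical mapping from host ID to the profile it should receive. A thread pool only approximates "at the same time", so this shows the shape of the check rather than a real load test.

```python
from concurrent.futures import ThreadPoolExecutor


def hosts_with_wrong_profile(fetch, expected, max_workers=64):
    """Call fetch(host_id) for every host concurrently and return the
    host IDs whose response differs from the expected profile name."""
    host_ids = list(expected)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with host_ids.
        received = list(pool.map(fetch, host_ids))
    return [h for h, got in zip(host_ids, received) if got != expected[h]]


# Hypothetical data: 2,500 hosts split across two made-up profile names.
expected = {
    f"host-{i}": ("profile-a" if i % 2 == 0 else "profile-b")
    for i in range(2500)
}


def fake_fetch(host_id):
    """In-process stand-in for the server; always answers correctly."""
    return expected[host_id]
```

An empty list from `hosts_with_wrong_profile(fake_fetch, expected)` would mean every host received the profile it was supposed to get.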
Thanks @mna! I added a couple questions in the doc.
Overall, it looks like Fleet can handle the Fleet => Apple communication for 2,500 hosts.
Also, Apple servers didn't rate limit Fleet's requests (w/ osquery-perf).
I'm still curious about the hosts => Fleet communication (question I left in the doc).
Regarding DEP enrollment, I think we don't need to test to see if Fleet can handle delivering the DEP profile to 2,500 hosts all at once. This isn't a realistic problem. The DEP profile delivery will be spaced out because that's the nature of how organizations migrate.
Regarding doing another test with real Macs, I don't think we need to prioritize this. My understanding is that the untested pieces (do the profiles actually get delivered) are mostly in Apple's control (not ours).
I think we should publish these results. We can clarify the above.
What do you think?
Thoughts?
@noahtalerman replied in the doc. Here are the .mobileconfig files I used for the UI failure, and the second one is the one I also used in fleetctl and got the error (though it was still applied and saved). Note that I had to rename them to .txt to be able to upload them to github, but that extension was not there when I tried to upload them in Fleet, of course.
new-team-profile.mobileconfig.txt new-team-profile2.mobileconfig.txt
@noahtalerman
I think we should publish these results. We can clarify the above.
Do you mean to make the Google Doc public, or to post it somewhere else? I can make it publicly accessible if you're ok with that, no problem.
Here are the .mobileconfig files I used for the UI failure, and the second one is the one I also used in fleetctl and got the error (though it was still applied and saved). Note that I had to rename them to .txt to be able to upload them to github, but that extension was not there when I tried to upload them in Fleet, of course.
@sabrinabuckets when you wrap up testing for the release, can you please help us sanity check if there's a bug in Fleet? Linking to the configuration profiles in a comment here.
@mna yes I think we should make it public. Before we make both public, can you please make sure we address all comments in the docs.
I tried to pull your load test Google doc into a summary in a separate doc here: https://docs.google.com/document/d/1Fqkb-dA3_bv7sNmR54MvYC5SlDmpxk6s4zS7JFo-q74/edit
My thinking is we send users (via somewhere in docs) to this^ document for a high level summary.
You'll see that I link to your document if users want a deep dive into how we conducted the load test.
Please let me know if you have any thoughts/feedback on my summary doc.
After this, I think we can call this issue done.
@noahtalerman or @mna I'm unclear what is being referenced in Noah's request to "check if there's a bug in Fleet". What is the bug in question here?
@noahtalerman I accepted your formatting changes and edits in the doc, and reviewed your summary doc (left one comment). I'll make my doc public so you're not stuck waiting for me to do it later this week. (publicly viewable link: https://docs.google.com/document/d/1gVT995Bcaotd9TAIWy6izwPp7E54rkrEPmHyNlqVoXQ/edit?usp=sharing)
@sabrinabuckets The bug is something I encountered while running the load test, it's been edited out of the latest version of the Google doc but this is what I documented:
new-team-profile.mobileconfig and is indeed valid plist mobile config file - this was on a Chrome browser on a Fedora laptop)
fleetctl, got this output even though the profile did get applied:
$ ./build/fleetctl apply -context loadtest -f ~/Documents/FleetDM/11997-mdm-load-test/tmconfig-profiles.yml
Error: applying custom settings for team "Ducks": POST /api/latest/fleet/mdm/apple/profiles/batch received status 422 Validation Failed: Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
The profiles in question are attached in this comment (I tried the first one in the UI only and it failed, and the second one in the UI and the CLI, and it failed in both though the CLI did apply the profile despite the error): https://github.com/fleetdm/fleet/issues/11997#issuecomment-1725439311
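The MySQL message in that output (error 1213) itself suggests restarting the transaction. A common general pattern for that is a bounded retry with backoff, sketched below. This is not Fleet's code or its fix: `DeadlockError` and `run_with_deadlock_retry` are hypothetical names, and a real implementation would need to detect the driver's actual error for code 1213.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a driver error carrying MySQL code 1213."""


def run_with_deadlock_retry(txn, attempts=3, base_delay=0.05, sleep=time.sleep):
    """Run txn(); if it raises DeadlockError, wait and rerun the whole
    transaction, giving up after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except DeadlockError:
            if attempt == attempts:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff with jitter so competing writers
            # are less likely to collide again on the next try.
            sleep(base_delay * 2 ** (attempt - 1) * (1 + random.random()))
```

The `sleep` parameter is injectable only so the wrapper can be exercised without real delays.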
@mna I can reproduce that error with both of your test profiles & some of my own, on both Linux and Windows. However, on macOS I am able to upload your .mobileconfig files without issue.
@sabrinabuckets Waah yeah this is definitely a bug then. Very weird that those validations differ based on platform!
@mna very strange. Do you need me to file the bug or do any additional validations?
@sabrinabuckets yeah if you can create the ticket that'd be great, let me know if you want me to add any details, but at this point I think you have more than me. Did you also run into the fleetctl issues, or just the web UI one?
@mna Just the UI. My Fleet server is running on my MacBook, so I didn't have fleetctl configured elsewhere. I'll get the bug filed and then figure out how to test that.
I'll make my doc public so you're not stuck waiting for me to do it later this week. (publicly viewable link: https://docs.google.com/document/d/1gVT995Bcaotd9TAIWy6izwPp7E54rkrEPmHyNlqVoXQ/edit?usp=sharing)
@mna can you please make this editable by everyone @fleetdm.com? Thanks :)
@noahtalerman I think this should be done now, let me know if it works (wasn't super obvious how to have different sets of access for public and fleetdm.com!)
@mna it works! Thanks
The issue description was updated to link to the public Google doc with results from the load test.
The customer was notified in Slack here (internal).
C&C: confirmed!
Load test, sure and swift, Mac hosts thrive with profiles, Confidence uplift.
UPDATE: Results of the load test are in this public Google doc here (noahtalerman 2023-09-29).
Goal
Requirements
osquery-perf
Fleet to turn on hitting Apple servers (actually it's to turn off sending push notifications, by default it will send them - that flag is FLEET_DEV_MDM_APPLE_DISABLE_PUSH)
NOTE: It's expensive (literally) to test whether 2,500 devices actually get the push notification from APNs. In this pass, we think it's ok to rely on Apple servers actually sending the notification. This test will make sure Fleet is able to tell Apple servers to send the right commands.
Changes
This story doesn't include any changes to the Fleet product.
QA
Risk assessment
Risk level: Low / High TODO
Risk description: TODO
Automated:
Manual testing steps
On (manual)
.mobileconfig file to the team to which the test hosts are assigned
Testing notes
Confirmation