fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Load test sending MDM profiles to 2,500 macOS hosts #11997

Closed zhumo closed 1 year ago

zhumo commented 1 year ago

UPDATE: Results of the load test are in this public Google doc here (noahtalerman 2023-09-29).

Goal

User story
As an IT admin,
I want to be confident that my Fleet instance will be able to handle sending configuration profiles to 2,500 Mac hosts
so that I can roll out Fleet's MDM features with confidence that the migration process will be smooth.

Requirements

NOTE: It's expensive (literally) to test whether 2,500 devices actually get the push notification from APNs. In this pass, we think it's ok to rely on Apple servers actually sending the notification. This test will make sure Fleet is able to tell Apple servers to send the right commands.

Changes

This story doesn't include any changes to the Fleet product.

QA

Risk assessment

Risk level: Low / High TODO

Risk description: TODO

Automated:

Manual testing steps

  1. Ensure macOS MDM is turned on and configured for the test environment
  2. Enroll 2500 macOS hosts into the test environment and assign to the same team
  3. Install the MDM enrollment profile on all hosts and verify that their MDM status is reflected correctly in the Fleet test environment (assuming these are all manual enrollments, the status should be On (manual))
  4. In Controls > macOS settings > Custom settings, upload any valid .mobileconfig file to the team to which the test hosts are assigned
  5. Validate that the profiles are deployed successfully via both the Fleet UI and via live query
  6. Validate that no rate limit appears to have been exceeded from APNS
  7. Investigate any errors to determine if they were related to the load test or can be isolated

Testing notes

Confirmation

  1. [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. [ ] QA (@____): Added comment to user story confirming successful completion of QA.
lukeheath commented 1 year ago

@zhumo We already load-tested 2,500 hosts with MDM in the last sprint in preparation for customer migrations with osquery-perf. Is this a duplicate, or do we need to run another load test with different parameters?

zhumo commented 1 year ago

Hey @lukeheath, it looks like the original intent of this issue was obscured through various changes. I've updated it. My understanding is that we previously tested delivering profiles at scale. But I learned that in other deployments, it was the rapid onboarding of hosts that made the MDM instance fall over. So I think this issue was meant to account for that.

@dherder are there certain parameters you might recommend for testing?

dherder commented 1 year ago

When considering what to load test in an enrollment scenario, lots of things can fall apart:

Here's a thread where this same thing happened: https://community.jamf.com/t5/jamf-pro/enrollments-not-completing-today/m-p/230571

lukeheath commented 1 year ago

Thanks for the info @dherder

@lucasmrod When you ran the previous MDM load tests, would it have load-tested enrollment or only sending profiles? Do you know if sending these profiles to osquery-perf hosts triggered APN calls?

lucasmrod commented 1 year ago

When you ran the previous MDM load tests, would it have load-tested enrollment or only sending profiles?

Yes. It did load-test MDM enrollment, which consists of SCEP enroll + Authenticate message + TokenUpdate message.

Do you know if sending these profiles to osquery-perf hosts triggered APN calls?

No. I explicitly ran Fleet with FLEET_DEV_MDM_APPLE_DISABLE_PUSH=1 because we are using simulated MDM devices with fake UUIDs and fake APNS Tokens. (I didn't want Apple to denylist or flag our account because of so many APNS requests with invalid tokens.)
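For readers unfamiliar with the check-in messages mentioned above, here is a minimal sketch of what a simulated device might send for the Authenticate and TokenUpdate steps. It is not the osquery-perf implementation: the function names and the topic value in the usage note are hypothetical, the SCEP step is omitted, and only the `MessageType`, `UDID`, `Topic`, `Token`, and `PushMagic` keys from Apple's MDM check-in protocol are shown.

```python
import os
import plistlib
import uuid


def authenticate_message(udid: str, topic: str) -> bytes:
    """Build an MDM Authenticate check-in message as an XML plist."""
    return plistlib.dumps(
        {"MessageType": "Authenticate", "UDID": udid, "Topic": topic}
    )


def token_update_message(udid: str, topic: str, token: bytes, push_magic: str) -> bytes:
    """Build an MDM TokenUpdate check-in message as an XML plist."""
    return plistlib.dumps(
        {
            "MessageType": "TokenUpdate",
            "UDID": udid,
            "Topic": topic,
            "Token": token,  # bytes are serialized as a <data> element
            "PushMagic": push_magic,
        }
    )


def simulated_checkins(topic: str) -> list:
    """Return the Authenticate and TokenUpdate messages for one fake device."""
    udid = str(uuid.uuid4()).upper()  # fake device identifier (assumption)
    token = os.urandom(32)            # fake APNs token, not valid at Apple
    push_magic = str(uuid.uuid4())    # fake PushMagic string (assumption)
    return [
        authenticate_message(udid, topic),
        token_update_message(udid, topic, token, push_magic),
    ]
```

Calling `simulated_checkins("com.apple.mgmt.External.example")` yields two plist payloads for one fake device. Because the token is random bytes, any push request built from it would be rejected by APNs, which is the reason given above for disabling push during the test.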

zhumo commented 1 year ago

Hi @noahtalerman, this story did not make it into the current sprint, so I'm de-prioritizing it. Please bring it back to FF if necessary.

noahtalerman commented 1 year ago

Hey @roperzh I updated this story's description based on our conversation. When you get the chance, can you please review the description?

Please let me know if I'm missing anything. We want to bring this to estimation tomorrow.

cc @georgekarrv

roperzh commented 1 year ago

@noahtalerman @georgekarrv the issue description looks great 👏

noahtalerman commented 1 year ago

Alright! @georgekarrv I moved this story over to the designed column and assigned you.

georgekarrv commented 1 year ago

Hey team! Please add your planning poker estimate with Zenhub @ghernandez345 @gillespi314 @marcosd4h @roperzh

lukeheath commented 1 year ago

@sabrinabuckets Would you please populate the QA section of this story before it comes into a release? Thank you!

noahtalerman commented 1 year ago

Hey @zhumo this didn't make it into the upcoming sprint. I added this back to FF because we should get to this at the start of next sprint to give us time to react to any issues.

georgekarrv commented 1 year ago

previously related load test w/o APNS interaction https://github.com/fleetdm/confidential/issues/2644

noahtalerman commented 1 year ago

Noah and Martin: If we did this, we would get some level of confidence about rate limiting with bogus claimer. We've already tested load on Fleet server with 2,500 hosts.

Noah and Martin: If/when we do a test with 2,500 real Macs, we won't learn anything new about the load on the Fleet server for manual enrollment and installing profiles. We would get a higher level of confidence about rate limiting. We'd also learn about the Fleet server's scalability for DEP enrollment (need real serials) and rate limiting by Apple.

mna commented 1 year ago

@noahtalerman @georgekarrv I wasn't able to run that load test today as I've run into a Terraform/Docker issue (details in this thread: https://fleetdm.slack.com/archives/C019WG4GH0A/p1694615198703139) but the good news is that it's now fixed and I should be able to get through this test next Monday.

EDIT: fix is confirmed (I managed to set up the loadtest env), so it's looking good to get results on Monday.

mna commented 1 year ago

@noahtalerman @georgekarrv and anyone else interested in the results of this load test:

https://docs.google.com/document/d/1gVT995Bcaotd9TAIWy6izwPp7E54rkrEPmHyNlqVoXQ/edit?usp=sharing

noahtalerman commented 1 year ago

Mo: Maybe we test if 2,500 requests for profiles happen at once. This is because Apple will go tell the device to check in with Fleet to get a new profile (device => Fleet).

2,500 hosts are all online when you add a profile and they all ask Fleet for the profile at the same time. What happens? Do the hosts get the right profiles?
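One way to frame that question as a check is sketched below. This is only an illustration, not part of the load test that was run: `burst_check` and its `fetch` callable are hypothetical names, and `fetch` stands in for whatever request a host would make to Fleet. A barrier releases every simulated host at the same moment, and the function reports which hosts received something other than the profile they were expected to get.

```python
import threading
from typing import Callable, Dict, List


def burst_check(expected: Dict[str, str], fetch: Callable[[str], str]) -> List[str]:
    """Fetch profiles for all hosts at once and return hosts with wrong results.

    expected maps a host ID to the profile identifier it should receive.
    fetch is a stand-in (assumption) for the host's request to the server.
    """
    if not expected:
        return []

    barrier = threading.Barrier(len(expected))  # releases all hosts together
    mismatched: List[str] = []
    lock = threading.Lock()

    def worker(host_id: str) -> None:
        barrier.wait()  # block until every simulated host is ready
        try:
            got = fetch(host_id)
        except Exception:
            got = None  # a failed request counts as a mismatch
        if got != expected[host_id]:
            with lock:
                mismatched.append(host_id)

    threads = [threading.Thread(target=worker, args=(h,)) for h in expected]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(mismatched)
```

An empty return value means every host got its expected profile under the burst. Running one thread per host is simple but heavy at 2,500 hosts, so a real harness would more likely use an async client or a dedicated load tool.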

noahtalerman commented 1 year ago

Thanks @mna! I added a couple questions in the doc.

Overall, it looks like Fleet can handle the Fleet => Apple communication for 2,500 hosts.

Also, Apple servers didn't rate limit Fleet's requests (w/ osquery perf)

I'm still curious about the hosts => Fleet communication (question I left in the doc).

Regarding DEP enrollment, I think we don't need to test to see if Fleet can handle delivering the DEP profile to 2,500 hosts all at once. This isn't a realistic problem. The DEP profile delivery will be spaced out because that's the nature of how organizations migrate.

Regarding doing another test with real Macs, I don't think we need to prioritize this. My understanding is that the untested pieces (do the profiles actually get delivered) are mostly in Apple's control (not ours).

I think we should publish these results. We can clarify the above.

What do you think?

Thoughts?

mna commented 1 year ago

@noahtalerman replied in the doc. Here are the .mobileconfig files I used for the UI failure, and the second one is the one I also used in fleetctl and got the error (though it was still applied and saved). Note that I had to rename them to .txt to be able to upload them to github, but that extension was not there when I tried to upload them in Fleet, of course.

new-team-profile.mobileconfig.txt new-team-profile2.mobileconfig.txt

mna commented 1 year ago

@noahtalerman

I think we should publish these results. We can clarify the above.

Do you mean to make the Google Doc public, or to post it somewhere else? I can make it publicly accessible if you're ok with that, no problem.

noahtalerman commented 1 year ago

Here are the .mobileconfig files I used for the UI failure, and the second one is the one I also used in fleetctl and got the error (though it was still applied and saved). Note that I had to rename them to .txt to be able to upload them to github, but that extension was not there when I tried to upload them in Fleet, of course.

@sabrinabuckets when you wrap up testing for the release, can you please help us sanity check if there's a bug in Fleet? Linking to the configuration profiles in a comment here.

noahtalerman commented 1 year ago

@mna yes I think we should make it public. Before we make both public, can you please make sure we address all comments in the docs.

I tried to pull your load test Google doc into a summary in a separate doc here: https://docs.google.com/document/d/1Fqkb-dA3_bv7sNmR54MvYC5SlDmpxk6s4zS7JFo-q74/edit

My thinking is we send users (via somewhere in docs) to this^ document for a high level summary.

You'll see that I link to your document if users want a deep dive into how we conducted the load test.

Please let me know if you have any thoughts/feedback on my summary doc.

After this, I think we can call this issue done.

sabrinabuckets commented 1 year ago

@noahtalerman or @mna I'm unclear what is being referenced in Noah's request to "check if there's a bug in Fleet." What is the bug in question here?

mna commented 1 year ago

@noahtalerman I accepted your formatting changes and edits in the doc, and reviewed your summary doc (left one comment). I'll make my doc public so you're not stuck waiting for me to do it later this week. (publicly viewable link: https://docs.google.com/document/d/1gVT995Bcaotd9TAIWy6izwPp7E54rkrEPmHyNlqVoXQ/edit?usp=sharing)

@sabrinabuckets The bug is something I encountered while running the load test, it's been edited out of the latest version of the Google doc but this is what I documented:



The profiles in question are attached in this comment (I tried the first one in the UI only and it failed, and the second one in the UI and the CLI, and it failed in both though the CLI did apply the profile despite the error): https://github.com/fleetdm/fleet/issues/11997#issuecomment-1725439311

sabrinabuckets commented 1 year ago

@mna I can reproduce that error with both of your test profiles & some of my own, on both Linux and Windows. However, on macOS I am able to upload your .mobileconfig files without issue.

mna commented 1 year ago

@sabrinabuckets Waah yeah this is definitely a bug then. Very weird that those validations differ based on platform!

sabrinabuckets commented 1 year ago

@mna very strange. Do you need me to file the bug or do any additional validations?

mna commented 1 year ago

@sabrinabuckets yeah if you can create the ticket that'd be great, let me know if you want me to add any details, but at this point I think you have more than me. Did you also run into the fleetctl issues, or just the web UI one?

sabrinabuckets commented 1 year ago

@mna Just the UI. My Fleet server is running on my MacBook, so I didn't have fleetctl configured elsewhere. I'll get the bug filed and then figure out how to test that.

noahtalerman commented 1 year ago

I'll make my doc public so you're not stuck waiting for me to do it later this week. (publicly viewable link: https://docs.google.com/document/d/1gVT995Bcaotd9TAIWy6izwPp7E54rkrEPmHyNlqVoXQ/edit?usp=sharing)

@mna can you please make this editable by everyone @fleetdm.com? Thanks :)

mna commented 1 year ago

@noahtalerman I think this should be done now, let me know if it works (wasn't super obvious how to have different sets of access for public and fleetdm.com!)

noahtalerman commented 1 year ago

@mna it works! Thanks

noahtalerman commented 1 year ago

The issue description was updated to link to the public Google doc with results from the load test.

The customer was notified in Slack here (internal).

ireedy commented 1 year ago

C&C: confirmed!

fleet-release commented 1 year ago

Load test, sure and swift,
Mac hosts thrive with profiles,
Confidence uplift.