Process for self-hosted macOS MDM migration to Fleet

zwass commented 1 month ago

This issue documents the approach for a staged migration from a self-hosted MDM to Fleet.

Here is a visual representation of the approach:

Before Migration

flowchart LR
subgraph macOS Device
  mdmclient[MDM client]
end
mdmclient -- Routed by DNS <br> (mdm.example.com)-->oldmdm
oldmdm[Existing Self-Hosted MDM Server]
mdmclient ~~~ fleet
fleet[Fleet Server]

During Migration

Change DNS to route to the Migration Proxy.

flowchart LR
subgraph macOS Device
  mdmclient[MDM client]
end
mdmclient-- Routed by DNS <br> (mdm.example.com) -->proxy
subgraph proxy[Migration Proxy]
  migrated{Get UUID from request. <br> Check against configuration. <br> Migrated yet?}
end
migrated -- No <br> Routed by IP (or LB DNS) --> oldmdm
migrated -- Yes <br> Routed by DNS <br> (example.cloud.fleetdm.com)--> fleet
oldmdm[Existing Self-Hosted MDM Server]
fleet[Fleet Server]

After Migration

Change DNS to route to the Fleet Server.

flowchart LR
subgraph macOS Device
  mdmclient[MDM client]
end
mdmclient ~~~ oldmdm
mdmclient -- Routed by DNS <br> (mdm.example.com) -->fleet
oldmdm["`Existing Self-Hosted MDM Server
  (can now be removed)`"]
fleet[Fleet Server]

Assumptions & Prerequisites

The existing (self-hosted) MDM server ("Existing Server") is currently routed by DNS using the FQDN mdm.example.com ("Existing DNS").
The Fleet Server ("Fleet Server") is currently routed by DNS using the FQDN example.cloud.fleetdm.com ("Fleet DNS").
The Migration Proxy ("Migration Proxy") is configured with the Existing Server (by IP address or load balancer DNS name) and the Fleet Server (by Fleet DNS) as targets.
The Fleet Server is configured with the secrets, certificates, and keys ("Existing Secrets") from the Existing Server (APNs cert/key, SCEP cert/key, TODO: more details).
Customer/user ("Customer") has control of Existing DNS to add and modify records.

Before migration

Test that new enrollments work with the Fleet Server using the Fleet DNS (e.g. enroll new devices on example.cloud.fleetdm.com). This confirms that the Fleet Server configured with the Existing Secrets works as expected.
Some basic validation should be done on this before beginning the migration, but further testing can take place in parallel with the migration steps.
Issue an SSL certificate (e.g. with AWS) for the FQDN of the Existing Server (mdm.example.com). In AWS, Fleet can request the certificate and the Customer can then configure DNS verification.

Migration

Activate the Migration Proxy

Set the Migration Proxy to route all requests to the Existing Server (this is the default configuration).
Cut over Existing DNS to point to the Migration Proxy.
Verify that managed devices continue to successfully check in to the Existing Server.

Migrate devices

Populate the Fleet Server with the most recent device data extracted from the Existing Server (TODO: more on how to extract this data).
Configure the Migration Proxy to migrate a single device by UUID.
Verify that the device is managed successfully by the Fleet Server.
Continue to migrate and verify a small set of devices by UUID.
Configure a percentage of devices to migrate (start perhaps with 5%).
Increase percentage of migrated devices until reaching 100%.
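
The issue does not say how the proxy picks which devices fall inside the migration percentage. One way to do it, sketched below under that caveat, is to hash each UDID into a stable bucket from 0 to 99 and migrate a device when its bucket is below the configured percentage, or when it appears in the explicit list. The function names (`parseUDIDs`, `bucket`, `shouldMigrate`) and the choice of FNV hashing are assumptions for illustration, not the proxy's actual implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// parseUDIDs turns a comma-delimited list (the format accepted by a flag such
// as -migrate-udids) into a set, ignoring empty entries.
func parseUDIDs(list string) map[string]bool {
	set := make(map[string]bool)
	for _, u := range strings.Split(list, ",") {
		if u = strings.TrimSpace(u); u != "" {
			set[u] = true
		}
	}
	return set
}

// bucket maps a UDID to a stable value in [0, 100).
// Assumption: FNV-1a is used only because it is in the standard library.
func bucket(udid string) int {
	h := fnv.New32a()
	h.Write([]byte(udid))
	return int(h.Sum32() % 100)
}

// shouldMigrate (hypothetical) reports whether a device should be routed to
// Fleet: always for listed UDIDs, otherwise when its bucket is below percentage.
func shouldMigrate(udid string, always map[string]bool, percentage int) bool {
	if always[udid] {
		return true
	}
	return bucket(udid) < percentage
}

func main() {
	always := parseUDIDs("UDID-A, UDID-B")
	for _, u := range []string{"UDID-A", "UDID-C"} {
		fmt.Printf("%s routed to Fleet at 5%%: %v\n", u, shouldMigrate(u, always, 5))
	}
}
```

With this approach a device that is migrated at 5% stays migrated at every higher percentage, so raising the value only ever adds devices. Because the decision depends only on the UDID and the configuration, it also needs no shared state between proxy instances.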

Deactivate the Migration Proxy

At this point, all devices have been migrated and are communicating with the Fleet Server through the Migration Proxy, still using the Existing DNS.
Cut over Existing DNS to point to the Fleet Server. This removes the Migration Proxy from the path of MDM requests.

After migration

Save a final database backup and any other relevant state from the Existing Server.
The Existing Server can now be taken down.
The Migration Proxy can now be taken down.

noahtalerman commented 2 weeks ago

Hey @zwass can you please bring these wireframes (diagrams in the issue description) to the next MDM design review?

I noticed the draft PR you opened here: #19779

At Fleet, any change to the product is wireframed and goes through a design review before we start writing code: https://fleetdm.com/handbook/company/why-this-way#why-do-we-use-a-wireframe-first-approach

The next MDM design review is tomorrow @ 11:30a EST. If a review needs to happen sooner please schedule some time w/ me. Thanks!

zwass commented 2 weeks ago

@noahtalerman there is no UI to wireframe and this isn't going into the product. How/what would you like to review? I can bring something next week after MDOYVR.

lukeheath commented 2 weeks ago

@zwass Thanks for putting this together; it looks great! I'm labeling this as an engineering-initiated story and placing on the MDM release board for tracking. I put a ballpark estimate of 8 on this (~one week) but feel free to change if that's inaccurate.

A couple of questions:

Will we run the proxy server in the same AWS instance running the cloud server?
- If so, we should ask @rfairburn to add a Terraform definition for the proxy server so all of the infra is defined as code and stored with the rest of the customer's Terraform.
Does each host only need to go through the proxy server once to be migrated?
- I'm wondering how long we expect to support the proxy server before we can safely take it offline. (cc @noahtalerman)

Thanks!

zwass commented 2 weeks ago

This was developed last week in ~3 days so I am changing the estimation label to 5.

Will we run the proxy server in the same AWS instance running the cloud server?

By instance, I'm guessing you mean account? It doesn't really matter where the proxy is run, but for speed I'm currently running it in the solutions consulting AWS account.

Does each host only need to go through the proxy server once to be migrated?

No, all the hosts go through the proxy as long as DNS points to it. Once the migration is complete, DNS can be pointed directly to the Fleet instance and the proxy can be taken down.

lukeheath commented 2 weeks ago

@rfairburn When we run this in production, we'll want to run it in the appropriate customer's AWS account. Any infrastructure necessary to support it should be defined as Terraform (which I assume is the default anyway) so we can track this as a unique part of the customer's Cloud infrastructure, but not intended as a long-lived general use service.

zwass commented 2 weeks ago

@lukeheath @rfairburn if that needs to be done via Terraform please write that Terraform ASAP (or let's pull in @zayhanlon to better understand timelines) so we don't end up holding up the production migration with that.

rfairburn commented 2 weeks ago

Is the current proxy dockerized? Assuming this will live in our cloud, I could stub out an ACM certificate / ALB / ECS service fairly quickly to handle all of the pieces. To save time on this one, I wouldn't make a Terraform module, just individual resources. However, if this becomes a recurring pattern, I could make a portable module as well.

Specifics I would need:

are we currently dockerized? If not, do we need me to build an image with the binary and push it to an ecr repo?
env vars passed to proxy (how I will have to configure settings)?
Do I need specialized IAM permissions for any assume-role type tasks?
proxy scaling. Does it support parallel instances? How many containers would we need in the service relative to the number of fleet containers?

I can get something like this worked out fairly quickly, but I haven't seen any of the moving pieces in action.

zwass commented 2 weeks ago

are we currently dockerized? If not, do we need me to build an image with the binary and push it to an ecr repo?

Not yet. Could push it to ECR or Docker Hub.

env vars passed to proxy (how I will have to configure settings)?

Currently flags only as that is the easiest thing to support in Go.

Usage of ./mdmproxy:
  -auth-token string
        Auth token for remote flag updates (remote updates disabled if not provided)
  -existing-hostname string
        Hostname for existing MDM server (eg. 'mdm.example.com') (required)
  -existing-url string
        Existing MDM server URL (full path) (required)
  -fleet-url string
        Fleet MDM server URL (full path) (required)
  -migrate-percentage int
        Percentage of clients to migrate from existing MDM to Fleet
  -migrate-udids string
        Comma-delimited list of UDIDs to migrate always
  -server-address string
        Address for server to listen on (default ":8080")

Here's an example of how I'm invoking it:

./mdmproxy --migrate-udids '' --auth-token foo --existing-url https://3.134.193.249 --existing-hostname micromdm.daveherder.com --fleet-url https://migration-test.cloud.fleetdm.com --migrate-percentage 0

Do I need specialized IAM permissions for any assume-role type tasks?

No, I don't think so. It just needs to be a server listening over HTTP (exposed via LB as HTTPS).

Does it support parallel instances? How many containers would we need in the service relative to the number of fleet containers?

It should, yes (there is no state). I'm guessing we could do it in a single instance with a bit of vertical scaling though if that's easier.

zayhanlon commented 2 weeks ago

@rfairburn i'm okay to put this at the top of your list for Monday

zwass commented 2 weeks ago

I'm going to put https://github.com/fleetdm/fleet/pull/19779 up for review as I imagine having it in the repo will make it easier to build on top of it.
