Add a mechanism to enable leader-election when handling migrations

edwardsb commented 3 months ago

Problem

We are facing challenges with the current database migration strategy for Fleet device management in a complex deployment environment. Our infrastructure requires that services remain online at all times, making it difficult to scale services down to zero instances for migrations. The existing procedure mandates taking the servers offline to run migrations, which conflicts with our operational requirements and can lead to service disruptions.

Summary

This issue arises from a discussion with a customer regarding database migrations for Fleet device management. The customer has a complex deployment strategy where scaling services down to zero is difficult and undesirable. The current upgrade strategy for Fleet involves taking the existing servers offline and running database migrations using the Fleet application. However, the customer's infrastructure requires services to be up at all times, making it challenging to follow this procedure.

Context

The current upgrade strategy for Fleet involves:

- Taking the existing servers offline.
- Running database migrations using the command fleet prepare db.

The customer outlined the following concerns and limitations:

- They cannot limit the migrations to a single instance.
- They may not be able to restrict traffic during migrations.
- They cannot scale the service to zero instances due to the nature of their infrastructure.

Discussion Highlights

Customer's Understanding and Challenges:
- Migrations require a single instance to run them and no traffic against any of the existing instances.
- There is no coordination between instances if multiple instances attempt to migrate simultaneously.
- The customer cannot scale down to a single instance or restrict traffic.
Current Workarounds:
- The customer is considering running Fleet "somewhere else" with access to the DB to perform migrations.
- Fleet team shared their internal procedure: scaling down the service, running migrations on a single task, and then scaling back up.
Feature Request:
- The customer suggested implementing a distributed locking mechanism to coordinate migrations. This mechanism would ensure that only one instance performs the migration, even in a distributed environment.

Proposed Solution

Implement a distributed locking mechanism to coordinate database migrations. This could be achieved using:

Redis Distributed Lock:
- Utilize Redis to implement a distributed lock. Instances would attempt to acquire the lock before running migrations. Only the instance that successfully acquires the lock would proceed with the migration process.

[!NOTE]
See https://redis.io/docs/latest/develop/use/patterns/distributed-locks/ and https://redis.io/glossary/redis-lock/
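To make the Redis option concrete, here is a minimal Go sketch of the acquire/release protocol. It is not Fleet code: the `Store` interface, the key name `fleet:migration-lock`, the 10 minute TTL, and `runMigrationsOnce` are all hypothetical, and `memStore` is an in-memory stand-in so the example runs without a Redis server. With real Redis, `SetNX` would correspond to `SET key token NX PX <ttl>`, and `DeleteIfEqual` to a small script that compares the stored token before deleting, as described in the Redis pattern linked above.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
	"time"
)

// Store is the minimal contract the lock needs (hypothetical interface).
type Store interface {
	SetNX(key, value string, ttl time.Duration) bool
	DeleteIfEqual(key, value string) bool
}

type entry struct {
	value   string
	expires time.Time
}

// memStore is an in-memory stand-in for Redis, used only so the sketch runs.
type memStore struct {
	mu      sync.Mutex
	entries map[string]entry
}

func newMemStore() *memStore { return &memStore{entries: map[string]entry{}} }

// SetNX stores the value only if the key is absent or its TTL has passed.
func (s *memStore) SetNX(key, value string, ttl time.Duration) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.entries[key]; ok && time.Now().Before(e.expires) {
		return false // another holder has an unexpired lock
	}
	s.entries[key] = entry{value: value, expires: time.Now().Add(ttl)}
	return true
}

// DeleteIfEqual removes the key only if it still holds the caller's token,
// so an instance whose lock expired cannot delete a successor's lock.
func (s *memStore) DeleteIfEqual(key, value string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.entries[key]; ok && e.value == value && time.Now().Before(e.expires) {
		delete(s.entries, key)
		return true
	}
	return false
}

// newToken returns a random value that identifies this lock holder.
func newToken() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

const (
	lockKey = "fleet:migration-lock" // hypothetical key name
	lockTTL = 10 * time.Minute       // assumption: longer than the slowest migration
)

// runMigrationsOnce runs migrate only if this caller wins the lock.
// It reports whether the lock was acquired.
func runMigrationsOnce(s Store, migrate func()) bool {
	token := newToken()
	if !s.SetNX(lockKey, token, lockTTL) {
		return false
	}
	defer s.DeleteIfEqual(lockKey, token)
	migrate()
	return true
}

func main() {
	store := newMemStore()
	nested := true
	// The nested call plays the role of a second instance arriving mid-migration.
	first := runMigrationsOnce(store, func() {
		nested = runMigrationsOnce(store, func() {})
	})
	after := runMigrationsOnce(store, func() {})
	fmt.Println(first, nested, after) // true false true
}
```

One design point to settle with this approach is the TTL: if a migration can outlive it, the lock expires while the migration is still running, so the holder would need to extend the lock or the TTL would need a generous margin.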

MySQL with SKIP LOCKED:
- Use MySQL's SKIP LOCKED feature (available from version 8.0) to implement a distributed lock. Instances would attempt to acquire a lock by querying a specific table/row with the SKIP LOCKED clause. The instance that successfully acquires the lock would proceed with the migration.

[!IMPORTANT]
SKIP LOCKED is only available in MySQL 8.0 and later.
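As a rough illustration of the MySQL option, the Go sketch below takes a row lock with `FOR UPDATE SKIP LOCKED` inside a transaction and treats "no rows returned" as "another instance holds the lock". Everything named here is an assumption for illustration: the `migration_lock` table, its single row, `withMigrationLock`, and `lockAcquired` are hypothetical and not part of Fleet. No MySQL driver is imported, so `main` only shows how the scan result is interpreted.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// Hypothetical single-row lock table, created once ahead of time:
//
//	CREATE TABLE migration_lock (id INT PRIMARY KEY);
//	INSERT INTO migration_lock (id) VALUES (1);
const lockQuery = "SELECT id FROM migration_lock WHERE id = 1 FOR UPDATE SKIP LOCKED"

// With SKIP LOCKED, a row locked by another transaction is skipped, so the
// query returns no rows and Scan reports sql.ErrNoRows.
var errLockHeld = sql.ErrNoRows

// lockAcquired interprets the error returned by scanning lockQuery.
func lockAcquired(scanErr error) (bool, error) {
	switch {
	case scanErr == nil:
		return true, nil // we hold the row lock
	case errors.Is(scanErr, errLockHeld):
		return false, nil // another instance holds it
	default:
		return false, scanErr
	}
}

// withMigrationLock runs migrate only if this instance wins the row lock.
// Assumption: migrate uses a different connection than the lock transaction,
// because DDL statements in MySQL cause an implicit commit, which would
// release the row lock if they ran inside this transaction.
func withMigrationLock(ctx context.Context, db *sql.DB, migrate func(context.Context) error) (bool, error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return false, err
	}
	// Rollback releases the row lock; after a successful Commit it is a no-op.
	defer tx.Rollback()

	var id int
	ok, err := lockAcquired(tx.QueryRowContext(ctx, lockQuery).Scan(&id))
	if err != nil || !ok {
		return ok, err
	}
	if err := migrate(ctx); err != nil {
		return true, err
	}
	return true, tx.Commit()
}

func main() {
	// Without a driver and a live database, only the interpretation step can run.
	fmt.Println(lockAcquired(nil))
	fmt.Println(lockAcquired(errLockHeld))
}
```

A property worth noting is that the lock lives in the transaction: if the instance holding it crashes, the connection drops, the transaction is rolled back, and the row lock is released without any TTL bookkeeping.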

Important Considerations

Regardless of whether the application succeeds or fails to acquire the distributed lock, it must respond to health checks. This ensures that container orchestration systems like ECS or Kubernetes do not kill services that are running migrations or in the backoff loop after failing to obtain the lock.
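The health check requirement could look roughly like the Go sketch below: the health endpoint answers 200 whether or not migrations have finished, while a retry loop with capped exponential backoff waits for the lock. This is a sketch under assumptions, not Fleet's implementation: `gate`, `health`, `waitForMigrations`, `probe`, and the `/healthz` path are hypothetical names, and `tryMigrate` stands for "acquire the lock and migrate, or observe that the schema is already up to date".

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// gate tracks whether migrations have completed (hypothetical type).
type gate struct{ ready atomic.Bool }

// health always answers 200 so the orchestrator does not kill an instance
// that is migrating or backing off; the body reports the current state.
func (g *gate) health() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		if g.ready.Load() {
			fmt.Fprintln(w, "ok")
		} else {
			fmt.Fprintln(w, "waiting on migrations")
		}
	})
}

// waitForMigrations retries tryMigrate with capped exponential backoff until
// it reports success, then marks the instance ready. In a real service this
// would run in a goroutine while the HTTP server is already listening.
func (g *gate) waitForMigrations(tryMigrate func() bool, base, maxDelay time.Duration) {
	delay := base
	for !tryMigrate() {
		time.Sleep(delay)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	g.ready.Store(true)
}

// probe returns the status code a health check would see.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	g := &gate{}
	fmt.Println("before:", probe(g.health()))

	attempts := 0
	// Simulated lock contention: the third attempt succeeds.
	g.waitForMigrations(func() bool {
		attempts++
		return attempts >= 3
	}, 10*time.Millisecond, 40*time.Millisecond)

	fmt.Println("after:", probe(g.health()), "attempts:", attempts)
}
```

Whether normal API traffic should be served before the instance is marked ready is a separate decision; the sketch only covers keeping the health check green.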

Benefits

- Ensures only one instance performs the migration, preventing conflicts and ensuring database consistency.
- Allows migrations to be run without taking the entire service offline.
- Provides a robust solution for environments where scaling down instances or restricting traffic is not feasible.

noahtalerman commented 3 months ago

Thanks for tracking this @edwardsb! The way you outlined the concerns is great.

Looking at usage statistics, customer-ufa is on a relatively old version of Fleet so they are struggling with upgrading Fleet.

The customer outlined the following concerns and limitations:

They cannot limit the migrations to a single instance.

They may not be able to restrict traffic during migrations.

They cannot scale the service to zero instances due to the nature of their infrastructure.

Ben, do you know why they can't limit the migrations to a single instance? And why they might not be able to restrict traffic during migrations?

This sounds similar to the "Maintenance mode: Migrate Fleet database while continuing to pass host logs to log destination" story here which I think we decided not to do because taking the Fleet server down is the best practice.

For the "Maintenance mode" request, customer-blanco didn't want to take down Fleet so query data would keep flowing during migrations.

I'm adding the engineering-initiated label so that it gets on to @lukeheath's queue of eng-initiated requests.

Luke, what do you think the best path forward is here? Maybe you, I, Ben, and Zay jump on a call to discuss?

lukeheath commented 3 months ago

@noahtalerman Thanks for the heads up. This is something we've looked at quite a bit, and as you said the current best practice is to bring the Fleet server offline during migration.

I think https://github.com/fleetdm/fleet/issues/16704 would make this smoother, because users would get a clear message at the Fleet UI instead of a 5xx, and critical security telemetry would still flow from hosts without interruption.

Ideally, we'd run migrations while staying online using a blue/green deploy or the distributed lock outlined here. It's difficult to prioritize this as engineering-initiated, however, because it would take more than 10% of the sprint by itself. Given the level of effort, this would need to be strategically planned alongside other product deliverables.
