Open edwardsb opened 3 months ago
Thanks for tracking this @edwardsb! The way you outlined the concerns is great.
Looking at usage statistics, customer-ufa is on a relatively old version of Fleet so they are struggling with upgrading Fleet.
The customer outlined the following concerns and limitations:
- They cannot limit the migrations to a single instance.
- They may not be able to restrict traffic during migrations.
- They cannot scale the service to zero instances due to the nature of their infrastructure.
Ben, do you know why they can't limit the migrations to a single instance? And why they might not be able to restrict traffic during migrations?
This sounds similar to the "Maintenance mode: Migrate Fleet database while continuing to pass host logs to log destination" story here, which I think we decided not to do because taking the Fleet server down is the best practice.
For the "Maintenance mode" request, customer-blanco didn't want to take down Fleet so query data would keep flowing during migrations.
I'm adding the engineering-initiated label so that it gets on to @lukeheath's queue of eng-initiated requests.
Luke, what do you think the best path forward is here? Maybe you, I, Ben, and Zay jump on a call to discuss?
@noahtalerman Thanks for the heads up. This is something we've looked at quite a bit, and as you said the current best practice is to bring the Fleet server offline during migration.
I think https://github.com/fleetdm/fleet/issues/16704 would make this smoother, because users would get a clear message in the Fleet UI instead of a 5xx, and critical security telemetry would still flow from hosts without interruption.
Ideally, we'd run migrations while staying online using a blue/green deploy or a distributed lock outlined here. It's difficult to prioritize this as engineering-initiated, however, because it would take more than 10% of the sprint by itself. Given the level of effort, this would need to be strategically planned alongside other product deliverables.
Problem
We are facing challenges with the current database migration strategy for Fleet device management in a complex deployment environment. Our infrastructure requires that services remain online at all times, making it difficult to scale services down to zero instances for migrations. The existing procedure mandates taking the servers offline to run migrations, which conflicts with our operational requirements and can lead to service disruptions.
Summary
This issue arises from a discussion with a customer regarding database migrations for Fleet device management. The customer has a complex deployment strategy where scaling services down to zero is difficult and undesirable. The current upgrade strategy for Fleet involves taking the existing servers offline and running database migrations using the Fleet application. However, the customer's infrastructure requires services to be up at all times, making it challenging to follow this procedure.
Context
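The current upgrade strategy for Fleet involves taking the existing servers offline and running database migrations with `fleet prepare db`.
The customer outlined the following concerns and limitations:
- They cannot limit the migrations to a single instance.
- They may not be able to restrict traffic during migrations.
- They cannot scale the service to zero instances due to the nature of their infrastructure.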
Discussion Highlights
Customer's Understanding and Challenges:
Current Workarounds:
Feature Request:
Proposed Solution
Implement a distributed locking mechanism to coordinate database migrations. This could be achieved using:
- MySQL's `SKIP LOCKED` feature (available from version 8.0) to implement a distributed lock. Instances would attempt to acquire a lock by querying a specific table/row with the `SKIP LOCKED` clause. The instance that successfully acquires the lock would proceed with the migration.
Important Considerations
Benefits