fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Maintenance mode for database migrations #4467

Open tgauda opened 2 years ago

tgauda commented 2 years ago

Today, when a customer upgrades Fleet, database migrations occasionally need to run. Customers can either take the system offline to perform the upgrade or keep it online and upgrade in place. If the system stays online, the migrations can take much longer to complete because of contention. In an enterprise environment, taking the system offline can trip monitors that respond to 5xx errors and cause on-call personnel to be notified.

We can solve this dilemma by adding a maintenance mode to Fleet that temporarily disallows client-facing API usage and returns a specific HTTP response that an enterprise can account for in its on-call procedures.

A fleetctl command could enable or disable this mode.

How?

noahtalerman commented 2 years ago

Making sure that this customer can easily upgrade is a high priority.

I think the proposed solution is a great one. It elegantly adapts the product to support the customer's specific environment.

One downside I see is that the solution is relatively involved (fleetctl, UI, and API changes). Not only will the initial implementation (work now) take significant effort, it may also increase the surface area for bugs and uncaught edge cases (work later).

Here's how I imagine the conversation went, starting with the proposed solution and working backwards. @tgauda please correct me if I'm wrong.

I could be wrong, but the length of the database migrations seems like a large motivator (the main one?) for not performing the upgrade in place. Tony, is this correct?

tgauda commented 2 years ago

They can't easily take the system offline because of the on-call procedures that are in place. If the system goes down, it'll trigger a response, and they'd like to avoid that. All the other points are correct.

noahtalerman commented 2 years ago

@tgauda follow-up from today's (03-11-2022) call:

The following are immediate steps we

Question 1: Can the customer configure monitoring to not respond to the specific 5xx code that occurs when taking Fleet offline?

Question 2: Can the customer configure/change the load balancer to return a specific code that does not trigger monitoring when taking Fleet offline?

If the answer to both questions above is "No," then we'll likely prioritize drafting a solution like the one outlined in this issue.