elfhosted / enhancements

This repository collects "Elf Enhancement Proposals" (EEPs!)

Solicit and incorporate feedback on outage management process post-NFS-bug #8

Open funkypenguin opened 3 months ago

funkypenguin commented 3 months ago

This issue captures feedback and learnings from the recent outage caused by the NFS / Cilium bug, with a view to improving our processes for inevitable future issues. I'm looking for a list of issues only (I'm not going to debate why / what happened; I just want a "bucket" for all feedback to go into while it's fresh, so that we don't lose the value).

Please post feedback / observations / suggestions below :)

funkypenguin commented 3 months ago

One issue identified is that Gatus can get very noisy when apps are restarted outside of the scheduled maintenance window, and there's no global "off" switch, since emails are sent from each tenant's Gatus instance. A possible option would be a "global off switch" on the account that Gatus uses to send the emails (through Mailgun).

mikesrespository commented 3 months ago

Can an email notification (e.g. the message posted in elf announce) be sent to all users when an outage begins, so that non-Discord users are notified too?

legionsystems commented 3 months ago

Where possible, a more staged approach would be ideal, so that changes like this don't impact all users at once. The symlink migration was done with a pilot group first, so in this instance the move could have been spread over a few weeks / nights, moving say 30% of the workload at a time to the new storage.

funkypenguin commented 3 months ago

Yeah, that's a nice idea. We already split users into 26 groups alphabetically for sharding of the Flux reconciliations, which might help us apply changes to a smaller sample set in future...

mikesrespository commented 3 months ago

It might make sense to split users by the last 2 digits of account/subscription numbers so you don't have to worry about splitting the alphabet manually.
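
To make the suggestion concrete, here is a rough sketch of how sharding on the last two digits could feed a staged rollout of roughly 30% at a time. Everything in it is an assumption for illustration: the function names `shard_for`, `rollout_waves` and `wave_for` are made up, as is the account number format, and nothing here reflects how the existing alphabetical sharding is actually implemented.

```python
def shard_for(account_number: str, shards: int = 100) -> int:
    """Map an account number to a shard id using its last two digits.

    Hypothetical helper: assumes account numbers contain at least one digit.
    """
    digits = "".join(ch for ch in account_number if ch.isdigit())
    if not digits:
        raise ValueError(f"no digits in account number: {account_number!r}")
    return int(digits[-2:]) % shards


def rollout_waves(shards: int = 100, fraction: float = 0.3) -> list[range]:
    """Split shard ids 0..shards-1 into consecutive waves of about `fraction` each."""
    size = max(1, round(shards * fraction))
    return [range(start, min(start + size, shards)) for start in range(0, shards, size)]


def wave_for(account_number: str, shards: int = 100, fraction: float = 0.3) -> int:
    """Return the zero-based wave in which this account would be migrated."""
    size = max(1, round(shards * fraction))
    return shard_for(account_number, shards) // size


if __name__ == "__main__":
    # Made-up account numbers, purely for demonstration.
    for account in ["sub-001234", "sub-000007", "sub-004599"]:
        print(account, "shard", shard_for(account), "wave", wave_for(account))
```

With the defaults this gives four waves (three of 30 shards and a final one of 10), so a change could be applied to wave 0 on the first night and only continue to later waves if nothing breaks. It only spreads users evenly if the trailing digits of account numbers are themselves roughly evenly distributed.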