GoogleChrome / lighthouse-ci

Automate running Lighthouse for every commit, viewing the changes, and preventing regressions

Migrations on large databases can brick a PaaS deployment #587

Open patrickhulce opened 3 years ago

patrickhulce commented 3 years ago

Describe the bug: The startup sequence for the LHCI server runs database migrations before the server starts listening on a port. If the database is large and a migration requires building a new index, the platform (e.g. Heroku) can kill the server because it appears to "fail to start".

The solution here would be to listen on the port first and ensure all request handling awaits the migration promise.
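A minimal sketch of that ordering, assuming an Express-style server (`runMigrations`, the migration internals, and the port are placeholders, not actual LHCI code):

```js
const express = require('express');

// Hypothetical stand-in for LHCI's real migration step.
async function runMigrations() {
  /* apply schema changes, build indexes, ... */
}

const app = express();

// Kick off migrations in the background instead of awaiting them before listen.
const migrationPromise = runMigrations().catch(err => {
  console.error('Migration failed:', err);
  process.exit(1);
});

// Every request waits for migrations to finish before it is handled.
app.use(async (req, res, next) => {
  await migrationPromise;
  next();
});

// The port opens immediately, so the platform's "did it start?" check passes.
app.listen(process.env.PORT || 9001);
```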

The canary instance was undeployable because of this issue.

somehowchris commented 3 years ago

I'm also interested in this, not only for platforms like Heroku (or Cloud Foundry/cf), which have a timeout when health-checking the app, but also for Kubernetes (k8s).

For example, software like Matomo has a maintenance mode for those upgrades: the UI gets disabled and shows a notice that the site is currently under maintenance, the upgrade runs via a task (cf) or job (k8s), and afterwards maintenance mode is disabled again. For advanced setups there is also a caching method via Redis, though that's a bit overkill sometimes; it allows tracking actions to be saved as raw inputs and processed by a Matomo worker once the new version is running.

Would a solution like this be an option?
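For reference, a sketch of the separate-task idea: a standalone migration entrypoint (a hypothetical `migrate.js`, not something LHCI ships) that a k8s Job or cf task could run before rolling out the new server version:

```js
#!/usr/bin/env node
// migrate.js — hypothetical one-off entrypoint for running migrations as a
// k8s Job or cf task, separate from the long-running server process.

// Stand-in for the real migration logic.
async function runMigrations() {
  /* apply schema changes, build indexes, ... */
}

runMigrations()
  .then(() => {
    console.log('Migrations complete');
    process.exit(0);
  })
  .catch(err => {
    console.error('Migration failed:', err);
    process.exit(1); // non-zero exit fails the Job/task and blocks the rollout
  });
```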

patrickhulce commented 3 years ago

That's essentially what I was thinking of, yes @somehowchris. LHCI would just fail all incoming requests with a 503 error until the migration completes.

somehowchris commented 3 years ago

So just a bare 503, or a 503 with an HTML page for users?


Split into a server and a separate migration task, or just the server running the migration? The latter would go against the twelve-factor scalability aspect.

patrickhulce commented 3 years ago

> So just a bare 503, or a 503 with an HTML page for users?

I'm thinking a dead-simple `503 Service Unavailable (Migration)` response. This is a pretty narrow occurrence that isn't worth spending significant effort optimizing, IMO.
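Continuing the earlier sketch, the reject-instead-of-wait variant is only a few lines (again with a placeholder `runMigrations`, not actual LHCI code):

```js
const express = require('express');
const app = express();

// Hypothetical stand-in for LHCI's real migration step.
async function runMigrations() {
  /* apply schema changes, build indexes, ... */
}

let migrated = false;
runMigrations().then(() => {
  migrated = true;
});

// Fail every request with a plain 503 until migrations have finished.
app.use((req, res, next) => {
  if (!migrated) {
    return res.status(503).send('Service Unavailable (Migration)');
  }
  next();
});

app.listen(process.env.PORT || 9001);
```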

> Split into a server and a separate migration task, or just the server running the migration? The latter would go against the twelve-factor scalability aspect.

Just the server running the migration, as it works today.

The LHCI server isn't built to scale out of the box and requires heavy customization if scalability is your goal; it's built for ease of use in the common cases. Splitting out a separate migration task would be overkill compared to the other tradeoffs already made in this project (have you seen how the cron jobs work? 😉).