accessibility-exchange / platform

The Accessibility Exchange platform.
https://github.com/orgs/accessibility-exchange/projects/2/views/8
BSD 3-Clause "New" or "Revised" License

During rolling deploy it is possible for the old application pod to interact with the updated database #1867

Open jobara opened 1 year ago

jobara commented 1 year ago


Describe the bug

In our current rolling deploy system, as new pods are deployed an old pod sticks around until the new ones are ready for use. However, all pods connect to a single shared database. The issue is that a user may still be interacting with the old pod after the database has been migrated to a new structure. This can lead to data corruption and/or 500 errors for the user, because the old application's expectations of the schema no longer match the current database.

Expected behavior

We should minimize or eliminate the possibility of the old application interacting with the migrated database.

colleenskemp commented 1 year ago

This ticket captures the following tickets as related sub-tickets:

  • https://github.com/accessibility-exchange/platform/issues/1728
  • https://github.com/accessibility-exchange/platform/issues/1686
  • https://github.com/accessibility-exchange/platform/issues/1550

colleenskemp commented 1 year ago

@jobara - We understand that this is not a priority at this time. Is that right? The sense of our team is that we could turn off the rolling updates, but then we would have downtime for each deployment. That trade-off might not be worth it.

Do you agree?

jobara commented 1 year ago

@colleenskemp I'll have to think some more on this. I'll check in with @michelled when she's back.

jobara commented 1 year ago

At the dev check-in meeting with @JureUrsic, @peterhebert, and @michelled we discussed using Laravel's maintenance mode for this. When the deploy starts, the script would call php artisan down; after the deploy finishes, it would call php artisan up. Any users accessing the site during the maintenance window would see a maintenance page.
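A minimal sketch of that sequence (the `maintenance_deploy` wrapper and its injectable command argument are hypothetical conveniences for illustration; the real deploy script would invoke `php artisan` directly):

```shell
# Hypothetical helper: wrap the migration step in Laravel maintenance mode.
# Pass an alternative command (e.g. "echo") as $1 to dry-run the sequence.
maintenance_deploy() {
  local artisan="${1:-php artisan}"
  $artisan down --retry=60   # serve the maintenance page; ask clients to retry in 60s
  $artisan migrate --force   # migrate the shared database without a confirmation prompt
  $artisan up                # lift maintenance mode once the schema is settled
}
```

`--retry` and `--force` are standard artisan options; `--force` matters because `migrate` prompts for confirmation in production otherwise, which would hang a non-interactive deploy script.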

jobara commented 1 year ago

@JureUrsic I was thinking about this today and wondering when/where it should run. I was thinking it could go around the migration step in DeployGlobal.php, but I'm not sure, because wouldn't the old web head need to come down before we take the site out of maintenance mode? Also, are you able to take on this task?

JureUrsic commented 1 year ago

@jobara it should go into the "local" command, at the start and at the end

JureUrsic commented 1 year ago

I can run some tests on dev, just give me the commands to run

jobara commented 1 year ago

> I can run some tests on dev, just give me the commands to run

@JureUrsic thanks, you can use the php artisan down and php artisan up commands. See Laravel's maintenance mode for more information.

jobara commented 12 months ago

@JureUrsic the other day I manually reset the database in the dev deploy. As part of that I put the site in maintenance mode. After bringing the site back up using php artisan up the site was removed from maintenance mode; however, for several minutes it remained inaccessible and returned a 500 error (from nginx, I believe). So the site actually looked broken for a while. I'm not sure whether this will happen with the plans we have for this ticket, but it's something to look into along the way.

SantiagoG-Colab commented 12 months ago

@marvinroman

marvinroman commented 12 months ago

So the problem with maintenance mode currently is that the health check on the pods also gets the maintenance-mode response, so the pod is considered unhealthy and the load balancer doesn't forward connections to it.

We will take the following actions to fix this:

  • [ ] Create a health check that will bypass maintenance mode.
  • [ ] Put the php artisan down/up in the php artisan deploy:global command.
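For a health check that bypasses maintenance mode, one option in a stock Laravel app (a sketch, not the platform's actual code; the `healthz` path is an assumption) is to list the health-check URI in the `$except` array of the PreventRequestsDuringMaintenance middleware:

```php
<?php

namespace App\Http\Middleware;

use Illuminate\Foundation\Http\Middleware\PreventRequestsDuringMaintenance as Middleware;

class PreventRequestsDuringMaintenance extends Middleware
{
    /**
     * URIs that remain reachable while the site is in maintenance mode.
     * "healthz" is a placeholder for whatever path the pod probe actually hits.
     */
    protected $except = [
        'healthz',
    ];
}
```

Requests to an excluded URI skip the 503 maintenance response, so the pod probe keeps succeeding while the rest of the site shows the maintenance page.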

marvinroman commented 12 months ago

@jobara I've made the necessary changes in the branch associated with this issue. Let me know if you'd like me to create a PR for it.

jobara commented 12 months ago

@marvinroman thanks for working on this. Yes, please file a PR for the changes.

jobara commented 12 months ago

> So the problem with maintenance mode currently is that the health check on the pods also gets maintenance mode so the pod is considered unhealthy and the load balancer doesn't forward connections.
>
> We will take the following actions to fix:
>
> • [ ] Create a health check that will bypass maintenance mode.
> • [ ] Put the php artisan down/up in the php artisan deploy:global command.

Regarding the health check, glancing at your branch it looks like it checks the DB now. But I guess that won't really tell us whether the website is actually being served properly. Is there a way to check different things depending on whether the site is in maintenance mode or not?

Regarding turning maintenance mode on/off in the global deploy, will that affect the original instance as well and not just the two new ones that are in the process of spinning up?
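One possible answer to the health-check question (a sketch, not the platform's actual code; the `/healthz` path and response shape are assumptions) is an endpoint that reports the maintenance state explicitly via Laravel's `isDownForMaintenance()`, so the probe can distinguish "in maintenance" from "broken":

```php
<?php

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Route;

// Hypothetical health endpoint: returns 200 so the pod stays "healthy",
// but distinguishes normal operation from maintenance mode in the payload.
Route::get('/healthz', function () {
    return response()->json([
        'maintenance' => app()->isDownForMaintenance(),
        // rescue() returns false here if the database is unreachable
        'database' => rescue(fn () => DB::connection()->getPdo() !== null, false),
    ]);
});
```

For this route to respond at all during maintenance mode, it would also need to be excluded from the maintenance middleware.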

jobara commented 12 months ago

@marvinroman also, in your branch I noticed that it brings the site back up after 5 minutes. These kinds of timers are always risky, since we don't know whether the task is still running or finished some time earlier. Is it possible to get a hook into when the new pods are actually in use, and/or when the old pods have all been removed?
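If the deploy script can reach the cluster API, one alternative to a fixed timer would be `kubectl rollout status`, which blocks until a rollout completes. This is only a sketch under that assumption; the deployment name and the wrapper function are hypothetical:

```shell
# Hypothetical helper: lift maintenance mode only after the rollout finishes,
# rather than after a fixed 5-minute timer. Pass "echo" for both arguments
# to dry-run the sequence.
wait_for_rollout_then_up() {
  local kubectl="${1:-kubectl}" artisan="${2:-php artisan}"
  # Blocks until new pods are ready and old ones are removed, or times out.
  $kubectl rollout status deployment/platform --timeout=10m
  $artisan up
}
```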

marvinroman commented 12 months ago

> So the problem with maintenance mode currently is that the health check on the pods also gets maintenance mode so the pod is considered unhealthy and the load balancer doesn't forward connections. We will take the following actions to fix:
>
> • [ ] Create a health check that will bypass maintenance mode.
> • [ ] Put the php artisan down/up in the php artisan deploy:global command.
>
> Regarding the health check, in taking a glance at your branch, it looks like it checks the DB now. But I guess that won't really tell us if the web site is actually served up properly. Is there a way to check different things if the site is in maintenance mode or not?
>
> Regarding turning maintenance mode on/off in the global deploy, will that affect the original instance as well and not just the two new ones that are in the process of spinning up?

This is a health check of the pod, not the site, to determine whether the load balancer should forward connections to the pod. In other words: are the services running properly? We have a separate external check that determines site health and will notify us of site issues.

When maintenance mode is activated, it applies across all the pods.

marvinroman commented 12 months ago

I agree that there are risks associated with a timer, but we haven't found an alternative at this time.

We have determined that lifecycle hooks aren't possible to use in our infrastructure at this time.