test ECS deployment on staging.buttonweavers.com

cgolubi1 commented 3 months ago

This issue is for migrating staging.buttonweavers.com as an ECS site for a few weeks, and gathering data about everything that goes wrong so it can be addressed before we migrate the production site.

This is part of #2908

cgolubi1 commented 3 months ago

Here's my current understanding about how this will work:

Pre-migration steps:
- I can create the NLB target groups (but not the NLB itself) before the cutover --- i've done so for staging.
- Once the target groups are created, all data needed for buttonmen_ecs_config.json is knowable. So the next step is to fully fill out all entries for the stage in that file.
- Create the PR to bring the new deployment code over to the staging branch, and merge it.
- Create an updated-and-ready-to-go staging checkout with this code, and have the deployment command i'll use ready to hand.

At this point, i have to take the staging site offline to go further, because i need to move its EIP to the NLB at NLB creation time. So i think the steps are:
- Stop (do NOT terminate, we need it if we have to roll back) the staging instance
- Create the staging NLB
- Run the deployment

Ideally, that's it. If anything goes wrong, roll back and revisit (unless it's obvious how to fix it and roll forward, since staging downtime isn't a dealbreaker).
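
For reference, the first two offline steps (stop the instance, then create the NLB holding the old EIP) could be scripted roughly as below. This is a minimal sketch, not what was actually run: the function name and all of its parameters are hypothetical, and `ec2` / `elbv2` are assumed to be boto3 clients such as `boto3.client("ec2")` and `boto3.client("elbv2")`. The EIP disassociation is included explicitly because the address stays attached to a stopped instance.

```python
def cut_over_to_nlb(ec2, elbv2, *, instance_id, eip_association_id,
                    eip_allocation_id, subnet_id, nlb_name):
    """Stop the old instance, free its EIP, and create an NLB holding that EIP.

    Hypothetical helper: `ec2` and `elbv2` are assumed to be boto3 clients.
    Returns the ARN of the newly created load balancer.
    """
    # Stop, do NOT terminate: the stopped instance is the rollback path.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Detach the EIP from the stopped instance so the NLB can claim it.
    ec2.disassociate_address(AssociationId=eip_association_id)

    # Per the plan above, the EIP is handed to the NLB at creation time,
    # through the AllocationId in SubnetMappings.
    response = elbv2.create_load_balancer(
        Name=nlb_name,
        Type="network",
        Scheme="internet-facing",
        SubnetMappings=[
            {"SubnetId": subnet_id, "AllocationId": eip_allocation_id},
        ],
    )
    return response["LoadBalancers"][0]["LoadBalancerArn"]
```

Rolling back would be the reverse: delete the NLB, re-associate the EIP with the stopped instance, and start it again.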

cgolubi1 commented 3 months ago

Okay, the pre-migration steps should be done. Let's try this!

cgolubi1 commented 3 months ago

So far:

- Stopped buttonmen-staging instance
- Disassociated buttonmen-staging EIP
- Created NLB

It's still in provisioning state after 3 minutes, so there will be a bit of downtime associated with this item (for prod as well), but hopefully not too much.
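
To put a number on this part of the downtime for prod, the provisioning wait could be timed with a small polling loop. The sketch below is an assumption-laden illustration, not part of the deployment script: `wait_for_nlb_active` and its parameters are hypothetical, and `elbv2` is assumed to be a boto3 `elbv2` client, whose `describe_load_balancers` response reports the state under `State.Code`.

```python
import time


def wait_for_nlb_active(elbv2, nlb_arn, *, poll_seconds=15, timeout_seconds=600,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until the NLB leaves 'provisioning'; return the seconds waited.

    Hypothetical helper. `sleep` and `clock` are injectable so the loop can
    be exercised without real waiting.
    """
    start = clock()
    while True:
        response = elbv2.describe_load_balancers(LoadBalancerArns=[nlb_arn])
        state = response["LoadBalancers"][0]["State"]["Code"]
        elapsed = clock() - start
        if state == "active":
            return elapsed
        if state != "provisioning":
            # e.g. 'failed': no point in waiting any longer.
            raise RuntimeError(f"NLB entered unexpected state: {state}")
        if elapsed >= timeout_seconds:
            raise TimeoutError(f"NLB still provisioning after {elapsed:.0f}s")
        sleep(poll_seconds)
```

The returned duration is the figure worth recording here, since it feeds directly into the downtime estimate for the prod cutover.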

cgolubi1 commented 3 months ago

Okay, now it's active. Onward.

cgolubi1 commented 3 months ago

As predicted, i did something boneheaded:

$ env AWS_PROFILE=bm_deploy /Users/chaos/games/buttonmen/miniconda3/bin/python ./deploy/docker/deploy_buttonmen_site
Traceback (most recent call last):
  File "/Users/chaos/src/git/buttonmen-dev/buttonmen-staging/./deploy/docker/deploy_buttonmen_site", line 579, in <module>
    deploy(args)
  File "/Users/chaos/src/git/buttonmen-dev/buttonmen-staging/./deploy/docker/deploy_buttonmen_site", line 555, in deploy
    git_info = get_working_directory_info()
  File "/Users/chaos/src/git/buttonmen-dev/buttonmen-staging/./deploy/docker/deploy_buttonmen_site", line 45, in get_working_directory_info
    raise ValueError(f"Could not detect repo name from git remote: {output}")
ValueError: Could not detect repo name from git remote: origin  git@github.com:buttonmen-dev/buttonmen.git (fetch)
origin  git@github.com:buttonmen-dev/buttonmen.git (push)

Working on it...

cgolubi1 commented 3 months ago

It's a regexp that won't allow a -, so if the remote is buttonmen-dev (rather than e.g. cgolubi1) it fails. Sure. I'm actually just going to fix that one in my working directory and press forward, in the likely event that that's not the only one.
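
The actual pattern in deploy_buttonmen_site isn't shown here, so the following is only a sketch of the kind of fix involved: a hypothetical `parse_remote` helper whose character class accepts hyphens in the owner and repo names. The pattern, the helper name, and the returned tuple shape are all assumptions for illustration.

```python
import re

# Hypothetical pattern: matches SSH (git@github.com:owner/repo.git) and HTTPS
# (https://github.com/owner/repo.git) remotes. The class [\w.-] accepts
# hyphens, which a plain \w+ would reject for an owner like "buttonmen-dev".
REMOTE_RE = re.compile(
    r"github\.com[:/](?P<owner>[\w.-]+)/(?P<repo>[\w.-]+?)(?:\.git)?(?=\s|$)"
)


def parse_remote(output):
    """Return (owner, repo) from the first GitHub URL in `git remote -v` output."""
    match = REMOTE_RE.search(output)
    if match is None:
        raise ValueError(f"Could not detect repo name from git remote: {output}")
    return match.group("owner"), match.group("repo")
```

Fed the remote output from the traceback above, this yields the owner `buttonmen-dev` and the repo `buttonmen`; a fork remote such as `git@github.com:cgolubi1/buttonmen.git` parses the same way.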

cgolubi1 commented 3 months ago

Okay, now it's rolling. Again, building the docker container takes 10 minutes, and because of the NLB thing, that has to happen as part of the downtime. So one of the things we're doing here is trying to get a bead on how long the downtime will be when we do this for prod, so we can warn people.

cgolubi1 commented 3 months ago

In case anyone is worried: for future deployments, that won't matter, because the site will continue cheerfully running the old container while docker does whatever it has to do. It's only an issue for the cutover.

cgolubi1 commented 3 months ago

The site is up and running in a container as of 19:37 EDT.

cgolubi1 commented 3 months ago

So if nothing goes wrong, we should expect about 30 minutes of total downtime.

cgolubi1 commented 3 months ago

Okay, the second deployment succeeded with no downtime that i saw. Note that both containers were serving connections for a few minutes, so that's something we need to keep in mind when we think about pushing backwards-incompatible code.

cgolubi1 commented 3 months ago

I'm going to bring the old staging site back up temporarily so i can copy over its database backup archive to the EFS volume on the new site. We'll want to do this for prod, so let's make sure it works.

[Edit: never mind --- i lost track of what i was doing with database backups before this migration. They're already on EFS, and in fact on the same FS; i just changed the directory structure. So in general there's no need to bring the old site up after the migration.]

cgolubi1 commented 3 months ago

logrotate indeed doesn't work on this site - that's something to fix (adding to the checklist on #2908):

Subject: Cron <root@ip-X-X-X-X> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )

/etc/cron.daily/logrotate:
invoke-rc.d: action rotate is unknown, but proceeding anyway.
invoke-rc.d: could not determine current runlevel
invoke-rc.d: policy-rc.d denied execution of rotate.

test ECS deployment on staging.buttonweavers.com #2937