Closed: cgolubi1 closed this issue 1 month ago.
Here's my current understanding of how this will work: `buttonmen_ecs_config.json` is knowable, so the next step is to fully fill out all entries for the stage in that file. Ideally, that's it. If anything goes wrong, roll back and revisit (unless it's obvious how to fix it and roll forward, since staging downtime isn't a dealbreaker).
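To make "fill out all entries for the stage" concrete, here is a minimal sketch of what a per-stage entry might look like. The actual schema of `buttonmen_ecs_config.json` isn't shown in this thread, so every key name below is hypothetical, not the real config format:

```python
import json

# Hypothetical sketch of a stage entry in buttonmen_ecs_config.json;
# none of these key names are confirmed by the deploy script.
example_config = {
    "staging": {
        "cluster": "buttonmen-staging",      # assumed ECS cluster name
        "service": "buttonmen-site",         # assumed ECS service name
        "efs_volume_id": "fs-XXXXXXXX",      # placeholder EFS volume id
        "desired_count": 1,
    }
}

# Round-trip through JSON, as a deploy script reading the file would.
stage = json.loads(json.dumps(example_config))["staging"]
```

The point is just that every value a deployment needs should be a literal entry in the file, so the deploy script has no per-stage special cases.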
Okay, the pre-migration steps should be done. Let's try this!
So far:
It's still in provisioning state after 3 minutes, so there will be a bit of downtime associated with this item (for prod as well), but hopefully not too much.
Okay, now it's active. Onward.
As predicted, i did something boneheaded:
```
$ env AWS_PROFILE=bm_deploy /Users/chaos/games/buttonmen/miniconda3/bin/python ./deploy/docker/deploy_buttonmen_site
Traceback (most recent call last):
  File "/Users/chaos/src/git/buttonmen-dev/buttonmen-staging/./deploy/docker/deploy_buttonmen_site", line 579, in <module>
    deploy(args)
  File "/Users/chaos/src/git/buttonmen-dev/buttonmen-staging/./deploy/docker/deploy_buttonmen_site", line 555, in deploy
    git_info = get_working_directory_info()
  File "/Users/chaos/src/git/buttonmen-dev/buttonmen-staging/./deploy/docker/deploy_buttonmen_site", line 45, in get_working_directory_info
    raise ValueError(f"Could not detect repo name from git remote: {output}")
ValueError: Could not detect repo name from git remote: origin git@github.com:buttonmen-dev/buttonmen.git (fetch)
origin git@github.com:buttonmen-dev/buttonmen.git (push)
```
Working on it...
It's a regexp that won't allow a `-`, so if the remote is `buttonmen-dev` (rather than e.g. `cgolubi1`) it fails. Sure. I'm actually just going to fix that one in my working directory and press forward, in the likely event that that's not the only one.
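For the record, the failure mode is easy to reproduce in a couple of lines of Python. The actual pattern in `deploy_buttonmen_site` isn't shown in this thread, so the "broken" regexp below is a hypothetical reconstruction of the bug, not the script's real code:

```python
import re

# Output format from `git remote -v`, as seen in the traceback above.
remote_line = "origin\tgit@github.com:buttonmen-dev/buttonmen.git (fetch)"

# Hypothetical reconstruction of the bug: \w is [A-Za-z0-9_], so a
# pattern without '-' in its character class stops matching at the
# hyphen in "buttonmen-dev".
broken = re.search(r"github\.com:(\w+)/(\w+)\.git", remote_line)

# Allowing '-' (and '.') in the org and repo names fixes it.
fixed = re.search(r"github\.com:([\w.-]+)/([\w.-]+)\.git", remote_line)

# broken is None; fixed captures ("buttonmen-dev", "buttonmen").
```

So a remote under a personal account like `cgolubi1` happens to match, and the bug only shows up once the remote is the hyphenated org.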
Okay, now it's rolling. Again, building the docker container takes 10 minutes, and because of the NLB thing, that has to happen as part of the downtime. So one of the things we're doing here is trying to get a bead on how long the downtime will be when we do this for prod, so we can warn people.
In case anyone is worried: for future deployments, that won't matter, because the site will continue cheerfully running the old container while docker does whatever it has to do. It's only an issue for the cutover.
The site is up and running on the new container as of 19:37 EDT.
So if nothing goes wrong, we should expect about 30 minutes of total downtime.
Okay, the second deployment succeeded with no downtime that i saw. Note that both containers were serving connections for a few minutes, so that's something we need to keep in mind when we think about pushing backwards-incompatible code.
I'm going to bring the old staging site back up temporarily so i can copy over its database backup archive to the EFS volume on the new site. We'll want to do this for prod, so let's make sure it works.
[Edit: never mind --- i lost track of what i was doing with database backups before this migration. They're already on EFS, and in fact on the same FS, i just changed the directory structure. So no need to bring the old site up after the migration, in general.]
logrotate indeed doesn't work on this site - that's something to fix (adding to the checklist on #2908):
```
Subject: Cron <root@ip-X-X-X-X> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )

/etc/cron.daily/logrotate:
invoke-rc.d: action rotate is unknown, but proceeding anyway.
invoke-rc.d: could not determine current runlevel
invoke-rc.d: policy-rc.d denied execution of rotate.
```
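Those `invoke-rc.d` errors are the classic symptom of running in a container where `policy-rc.d` blocks service actions, so any postrotate hook that tries to signal the service via init scripts will fail. One common workaround is to have logrotate use `copytruncate` instead of signaling the daemon; a sketch (log paths and retention are hypothetical, not this site's actual config):

```
# Hypothetical /etc/logrotate.d entry: copytruncate rotates in place,
# so no service restart (and no invoke-rc.d) is needed.
/var/log/apache2/*.log {
    daily
    rotate 14
    compress
    missingok
    copytruncate
}
```

The tradeoff is a small window where lines written between the copy and the truncate can be lost, which is usually acceptable for a staging site.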
This issue is for migrating staging.buttonweavers.com to ECS, running it as an ECS site for a few weeks, and gathering data about everything that goes wrong so it can be addressed before we migrate the production site.
This is part of #2908