Open cgolubi1 opened 1 year ago
I've been thinking about this for a while, but it has possibly become an emergency thing, because a set of forced upgrades and other misadventures has left me without a working vagrant install. I'm experimenting with other ways to get vagrant up and running, because that's obviously the shorter-term patch, but it's tempting not to waste a crisis, and just move over to where i wish we were.
In theory (in theory, theory and practice are the same), CircleCI tells us that fundamentally we can build the site, including a local DB for sites that use that, and run the code, using a dockerfile. So for dev and replay sites, all we should need is whatever `deploy/vagrant` in the codebase currently does for us (apache configuration, postfix, cron jobs, cloudwatch metrics, mostly, i think). Staging and prod have some extra config to talk to RDS (which can presumably be dockerized trivially as well), and use elastic IPs (which i'm not sure we can attach to containers directly, so we might have to stick them behind load balancers).
But all of that strikes me as basically doable.
Okay, so: as far as i know, we've resolved the hard blockers for prod/staging. The only way to find out for sure is to try this out on staging. So i'm going to work towards that now --- i'll cut a new issue for deploying and testing that change.
Okay, so i went down quite a rathole on this cron error thing.
Here's the situation:
- The `/proc` filesystem can't be used to determine process status within a container the way it can be on an instance: if `$PID` is the process ID of a running process, then on an instance `sudo readlink /proc/${PID}/exe` yields the executable path of the running process; on a container, it doesn't.
- `ps` does work --- so you can get information about processes running in the container, you just can't get it via the `/proc` pseudofs per se.
- `start-stop-daemon`, which is the underlying program that `/etc/init.d` scripts use to manage processes in our OS, relies on `readlink /proc/${PID}/exe` to tell if a process is running. In other words, on our containers right now, you can't run `/etc/init.d/${service} stop` for any service, and you can't do anything else that does that under the hood:
  - `rsyslog`'s logrotate configuration --- `rsyslogd` wants to restart itself (really, send itself SIGHUP) as a post-rotate action, and it uses init scripts to do that, so it doesn't work.
  - `/etc/init.d/apache2 restart` (`reload` does work).
- `init.d` scripts can't stop processes, but they can start processes just fine. So we can and do use `/etc/init.d/${service} start` to start our services at container launch time, and could use it again later if we knocked over a process.
- Instances run `systemd` (aka `/sbin/init`) and use that to start up a variety of services for a variety of reasons. Our containers don't --- on startup, they just start the set of services explicitly listed in `deploy/docker/startup.sh` in the codebase. This isn't directly related to this root cause, i'm just mentioning it because it's related to the `/proc` access thing --- if a bunch of services that aren't important to us depend deeply on `/proc`, we don't actually need to care about that, because we're not running those services on container sites. We only need to make sure the services we actually depend on (apache, mysql (for dev sites), cron, syslog, ssh, postfix) work.
- When logrotate rotates syslog, `/var/log/syslog` gets moved to `/var/log/syslog.1` and a new `/var/log/syslog` is created, but without the `kill -HUP` which the post-rotate script tries to do, syslog keeps writing to `/var/log/syslog.1`.
- `sudo -u syslog killall -HUP --user syslog` does work. So all we need to do is make that the post-rotate action, instead of trying to use the init.d script.

So my recommendation is that we make that small fix so that syslog rotation will do what we expect it to, and that we be aware of the lack of access to `/proc` and the corresponding failure of init scripts to behave how we'd expect on stop, but don't do anything about it, because we don't actually have a use case for needing those things.
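To make the `/proc` point concrete, here's a minimal sketch you can run in a shell. It inspects the shell's own PID (own-process reads tend to work even where cross-user `/proc` reads are restricted, so this illustrates the mechanism `start-stop-daemon` uses rather than reproducing the container failure):

```shell
#!/bin/sh
# Two routes to the same question: "what executable is behind this PID?"
pid=$$

# Route 1: the /proc symlink that start-stop-daemon relies on.
# On a plain instance this prints something like /bin/sh or /usr/bin/bash.
echo "via /proc: $(readlink /proc/${pid}/exe)"

# Route 2: ps, which still works inside our containers.
echo "via ps:    $(ps -p ${pid} -o comm=)"
```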
Okay, i have a branch pushed which makes that change [1]
, and i'll monitor it for a few days until it rotates the logs, and assuming it does the right thing, i'll put up a PR.
[1]
https://2908-install-systemd.cgolubi1.dev.buttonweavers.com/ui/ if you need it, but you don't, because i didn't make any changes impacting the website --- and, yes, the branch is somewhat misnamed, we're not actually doing anything with systemd. I'm planning to not lose sleep over it.
I'm not 100% confident in my validation of the log rotation solution, but:
```
$ ls -lart /var/log/syslog*
-rw-r----- 1 syslog adm 3319 May 6 06:39 /var/log/syslog.4.gz
-rw-r----- 1 syslog adm 916 May 7 06:39 /var/log/syslog.3.gz
-rw-r----- 1 syslog adm 901 May 8 06:39 /var/log/syslog.2.gz
-rw-r----- 1 syslog adm 8883 May 9 06:39 /var/log/syslog.1
-rw-r----- 1 syslog adm 2545 May 9 13:39 /var/log/syslog
```
So i'm going to call that good enough and PR my solution.
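For the record, the shape of the change is roughly this (a hypothetical sketch of an `/etc/logrotate.d/rsyslog` stanza --- the rotation options shown are stock Debian-ish defaults and may not match our image; the only real point is the postrotate swap):

```
/var/log/syslog
{
        rotate 7
        daily
        missingok
        notifempty
        compress
        delaycompress
        postrotate
                # Don't go through the init script (which needs /proc to
                # find the process); HUP the syslog user's rsyslogd directly:
                sudo -u syslog killall -HUP --user syslog
        endscript
}
```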
We're using vagrant to manage prod/staging/dev/replay buttonmen sites on EC2 instances (we use docker-based circleci to run regression tests). This has served us okay so far, but, since we committed to the vagrant+instance path, docker+container has kind of become the done thing. There are a couple of specific reasons we might want to migrate:
- The `vagrant-aws` plugin, which we totally rely on, is not being maintained. We should move away from that dependency.

This issue is to make and implement a plan for migrating site installs from vagrant+EC2 to docker+ECS.

Problems that need to be addressed (this list is incomplete):