Open cgolubi1 opened 1 year ago
I've been thinking about this for a while, but it has possibly become an emergency thing, because a set of forced upgrades and other misadventures has left me without a working vagrant install. I'm experimenting with other ways to get vagrant up and running, because that's obviously the shorter-term patch, but it's tempting not to waste a crisis, and just move over to where i wish we were.
In theory (in theory, theory and practice are the same), CircleCI tells us that fundamentally we can build the site, including a local DB for sites that use that, and run the code, using a dockerfile. So for dev and replay sites, all we should need is whatever `deploy/vagrant` in the codebase currently does for us (apache configuration, postfix, cron jobs, cloudwatch metrics, mostly, i think). Staging and prod have some extra config to talk to RDS (which can presumably be dockerized trivially as well), and use elastic IPs (which i'm not sure we can attach to containers directly, so we might have to stick them behind load balancers).
But all of that strikes me as basically doable.
Okay, so: as far as i know, we've resolved the hard blockers for prod/staging. The only way to find out for sure is to try this out on staging. So i'm going to work towards that now --- i'll cut a new issue for deploying and testing that change.
Okay, so i went down quite a rathole on this cron error thing.
Here's the situation:
- The `/proc` filesystem can't be used to determine process status within a container the way it can be on an instance: if `$PID` is the process ID of a running process, then on an instance `sudo readlink /proc/${PID}/exe` yields the executable path of the running process; on a container, it doesn't.
- `ps` does work --- so you can get information about processes running in the container, you just can't get it via the `/proc` pseudofs per se.
- `start-stop-daemon`, which is the underlying program that `/etc/init.d` scripts use to manage processes in our OS, relies on `readlink /proc/${PID}/exe` to tell if a process is running. In other words, on our containers right now, you can't run `/etc/init.d/${service} stop` for any service, and you can't do anything else that does that under the hood:
  - `rsyslog`'s logrotate configuration --- `rsyslogd` wants to restart itself (really, send itself SIGHUP) as a post-rotate action, and it uses init scripts to do that, so it doesn't work.
  - `/etc/init.d/apache2 restart` (`reload` does work).
- `init.d` scripts can't stop processes, but they can start processes just fine. So we can and do use `/etc/init.d/${service} start` to start our services at container launch time, and could use it again later if we knocked over a process.
- Instances run `systemd` (aka `/sbin/init`) and use that to start up a variety of services for a variety of reasons. Our containers don't --- on startup, they just start the set of services explicitly listed in `deploy/docker/startup.sh` in the codebase. This isn't directly related to this root cause, i'm just mentioning it because it's related to the `/proc` access thing --- if a bunch of services that aren't important to us depend deeply on `/proc`, we don't actually need to care about that, because we're not running those services on container sites. We only need to make sure the services we actually depend on (apache, mysql (for dev sites), cron, syslog, ssh, postfix) work.
- When logrotate rotates syslog, `/var/log/syslog` gets moved to `/var/log/syslog.1` and a new `/var/log/syslog` is created, but without the `kill -HUP` which the post-rotate script tries to do, syslog keeps writing to `/var/log/syslog.1`.
- `sudo -u syslog killall -HUP --user syslog` does work. So all we need to do is make that the post-rotate action, instead of trying to use the init.d script.

So my recommendation is that we make that small fix so that syslog rotation will do what we expect it to, and that we be aware of the lack of access to `/proc` and the corresponding failure of init scripts to behave how we'd expect on stop, but don't do anything about it, because we don't actually have a use case for needing those things.
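To make the `/proc` point concrete, here's a minimal sketch you can run in a shell. It inspects the shell's own PID (own-process reads tend to work even where cross-user `/proc` reads are restricted, so this illustrates the mechanism `start-stop-daemon` uses rather than reproducing the container failure):

```shell
#!/bin/sh
# Two routes to the same question: "what executable is behind this PID?"
pid=$$

# Route 1: the /proc symlink that start-stop-daemon relies on.
# On a plain instance this prints something like /bin/sh or /usr/bin/bash.
echo "via /proc: $(readlink /proc/${pid}/exe)"

# Route 2: ps, which still works inside our containers.
echo "via ps:    $(ps -p ${pid} -o comm=)"
```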
Okay, i have a branch pushed which makes that change [1]
, and i'll monitor it for a few days until it rotates the logs, and assuming it does the right thing, i'll put up a PR.
[1]
https://2908-install-systemd.cgolubi1.dev.buttonweavers.com/ui/ if you need it, but you don't, because i didn't make any changes impacting the website --- and, yes, the branch is somewhat misnamed, we're not actually doing anything with systemd. I'm planning to not lose sleep over it.
I'm not 100% confident in my validation of the log rotation solution, but:
```
$ ls -lart /var/log/syslog*
-rw-r----- 1 syslog adm 3319 May 6 06:39 /var/log/syslog.4.gz
-rw-r----- 1 syslog adm 916 May 7 06:39 /var/log/syslog.3.gz
-rw-r----- 1 syslog adm 901 May 8 06:39 /var/log/syslog.2.gz
-rw-r----- 1 syslog adm 8883 May 9 06:39 /var/log/syslog.1
-rw-r----- 1 syslog adm 2545 May 9 13:39 /var/log/syslog
```
So i'm going to call that good enough and PR my solution.
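For the record, the shape of the change is roughly this (a hypothetical sketch of an `/etc/logrotate.d/rsyslog` stanza --- the rotation options shown are stock Debian-ish defaults and may not match our image; the only real point is the postrotate swap):

```
/var/log/syslog
{
        rotate 7
        daily
        missingok
        notifempty
        compress
        delaycompress
        postrotate
                # Don't go through the init script (which needs /proc to
                # find the process); HUP the syslog user's rsyslogd directly:
                sudo -u syslog killall -HUP --user syslog
        endscript
}
```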
We're using vagrant to manage prod/staging/dev/replay buttonmen sites on EC2 instances (we use docker-based circleci to run regression tests). This has served us okay so far, but, since we committed to the vagrant+instance path, docker+container has kind of become the done thing. There are a couple of specific reasons we might want to migrate:
- The `vagrant-aws` plugin, which we totally rely on, is not being maintained. We should move away from that dependency.

This issue is to make and implement a plan for migrating site installs from vagrant+EC2 to docker+ECS.

Problems that need to be addressed (this list is incomplete):