DemocracyClub / EveryElection

:ballot_box_with_check: For recording every election in the UK
https://elections.democracyclub.org.uk/
BSD 3-Clause "New" or "Revised" License

remove per-instance crontab entries #2200

Closed chris48s closed 1 month ago

chris48s commented 2 months ago

Right. So I did some digging. All of these commands are actually running on the EE prod server(s). EE has been restoring itself from a copy of its own backup every night :scream_cat:

I don't think it was the intent that these jobs should have been running on WDIV servers running their own EE instance. We definitely wouldn't have wanted them running snoop, for example, and sync_elections is there as an alternative to restoring from backup every night.
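
For context, these are ordinary per-instance cron entries, roughly of this shape (paths, schedules and the user field here are illustrative, not the exact lines that were installed):

```
# Illustrative only: the general shape of the per-instance entries,
# with made-up paths and schedules.
0 2 * * * ee /path/to/venv/bin/python /path/to/manage.py snoop
0 3 * * * ee /path/to/venv/bin/python /path/to/manage.py sync_elections
# ...plus the nightly restore-from-backup job described above.
```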

I'm also a bit worried that we have no log aggregation and not really any visibility into these cron jobs, but I'm not going to do anything about those problems here.

This is one of those PRs that asks a lot of questions as well as answering some, so I'm going to leave some comments inline.

chris48s commented 2 months ago

The other big worry I have here is

https://github.com/DemocracyClub/EveryElection/blob/9caa6c31767d3aad6299c0e81a185d3596da1273/appspec.yml#L85-L87

When are these commands run? Are we running this on every deploy, including to production?? If it helps, the original commit doing all this was https://github.com/DemocracyClub/EveryElection/commit/e48dd917e45517800bb3685d2b3654a1eff88990

symroe commented 2 months ago

> The other big worry I have here is
>
> https://github.com/DemocracyClub/EveryElection/blob/9caa6c31767d3aad6299c0e81a185d3596da1273/appspec.yml#L85-L87
>
> When are these commands run? Are we running this on every deploy, including to production?? If it helps, the original commit doing all this was e48dd91

Thinking about this more, I expect this was all caused by us copying the WDIV CDK model. Specifically, in WDIV we use replication, and here we don't.

When using replication, you want to do something with the local DB before the instance serves traffic. When using a single remote postgres, you don't. I think I must have been confused between the two models.
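
To spell the difference out (the commands below are purely illustrative, not what either project literally runs):

```bash
# WDIV-style replication model: each instance runs its own local Postgres,
# so the deploy has to get data into that local DB before the app can serve
# traffic, e.g. something along the lines of:
sudo -u postgres createdb every_election                  # made-up names
pg_restore --no-owner -d every_election /tmp/latest.dump

# EE's model: a single shared Postgres on RDS that persists across deploys,
# so there is nothing to do to a local database at deploy time.
```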

chris48s commented 2 months ago

> I must have been confused between the two models.

OK... so the upshot for this PR is: remove that line from appspec.yml then?

coveralls commented 2 months ago

Coverage Status

coverage: 70.675% (+0.001%) from 70.674% when pulling 981ddb653dbd67b11be99da5883dc1278e685283 on crontab20243007 into d49959b2270777ca25302836fa8e2fce52e3528b on master.

chris48s commented 2 months ago

Right.

Just for future reference: this was less bad than we first thought. (For reasons) as well as having a shared RDS instance, the EE VMs also have a local Postgres install. The rebuild_local_db script was running against that local Postgres install, not the shared RDS.


So I removed that line from appspec.yml. My next thought was that this also makes

https://github.com/DemocracyClub/EveryElection/blob/38ad35a793f5600511c1c94630e7570e202f0ece/cdk_stacks/stacks/command_runner.py#L94-L99

wrong because we really want staging to run that against the staging RDS, not the local DB.

Then I realised there is no staging RDS (or at least no staging DB connection credentials), and probably no dev RDS either. So maybe this is actually sort of right. I think the next question is: do we think it is right that dev/staging are using a local DB rather than RDS?

chris48s commented 2 months ago

I reckon removing the command from appspec.yml is why the deploy has completely hung as well.

It's probably getting a 500 from the healthcheck endpoint or whatever because the (local) DB is empty.

chris48s commented 1 month ago

We agreed that what we would do is:

So here's a quick summary of what I've done, just because a bunch of stuff has happened outside of code/GitHub:

I created 2 DBs on the shared development RDS called ee_dev and ee_staging.

Then I restored a dump from prod into each of those 2 DBs.
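
That was just standard Postgres tooling, roughly along these lines (host, admin user and dump filename are placeholders):

```bash
# Create the two new databases on the shared development RDS.
createdb -h <dev-rds-host> -U <admin-user> ee_dev
createdb -h <dev-rds-host> -U <admin-user> ee_staging

# Restore a custom-format dump taken from prod into each of them.
pg_restore --no-owner --no-privileges -h <dev-rds-host> -U <admin-user> -d ee_dev ee_prod.dump
pg_restore --no-owner --no-privileges -h <dev-rds-host> -U <admin-user> -d ee_staging ee_prod.dump
```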

Then I made 2 users called ee_dev and ee_staging.
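
Again roughly speaking (passwords are placeholders here, obviously):

```bash
# Create the two login roles on the shared development RDS.
psql -h <dev-rds-host> -U <admin-user> -d postgres \
  -c "CREATE ROLE ee_dev WITH LOGIN PASSWORD '<password>';"
psql -h <dev-rds-host> -U <admin-user> -d postgres \
  -c "CREATE ROLE ee_staging WITH LOGIN PASSWORD '<password>';"
```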

Then I ran:

```
ee_dev=> GRANT CONNECT ON DATABASE ee_dev TO ee_dev;
ee_dev=> GRANT USAGE ON SCHEMA public TO ee_dev;
ee_dev=> GRANT CREATE ON SCHEMA public TO ee_dev;
ee_dev=> GRANT ALL PRIVILEGES ON DATABASE ee_dev TO ee_dev;
ee_dev=> GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO ee_dev;
ee_dev=> ALTER DATABASE ee_dev OWNER TO ee_dev;
```

...and the equivalent for the ee_staging user/DB.

(may not have actually happened in that exact order, but that's the most coherent way to present it :smile: )

One of the reasons I mention this is that I suspect it's not identical to the way the other DBs/users on this host are set up.

Then in AWS Parameter Store I set the

EE_DATABASE_HOST
EE_DATABASE_NAME
EE_DATABASE_USER
EE_DATABASE_PASSWORD

environment variables for each of the two environments (dev and staging). Dev is deployed with those settings and confirmed working. We haven't deployed to staging yet; it should deploy successfully now that I've got the permissions right on dev, but that is as yet untested. Separately, EE staging can't currently be accessed due to a CloudFront configuration issue, but that's an issue for another time.
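
Setting those four EE_DATABASE_* values via the AWS CLI looks roughly like this for each environment (the exact parameter paths/namespacing should follow whatever convention the existing EE parameters already use):

```bash
# Illustrative: the dev environment's values; staging gets its own equivalents.
aws ssm put-parameter --name "EE_DATABASE_HOST" --type "String" --value "<dev-rds-host>" --overwrite
aws ssm put-parameter --name "EE_DATABASE_NAME" --type "String" --value "ee_dev" --overwrite
aws ssm put-parameter --name "EE_DATABASE_USER" --type "String" --value "ee_dev" --overwrite
aws ssm put-parameter --name "EE_DATABASE_PASSWORD" --type "SecureString" --value "<password>" --overwrite
```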

chris48s commented 1 month ago

I'm going to deploy this one first to make sure the staging RDS setup is fine. Then I will do https://github.com/DemocracyClub/EveryElection/pull/2198 once I've checked that.