Switch VZ DB to AWS Aurora

patrickm02L commented 1 year ago

This issue was originally to grow our RDS cluster to the fixed size of a t3.medium instance.

(below issue updates written by Frank)

In the course of the last week's discovery day, I learned about the PG compatible RDS aurora service, which is a dynamically scalable PG solution that fits out use case well. This idea was floated very briefly in a weekly meeting, and it was indicated that this was worth investigating. The spirit of this idea is to pursue the most performance possible balanced against cost with the flexibility of dynamic scalability.

This issue is rescoped to include this investigation. Presently, I have running a test fixture consisting of:

RDS Aurora PG DB with a read replica
a ECS cluster running the latest and greatest version of graphql-engine
a application load balancer able to route over as many instances of the ECS tasks as needed
ACM managed SSL security on the ALB

frankhereford commented 1 year ago

@patrickm02L and I discussed this today in the office, and a t3.small is $50 a month, a t3.medium is $77, and a t3.large is 131. We need to have a size choice for both the production server and also its read replica, so if we want to make them match in capacity, multiply those numbers by two.

I recommend a db.t3.medium for VZ, which is what we're operating moped on, for reference.

frankhereford commented 1 year ago

I think we should actually consider this newer service offered by AWS. It's a postgres compatible service just like we've always used, but it can grow and shrink in capacity based on its load automatically. I used it with my discovery day experiment, and it was a drop in RDS solution. I believe at our current DB load, we may get a price drop, and we'd certainly welcome a speed increase if it can bring one. Big if, but worth a shot!

patrickm02L commented 1 year ago

The main concern is performance. Please make the call on which one would be more beneficial. If we need to we can change it later.

frankhereford commented 1 year ago

@patrickm02L thanks for that clarification. The serverless solution I described above is able to grow and shrink based on need, so it should address performance as well. That's going to need to be tested to be verified, but that's my hope.

I mentioned this super briefly in refinement today, and the next step will be to create a test serverless PG instance and load it up with data and start lobbing queries at it and see how it does.

patrickm02L commented 1 year ago

8/21/23

Spoke with Frank, and he suggested setting up a version of VZ with Amazon Aurora on data03 for Devs to test on - there isn't code to test per say, but we can do a "feel" test to get an understanding of performance and speed.

We'll create a PR as a placeholder for feedback and getting sign-off; this will allow the team to compare Amazon Aurora to our current Amazon RDS cluster and understand its performance and analytics.

frankhereford commented 1 year ago

Thanks for the update @patrickm02L! Thanks also for testing the evaluation VZE earlier with me.

I have a test instance setup that I'm sharing right now in slack with folks. I'll drop a link to this issue there too so people can add any comments they have. This test instance is a installation of the VZE backed by the new database solution and it can be directly compared to our production VZE.

I'm asking folks to give them a test, and if they like the performance, then we set it up in our production AWS account and move forward getting it installed on our shared AWS account, so people can evaluate the backend analytics too. I'll make a PR for formal approval when I set it up on the DTS account.

chiaberry commented 1 year ago

then we set it up in our production AWS account and move forward getting it installed on our shared AWS account

Roughly how long will getting it set up in the production AWS account take? is it like half a day of configuration? an hour? And just to check that im understanding this, after you set it up in our AWS account as a test run, you would then have to set it up as the VZ prod database, right? Does that involve setting the whole stack up again, or would it be more like flipping a setting?

frankhereford commented 1 year ago

Here's what rolling it out would look like I think:

1-2 hours of configuration to setup the new infrastructure
1 hour of copying over the data to the new setup
Changing the DNS entry from the old application load balancer to the new load balancer, which puts it into production.
Communicating the new PG endpoints to Xavier
PR'ing in changes to VZ ETLs which connect directly to the DB to point them to the new service

frankhereford commented 1 year ago

We have some hard data now to compare the Aurora DB performance against our t3.medium that we installed last night.

The fatalities view is CPU bound, so it's representing the speed of a single core in the two instances, not the database capacity per se. I believe we can work on the view that drives the fatalities page and bring it into the same relative speed regime that we see in our other listing pages. This work will become a spin off issue.

patrickm02L commented 12 months ago

9/20

Be sure to alert VZ team about downtime
Let them know about new endpoints
Make sure that DTS knows about new endpoints
Make updates in 1password
Will probably do this after hours

frankhereford commented 11 months ago

This migration was begun on Sunday, Oct 1. While the transfer was successful, and I was able to use the production VZE backed by the Aurora DB, I found that the read replica and the read/write database instances are required to share the same Security Groups and public IP configurations. This disables the mechanism we've used in the past to differentiate the two endpoints and prevent write access to the production database from all but the graphql-engine instance.

Putting the primary read/write database on the open internet, even if secured to only accept connections from the COA network and certain home IPs is a no-go. Anyone with a read-replica account would be able to connect to the read/write host due to the credentials being shared between the services.

The solution is to leave both services without public IPs. Everything that needs to get to the primary server will be able to as it's only the bastion host & the graphql-engine service. The read replica will be exposed on the internet via a TCP proxy running on the bastion host.

This can be done at least two ways.

iptables - this was my first thought, and it's my least preferred. iptables is challenging to manage, hard to read, and easy to get wrong enough to lock you out of a server.
haproxy - this is how I want to go. HAProxy (High Availability Proxy) was built to solve this problem and also offer us tons of flexibility in the future. I am going to set this up as a docker compose stack on the bastion such that it forwards 5432 on it's internet facing NIC to the read-replica Aurora instance. Then by moving our read-replica CNAME to point at the IP of the bastion, this will actually be a transparent move to our users and we have all of the configurability that HAPRoxy brings to the table to manage that proxy.

I intend to prove up this solution today and install it tonight and complete this issue's task.

frankhereford commented 11 months ago

I'm transferring over the production database now, which takes at least half an hour, so this is a good time to start some documentation on the process.

Phase One - Database stuff

First, create a PG Aurora cluster.

Next, execute a whole bunch of commands that look like this to establish roles on the server. We're bringing over all except Adrien's. I'm omitting everyone's handle so we don't post our names anymore than we have to.

# ❗ NB: This command is running locally on my laptop with a SSH tunnel providing `5434` to the new Aurora DB

echo "CREATE ROLE herefordf LOGIN PASSWORD '<snip>';" | \
docker run \
-e PGHOST=host.docker.internal -e PGPORT=5434 -e PGUSER=postgres \
-e PGPASSWORD=<snip> \
-e PGDATABASE=postgres \
-i postgres:latest psql -v ON_ERROR_STOP=1

Make sure the staff role is granted to everyone:

# this, but for everyone
grant staff to herefordf;

Make sure vze has everything. Run for both staging and production. Only production is shown here:

GRANT ALL PRIVILEGES ON DATABASE atd_vz_data TO vze;

-- Grant all privileges on all tables in the public schema
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO vze;

-- Grant all privileges on all sequences in the public schema
GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO vze;

-- Grant all privileges on all functions in the public schema
GRANT ALL PRIVILEGES ON ALL FUNCTIONS IN SCHEMA public TO vze;

After that, the technique is to make sure there's an empty database on the target Aurora cluster and use the standard postgresql tooling to transfer the data. Something like this:

Staging transfer:

# ❗ From here on out, running on the bastion host
echo "drop database if exists atd_vz_data_staging with (force);" | \
docker run \
-e PGHOST=vision-zero-aurora-instance-1.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
-e PGPASSWORD=<snip> \
-e PGDATABASE=postgres \
-i postgres:latest psql -v ON_ERROR_STOP=1

echo "create database atd_vz_data_staging;" | \
docker run \
-e PGHOST=vision-zero-aurora-instance-1.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
-e PGPASSWORD=<snip> \
-e PGDATABASE=postgres \
-i postgres:latest psql -v ON_ERROR_STOP=1

docker run \
-e PGHOST=vision-zero.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
-e PGPASSWORD=<snip> \
-e PGDATABASE=atd_vz_data_staging \
-it --log-driver none postgres:latest pg_dump | \
perl -pe 's/^(SELECT pg_catalog\.set_config)/-- $1/g' | \
perl -pe 's/^(.*haighta.*)/-- $1/g' | \
perl -pe 's/^(.*(pgis_geometry_union_finalfn|st_clusterkmeans).*)/-- $1/g' |
docker run \
-e PGHOST=vision-zero-aurora-instance-1.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
-e PGPASSWORD=<snip> \
-e PGDATABASE=atd_vz_data_staging \
-i --log-driver none postgres:latest psql -v ON_ERROR_STOP=1

and the production transfer:

echo "drop database if exists atd_vz_data with (force);" | \
docker run \
-e PGHOST=vision-zero-aurora-instance-1.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
-e PGPASSWORD=<snip> \
-e PGDATABASE=postgres \
-i postgres:latest psql -v ON_ERROR_STOP=1

echo "create database atd_vz_data;" | \
docker run \
-e PGHOST=vision-zero-aurora-instance-1.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
-e PGPASSWORD=<snip> \
-e PGDATABASE=postgres \
-i postgres:latest psql -v ON_ERROR_STOP=1

docker run \
-e PGHOST=vision-zero.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
-e PGPASSWORD=<snip> \
-e PGDATABASE=atd_vz_data \
-it --log-driver none postgres:latest pg_dump | \
perl -pe 's/^(SELECT pg_catalog\.set_config)/-- $1/g' | \
perl -pe 's/^(.*haighta.*)/-- $1/g' | \
perl -pe 's/^(.*(pgis_geometry_union_finalfn|st_clusterkmeans).*)/-- $1/g' | \
docker run \
-e PGHOST=vision-zero-aurora-instance-1.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
-e PGPASSWORD=<snip> \
-e PGDATABASE=atd_vz_data \
-i --log-driver none postgres:latest psql -v ON_ERROR_STOP=1

The perl inline regex translations include:

The standard one I have to use every time (set_config)
Ignore grants for Adrien
Ignore two deprecated postGIS functions. We're upgrading from 3.1 to 3.3.

Be sure to include the --log-driver none or your docker containers will try to log the entire pg_dump output both on the way out and back into the other container. ☠️

With a modest EC2 machine on network with both instances, this is an hour and a half to two hours in duration.

Phase 2 - Infra Stuff

Change the DNS entry for db.vision-zero.austinmobility.io to CNAME to the new DB cluster primary server
- No change needed on graphql-engine, it checks very frequently.
Spin up the HAProxy proxy on the bastion host (This is where I hit the wall on Sunday.)
Change the DNS entry for db-rr.vision-zero.austinmobility.io to CNAME to the bastion host
- This is transparent to RR users, so 👍
Update the Vision Zero Database and Vision Zero Database Configuration 1PW entries.
- This is used when ETLS (outside the firewall) need to connect to the RDS instances through the bastion

Phase 3 - Testing

VZE, no changes needed 👍
Read Replica with no end user config changes 👍
No changes to graphql-engine 👍
cris_import.py test import via Airflow 👍

frankhereford commented 11 months ago

Baring anything unexpected tomorrow, I think this is taken care of. Thank you to everyone for your help!

cityofaustin / atd-data-tech