Closed by patrickm02L 11 months ago
@patrickm02L and I discussed this today in the office. A t3.small is $50 a month, a t3.medium is $77, and a t3.large is $131. We need a size choice for both the production server and its read replica, so if we want them to match in capacity, multiply those numbers by two.
I recommend a `db.t3.medium` for VZ, which is what we're operating moped on, for reference.
I think we should actually consider a newer service offered by AWS. It's a Postgres-compatible service just like we've always used, but it can grow and shrink in capacity automatically based on its load. I used it in my discovery day experiment, and it was a drop-in RDS replacement. I believe at our current DB load we may see a price drop, and we'd certainly welcome a speed increase if it can bring one. Big if, but worth a shot!
The main concern is performance. Please make the call on which one would be more beneficial. If we need to, we can change it later.
@patrickm02L thanks for that clarification. The serverless solution I described above is able to grow and shrink based on need, so it should address performance as well. That will need to be tested to verify, but that's my hope.
I mentioned this super briefly in refinement today. The next step will be to create a test serverless PG instance, load it up with data, and start lobbing queries at it to see how it does.
Spoke with Frank, and he suggested setting up a version of VZ with Amazon Aurora on data03 for devs to test on. There isn't code to test per se, but we can do a "feel" test to get an understanding of performance and speed.
We'll create a PR as a placeholder for feedback and getting sign-off; this will allow the team to compare Amazon Aurora to our current Amazon RDS cluster and understand its performance and analytics.
Thanks for the update @patrickm02L! Thanks also for testing the evaluation VZE earlier with me.
I have a test instance set up that I'm sharing right now in Slack with folks. I'll drop a link to this issue there too so people can add any comments they have. This test instance is an installation of the VZE backed by the new database solution, and it can be directly compared to our production VZE.
I'm asking folks to give it a test, and if they like the performance, then we'll set it up in our production AWS account and move forward getting it installed on our shared AWS account, so people can evaluate the backend analytics too. I'll make a PR for formal approval when I set it up on the DTS account.
> then we set it up in our production AWS account and move forward getting it installed on our shared AWS account
Roughly how long will getting it set up in the production AWS account take? Is it like half a day of configuration? An hour? And just to check that I'm understanding this: after you set it up in our AWS account as a test run, you would then have to set it up as the VZ prod database, right? Does that involve setting the whole stack up again, or would it be more like flipping a setting?
Here's what rolling it out would look like, I think:
We have some hard data now to compare the Aurora DB performance against our `t3.medium` that we installed last night.
The fatalities view is CPU bound, so it represents the speed of a single core in the two instances, not the database capacity per se. I believe we can work on the view that drives the fatalities page and bring it into the same relative speed regime that we see in our other listing pages. This work will become a spin-off issue.
This migration was begun on Sunday, Oct 1. While the transfer was successful, and I was able to use the production VZE backed by the Aurora DB, I found that the read replica and the read/write database instances are required to share the same Security Groups and public IP configurations. This disables the mechanism we've used in the past to differentiate the two endpoints and prevent write access to the production database from all but the `graphql-engine` instance.
Putting the primary read/write database on the open internet, even if it's secured to accept connections only from the COA network and certain home IPs, is a no-go. Anyone with a read-replica account would be able to connect to the read/write host, because the credentials are shared between the services.
The solution is to leave both services without public IPs. Everything that needs to reach the primary server will still be able to, as that's only the bastion host and the `graphql-engine` service. The read replica will be exposed on the internet via a TCP proxy running on the bastion host.
This can be done in at least two ways:

- `iptables` - this was my first thought, and it's my least preferred. `iptables` is challenging to manage, hard to read, and easy to get wrong badly enough to lock you out of a server.
- `haproxy` - this is how I want to go. HAProxy (High Availability Proxy) was built to solve this problem and also offers us tons of flexibility in the future. I am going to set this up as a `docker compose` stack on the bastion such that it forwards `5432` on its internet-facing NIC to the read-replica Aurora instance. Then, by moving our `read-replica` CNAME to point at the IP of the bastion, this becomes a transparent move for our users, and we get all of the configurability that HAProxy brings to the table to manage that proxy. I intend to prove out this solution today, install it tonight, and complete this issue's task.
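As a sketch of what that stack could look like (the hostname, timeouts, and directory here are placeholders, not our actual settings), the bastion would carry a minimal `haproxy.cfg` plus a compose file, something like:

```shell
# Hypothetical sketch of the bastion's TCP-proxy stack; hostnames are placeholders.
mkdir -p /tmp/bastion-proxy && cd /tmp/bastion-proxy

# haproxy.cfg: forward the internet-facing port 5432 to the Aurora read replica
cat > haproxy.cfg <<'EOF'
defaults
    mode tcp
    timeout connect 5s
    timeout client  1h
    timeout server  1h

frontend pg_read_replica
    bind *:5432
    default_backend aurora_rr

backend aurora_rr
    server rr vision-zero-aurora-replica.example.us-east-1.rds.amazonaws.com:5432 check
EOF

# docker-compose.yml: run haproxy with the config mounted read-only
cat > docker-compose.yml <<'EOF'
services:
  haproxy:
    image: haproxy:latest
    ports:
      - "5432:5432"
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
    restart: unless-stopped
EOF

# docker compose up -d   # start the proxy on the bastion
```

With the `read-replica` CNAME pointed at the bastion, clients keep connecting to the same name and port; only the hop through HAProxy is new.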
I'm transferring over the production database now, which takes at least half an hour, so this is a good time to start some documentation on the process.
First, create a PG Aurora cluster.
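For reference, cluster creation can be scripted via the AWS CLI. This is a hypothetical sketch, not our actual invocation: the identifier, engine version defaults, and serverless capacity bounds are assumptions.

```shell
# Hypothetical sketch; identifiers and scaling bounds are placeholders.
CLUSTER_ID="vision-zero-aurora"

# Create the Aurora PostgreSQL cluster (Serverless v2 capacity is set on the
# cluster; instances then use the db.serverless instance class).
aws rds create-db-cluster \
  --db-cluster-identifier "$CLUSTER_ID" \
  --engine aurora-postgresql \
  --master-username postgres \
  --master-user-password '<snip>' \
  --serverless-v2-scaling-configuration MinCapacity=0.5,MaxCapacity=8

aws rds create-db-instance \
  --db-instance-identifier "${CLUSTER_ID}-instance-1" \
  --db-cluster-identifier "$CLUSTER_ID" \
  --db-instance-class db.serverless \
  --engine aurora-postgresql
```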
Next, execute a whole bunch of commands that look like this to establish roles on the server. We're bringing over all except Adrien's. I'm omitting everyone's handle so we don't post our names any more than we have to.
```shell
# ❗ NB: This command is running locally on my laptop with an SSH tunnel providing `5434` to the new Aurora DB
echo "CREATE ROLE herefordf LOGIN PASSWORD '<snip>';" | \
  docker run \
    -e PGHOST=host.docker.internal -e PGPORT=5434 -e PGUSER=postgres \
    -e PGPASSWORD=<snip> \
    -e PGDATABASE=postgres \
    -i postgres:latest psql -v ON_ERROR_STOP=1
```
Make sure the `staff` role is granted to everyone:
```sql
-- this, but for everyone
grant staff to herefordf;
```
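Since these two statements are repeated per person, a small loop can generate them all at once. This is a hypothetical sketch with placeholder handles and placeholder passwords (the real ones are intentionally not posted):

```shell
# Placeholder handles; real handles and passwords are omitted on purpose.
handles="alice bob carol"

# Emit a CREATE ROLE plus a staff grant for each person.
for h in $handles; do
  printf "CREATE ROLE %s LOGIN PASSWORD '<snip>';\n" "$h"
  printf "GRANT staff TO %s;\n" "$h"
done > /tmp/roles.sql

# Then feed the file through the same dockerized psql shown above, e.g.:
# docker run ... -i postgres:latest psql -v ON_ERROR_STOP=1 < /tmp/roles.sql
```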
Make sure `vze` has everything. Run for both staging and production. Only production is shown here:
```sql
GRANT ALL PRIVILEGES ON DATABASE atd_vz_data TO vze;

-- Grant all privileges on all tables in the public schema
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO vze;

-- Grant all privileges on all sequences in the public schema
GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO vze;

-- Grant all privileges on all functions in the public schema
GRANT ALL PRIVILEGES ON ALL FUNCTIONS IN SCHEMA public TO vze;
```
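One caveat worth flagging: `GRANT ... ON ALL TABLES` only covers objects that exist when it runs. If we also want `vze` to keep its privileges on objects created later, PostgreSQL's default privileges are the standard mechanism. A sketch, assuming future objects in `public` are created by the `postgres` role:

```sql
-- Assumption: future objects in public are created by the postgres role.
ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA public
    GRANT ALL PRIVILEGES ON TABLES TO vze;
ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA public
    GRANT ALL PRIVILEGES ON SEQUENCES TO vze;
ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA public
    GRANT ALL PRIVILEGES ON FUNCTIONS TO vze;
```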
After that, the technique is to make sure there's an empty database on the target Aurora cluster and use the standard postgresql tooling to transfer the data. Something like this:
Staging transfer:
```shell
# ❗ From here on out, running on the bastion host
echo "drop database if exists atd_vz_data_staging with (force);" | \
  docker run \
    -e PGHOST=vision-zero-aurora-instance-1.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
    -e PGPASSWORD=<snip> \
    -e PGDATABASE=postgres \
    -i postgres:latest psql -v ON_ERROR_STOP=1

echo "create database atd_vz_data_staging;" | \
  docker run \
    -e PGHOST=vision-zero-aurora-instance-1.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
    -e PGPASSWORD=<snip> \
    -e PGDATABASE=postgres \
    -i postgres:latest psql -v ON_ERROR_STOP=1

docker run \
  -e PGHOST=vision-zero.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
  -e PGPASSWORD=<snip> \
  -e PGDATABASE=atd_vz_data_staging \
  -it --log-driver none postgres:latest pg_dump | \
perl -pe 's/^(SELECT pg_catalog\.set_config)/-- $1/g' | \
perl -pe 's/^(.*haighta.*)/-- $1/g' | \
perl -pe 's/^(.*(pgis_geometry_union_finalfn|st_clusterkmeans).*)/-- $1/g' | \
docker run \
  -e PGHOST=vision-zero-aurora-instance-1.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
  -e PGPASSWORD=<snip> \
  -e PGDATABASE=atd_vz_data_staging \
  -i --log-driver none postgres:latest psql -v ON_ERROR_STOP=1
```
and the production transfer:
```shell
echo "drop database if exists atd_vz_data with (force);" | \
  docker run \
    -e PGHOST=vision-zero-aurora-instance-1.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
    -e PGPASSWORD=<snip> \
    -e PGDATABASE=postgres \
    -i postgres:latest psql -v ON_ERROR_STOP=1

echo "create database atd_vz_data;" | \
  docker run \
    -e PGHOST=vision-zero-aurora-instance-1.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
    -e PGPASSWORD=<snip> \
    -e PGDATABASE=postgres \
    -i postgres:latest psql -v ON_ERROR_STOP=1

docker run \
  -e PGHOST=vision-zero.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
  -e PGPASSWORD=<snip> \
  -e PGDATABASE=atd_vz_data \
  -it --log-driver none postgres:latest pg_dump | \
perl -pe 's/^(SELECT pg_catalog\.set_config)/-- $1/g' | \
perl -pe 's/^(.*haighta.*)/-- $1/g' | \
perl -pe 's/^(.*(pgis_geometry_union_finalfn|st_clusterkmeans).*)/-- $1/g' | \
docker run \
  -e PGHOST=vision-zero-aurora-instance-1.<snip>.us-east-1.rds.amazonaws.com -e PGPORT=5432 -e PGUSER=postgres \
  -e PGPASSWORD=<snip> \
  -e PGDATABASE=atd_vz_data \
  -i --log-driver none postgres:latest psql -v ON_ERROR_STOP=1
```
The perl inline regex translations comment out: the `SELECT pg_catalog.set_config(...)` statements, any lines mentioning `haighta`, and any lines mentioning `pgis_geometry_union_finalfn` or `st_clusterkmeans`.

Be sure to include the `--log-driver none` flag, or your docker containers will try to log the entire `pg_dump` output both on the way out of one container and on the way back into the other. ☠️
With a modest EC2 machine on network with both instances, the transfer takes an hour and a half to two hours.
- Move `db.vision-zero.austinmobility.io` to CNAME to the new DB cluster primary server; `graphql-engine` picks this up quickly, as it checks very frequently.
- Move `db-rr.vision-zero.austinmobility.io` to CNAME to the bastion host.
- Update the Vision Zero Database and Vision Zero Database Configuration 1PW entries.
- `graphql-engine` 👍
- `cris_import.py` test import via Airflow 👍

Barring anything unexpected tomorrow, I think this is taken care of. Thank you to everyone for your help!
This issue was originally to grow our RDS cluster to the fixed size of a `t3.medium` instance. (Below issue updates written by Frank.)
In the course of last week's discovery day, I learned about the PG-compatible RDS Aurora service, which is a dynamically scalable PG solution that fits our use case well. This idea was floated very briefly in a weekly meeting, and it was indicated that it was worth investigating. The spirit of this idea is to pursue the most performance possible, balanced against cost, with the flexibility of dynamic scalability.
This issue is rescoped to include this investigation. Presently, I have a test fixture running, consisting of: