freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Create new hosted database replica for client #1244

Closed mlissner closed 4 years ago

mlissner commented 4 years ago

For our latest replication client, we will be hosting the database for them. This makes it a lot easier for us to do things like database migrations, because we can see logs on both sides.

Some basic requirements:

  1. It needs to have its own IP address

  2. It needs to have its own hardware and scaling

  3. It needs to continue our strong security posture

There are a few ways to do this, listed in order of simplicity:

  1. We use AWS replicas.

    • Pros: These are wicked easy to set up. They can scale however needed, and seem to use logical replication under the covers.
    • Cons: They replicate all tables, which we cannot do.
    • Verdict: Pass. We can't give clients access to our user tables, which this would do.
  2. We can set up a separate database in our AWS account.

    • Pros: Fairly simple. Pretty similar to our other replication clients.
    • Cons: AWS limits you to 20 DBs, so that could one day be an issue.
    • Verdict: Maybe!
  3. We could set up entirely separate AWS accounts with separate RDS instances.

    • Pros: Fully independent system. Exactly how our other clients are set up. No scaling limitations I can think of. Networking firewall rules already established for other customers.
    • Cons: Complex. It'll be a pain jumping between accounts all the time. We will need to create and manage lots of AWS accounts, which comes with who knows what overhead. There would be ingress and egress fees for the data that we replicate (such fees don't apply in options 1 or 2).
    • Verdict: Seems overly complex.

I actually like option three a lot because of how clean it is. There are no real differences between this client and our previous ones, except that we handle the email address. I think longer term though I'd rather have everybody in option two. No need for extra AWS accounts; lower fees; etc. It's actually simpler in general.

The only other piece of this that's annoying is that regardless of the option we take, we'll have to set up another EC2 instance to run a proxy server so that we have a static IP address for the DB.
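Options two and three depend on replicating only selected tables, which is what a PostgreSQL publication and subscription pair provides. The following is a minimal sketch of the two statements involved, not the exact ones used here. The publication, subscription, table, and host names are hypothetical placeholders, and identifiers are interpolated unquoted, so this assumes simple lowercase names.

```python
from typing import Sequence


def publication_sql(name: str, tables: Sequence[str]) -> str:
    """Return a CREATE PUBLICATION statement covering only the listed tables."""
    if not tables:
        raise ValueError("list at least one table to publish")
    return f"CREATE PUBLICATION {name} FOR TABLE {', '.join(tables)};"


def subscription_sql(name: str, publication: str, conninfo: str) -> str:
    """Return the CREATE SUBSCRIPTION statement to run on the replica."""
    # Escape single quotes so the connection string is a valid SQL literal.
    escaped = conninfo.replace("'", "''")
    return (
        f"CREATE SUBSCRIPTION {name} "
        f"CONNECTION '{escaped}' "
        f"PUBLICATION {publication};"
    )


# Hypothetical names; the real table list and connection string are not in this issue.
print(publication_sql("client_pub", ["opinions", "dockets"]))
print(
    subscription_sql(
        "client_sub",
        "client_pub",
        "host=publisher.example.com dbname=example user=replicator",
    )
)
```

Leaving the user tables out of the publication's table list is what keeps them off the client's replica.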

mlissner commented 4 years ago

Creating the DB

So far, nothing too exciting here. A couple parameters to take note of though:

mlissner commented 4 years ago

Networking

Generally this went well. The general architecture is:

The trick is to set this all up from bottom to top and to have it all work at the end. You have to do it that way so each piece can connect to the piece before.
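One way to confirm each layer before building the next is a plain TCP check against port 5432. This is only a sketch under assumptions: the host names in the usage comment are hypothetical, the check has to run from somewhere that can route to each layer (for example, inside the VPC for the RDS endpoint), and it only shows that the port accepts connections, not that Postgres authentication works.

```python
import socket
from typing import Iterable, Optional, Tuple


def can_connect(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_unreachable(
    layers: Iterable[Tuple[str, str]], port: int = 5432, timeout: float = 3.0
) -> Optional[str]:
    """Check layers in the given (bottom-to-top) order and return the first
    layer name that cannot be reached, or None if every layer answers."""
    for name, host in layers:
        if not can_connect(host, port, timeout):
            return name
    return None


# Example usage with hypothetical host names, listed bottom to top:
# layers = [
#     ("rds", "example-db.abc123.us-west-2.rds.amazonaws.com"),
#     ("haproxy", "10.0.0.12"),
#     ("elb", "example-nlb.elb.us-west-2.amazonaws.com"),
#     ("route53", "codename.example.com"),
# ]
# print(first_unreachable(layers))
```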

The RDS instance

See above. No major tricks here.

EC2 & HAProxy

This is running on an EC2 micro instance that's built from a saved AMI. Just launch that AMI, go into the proxy settings, and tweak those settings to point to the RDS instance. Do not create new keys or use existing ones; the aws-replication-keys.pem key is built into the AMI.

Restart HAProxy for good measure. Note that the AMI has full bash history that's useful to look at.
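The proxy settings baked into the AMI are not shown in this issue. As a rough idea of what a TCP pass-through to an RDS instance can look like, here is a minimal sketch that renders a HAProxy config. The section name, server label, and timeout values are assumptions, and the RDS host is a placeholder.

```python
def render_haproxy_cfg(rds_host: str, port: int = 5432) -> str:
    """Render a minimal HAProxy config that forwards raw TCP to an RDS endpoint."""
    lines = [
        "defaults",
        "    mode tcp",  # Postgres is not HTTP, so proxy at the TCP level
        "    timeout connect 10s",
        "    timeout client 1h",  # assumed values; long-lived DB sessions need generous timeouts
        "    timeout server 1h",
        "",
        "listen postgres",
        f"    bind *:{port}",
        f"    server rds {rds_host}:{port} check",
        "",
    ]
    return "\n".join(lines)


# Placeholder endpoint; substitute the real RDS host name.
print(render_haproxy_cfg("example-db.abc123.us-west-2.rds.amazonaws.com"))
```

By default HAProxy resolves the server host name when it starts, so restarting it after changing the RDS endpoint is needed for the new address to take effect.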

ELB Target Groups

Create a target group on port 5432 and register the EC2 instance as a target.

Elastic Load Balancer

Set up an ELB, modeling it on the existing ones. It will use the target group created above via a listener on port 5432.
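For anyone scripting the target group and listener instead of using the console, the sketch below builds the keyword arguments for the boto3 `elbv2` client's `create_target_group`, `register_targets`, and `create_listener` calls. It assumes the load balancer is a Network Load Balancer (TCP on port 5432), which this issue does not state outright. All names, IDs, and ARNs are hypothetical.

```python
def target_group_params(name: str, vpc_id: str, port: int = 5432) -> dict:
    """Keyword arguments for elbv2 create_target_group (TCP, instance targets)."""
    return {
        "Name": name,
        "Protocol": "TCP",
        "Port": port,
        "VpcId": vpc_id,
        "TargetType": "instance",
    }


def register_targets_params(target_group_arn: str, instance_id: str, port: int = 5432) -> dict:
    """Keyword arguments for elbv2 register_targets, pointing at the proxy EC2 instance."""
    return {
        "TargetGroupArn": target_group_arn,
        "Targets": [{"Id": instance_id, "Port": port}],
    }


def listener_params(load_balancer_arn: str, target_group_arn: str, port: int = 5432) -> dict:
    """Keyword arguments for elbv2 create_listener, forwarding TCP to the target group."""
    return {
        "LoadBalancerArn": load_balancer_arn,
        "Protocol": "TCP",
        "Port": port,
        "DefaultActions": [{"Type": "forward", "TargetGroupArn": target_group_arn}],
    }


# Example usage (requires boto3 and AWS credentials; IDs are placeholders):
# import boto3
# elbv2 = boto3.client("elbv2")
# elbv2.create_target_group(**target_group_params("client-replica", "vpc-0123456789abcdef0"))
```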

Route 53

This part is easy. Just set up an A record as an alias to the ELB. Remember that subdomains aren't private, so use codenames here if needed.
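The alias record can also be expressed as a change batch for the boto3 `route53` client's `change_resource_record_sets` call. This is a sketch with hypothetical values. Note that the `HostedZoneId` inside `AliasTarget` is the load balancer's own canonical hosted zone ID, not the ID of the zone that holds the record.

```python
def alias_record_change(record_name: str, elb_dns_name: str, elb_hosted_zone_id: str) -> dict:
    """Build a Route 53 ChangeBatch that upserts an A record aliased to a load balancer."""
    return {
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "AliasTarget": {
                        "HostedZoneId": elb_hosted_zone_id,
                        "DNSName": elb_dns_name,
                        "EvaluateTargetHealth": False,
                    },
                },
            }
        ]
    }


# Example usage (requires boto3 and AWS credentials; all values are placeholders):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z_EXAMPLE_ZONE",
#     ChangeBatch=alias_record_change(
#         "codename.example.com.",
#         "example-nlb.elb.us-west-2.amazonaws.com.",
#         "Z_EXAMPLE_ELB_ZONE",
#     ),
# )
```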

mlissner commented 4 years ago

A couple notes on IP addresses, having tried several things today:

  1. For some reason, when an RDS DB has a public IP address, other RDS DBs will use that address to connect to it, even when private IP addresses are available. That's frustrating and I've asked about the issue here.

  2. The ideal and simple solution to this issue would be to ensure that all our RDS instances always had only private IP addresses. Unfortunately, that doesn't really work because we need a public IP in order to replicate to the RDS instance. Without that public IP, our subscription to the master server fails.

  3. To deal with this, we need to recognize that the traffic is going out of AWS from the RDS publisher and then back into AWS to the subscriber. As such, the VPC rules need to allow outbound traffic from the publisher and inbound traffic to the subscriber. This is annoying, but it's not horrible. It does mean we need very good passwords, but we do that anyway.

When we do IP address changes to scale or otherwise change settings, we need to think about three locations: