freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Create new database replication offering #902

Closed. mlissner closed this issue 5 years ago.

mlissner commented 5 years ago

One of our clients is tired of the API's slowness and wants to get right down to business using a replica of our database. This is something we hadn't considered before, but it fits nicely with our mission, since it will give unfettered access to the data in a way that was never previously possible.

I believe it will also help us in a business sense to gain contributors to the API, something that hasn't been as successful as I'd like. It's a pretty common move in business to go for the Goldilocks approach. This approach says that if you are selling two toasters, a premium and a regular, everybody will buy the regular. Add a super-premium to the shelf and everybody will buy what was previously the premium model, even if its price didn't change at all. In other words, by offering three price points, the middle one will be the one people want.

In a sense right now we have two "products" as described above: Our free API and an API where you contribute to FLP. Most people choose the free one when it comes down to it even though their usage is often pretty big. I think that just by adding a premium database replication feature, a lot more people will start contributing for their API usage. On top of this, obviously, some people will want the replication, which is great too since there's really nothing better than unfettered access to the raw SQL. Some people will continue not contributing and that often makes sense too.

Anyway, I've been researching this. So far I've investigated three options:

  1. Amazon DMS. This option was suggested by the client and on the surface it makes a lot of sense. It's a system within AWS for migrating and then replicating a postgres DB. That's cool and we tried making it work over the past couple days. Some problems we ran into:

    • It has a whole pile of limitations including fun ones like not supporting TRUNCATE or setting default column values.

    • Setting up the networking using AWS's virtual private cloud offerings looked painful.

    • There is just a lot of documentation for things on AWS. After opening ten tabs I started getting pretty annoyed about the black-boxiness of it all.

    • I played with getting it all set up nicely and just found it rather frustrating in all kinds of ways. Web GUIs for server administration suck.

    But on the other hand, it:

    • Lets you migrate from postgres to one of a dozen different DBs. For some orgs that might be pretty important. I'll have to investigate this a bit.
  2. pglogical. I looked at this for a while today too. Notes:

    • Con: Doesn't support schema migrations.

    • Con: Has pretty bad documentation. It's literally one long HTML page, has typos, etc. It feels like it was written for employees of 2ndQuadrant, who can then walk over to the developer's desk to get clarifications.

    • ~Con: It requires superuser privileges, something that AWS DMS servers don't give you.~ It's true that you don't get superuser privileges, but pglogical is indeed available in AWS.

    • Con: It requires some kind of postgres extension to be installed, which just feels like perpetual baggage to carry around forever. Not thrilling.

    • Pro: It has column and row filtering, which our next option does not. This could be pretty useful if we want to avoid sharing things with clients.

  3. Native logical replication. Starting with PostgreSQL 9.4, the pglogical code has slowly been added to Postgres itself. Not all of the features are there, but the ones that are seem solid, and the documentation is much better than pglogical's.

    • Con: No column filtering. I thought this would be a big deal, but I went back through our models and did an analysis of the columns we wouldn't want to share. We can do table filtering, and my conclusion was that right now we don't need column filtering: a few of our columns, like view_count, are things we probably don't want to share, but the rest are fine. I think this is generally OK.

    • Con: No schema migration. None of the solutions so far seem to have this. I thought AWS could figure it out, but nope. Anyway, I feel like by avoiding third party tools we can handle this ourselves. We will have to do this by process instead of code, which is a bit of a bummer, but so be it. I did find one blog post that shows how this might look in practice (a rough sketch of the process follows this list).

    • Pros: Built in, actively gaining enhancements (TRUNCATE support came in Postgres 11, for example), and has the best documentation of the three.
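For the record, here's a rough sketch of what that migration-by-process approach could look like. The subscription and column names are illustrative; ALTER SUBSCRIPTION ... DISABLE/ENABLE are the standard Postgres 10 commands:

-- Pause replication on the subscriber before the schema change:
ALTER SUBSCRIPTION testsubscription DISABLE;
-- Apply the same DDL on both the publisher and the subscriber, e.g.:
ALTER TABLE search_opinion ADD COLUMN some_new_field text;
-- Re-enable replication once both schemas match again:
ALTER SUBSCRIPTION testsubscription ENABLE;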

So I'm leaning towards using postgres's native logical replication feature at this point.

The process for setting up this feature will be something like:

  1. Enable logical replication on the publisher (wal_level = logical).
  2. Create a replication role with narrow permissions and grant it SELECT on the tables we want to share.
  3. Create a publication for those tables.
  4. Dump the schema and load it onto the replica.
  5. Create a subscription on the replica pointing at the publication.

mlissner commented 5 years ago

I'm working on getting a demo database put together to test this, using AWS to easily spin up a database server.

Some notes:

The docs say that after SSL is set up, you have to do some weird stuff with the connection string, but this connection string is now working for me:

psql --host SOMEHOST.rds.amazonaws.com --username xxx --password --dbname xxx
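For what it's worth, if SSL ever needs to be enforced explicitly, the standard libpq keywords should cover it. A sketch, assuming the RDS CA bundle has been downloaded locally (treat the filename as an assumption):

psql "host=SOMEHOST.rds.amazonaws.com user=xxx dbname=xxx sslmode=verify-full sslrootcert=rds-combined-ca-bundle.pem"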

So I think that server is set up and it's onwards to the CL server itself, to enable replication on it.

mlissner commented 5 years ago

OK, continuing along, we've got a pretty good tutorial from DigitalOcean:

https://www.digitalocean.com/community/tutorials/how-to-set-up-logical-replication-with-postgresql-10-on-ubuntu-18-04

A couple things to note:

  1. ~I think it's necessary to add users with access to the replication database in pg_hba.conf.~ This is a little weird since there is no replication database; it just means allowing a user that wants to do replication. UPDATE: This is confusing, but no, the replication permission is not needed. See more below, and the sketch after this list for the entry that actually is needed.

  2. The DO instructions provide really wide permissions to the replication user. The fix for this is to create users that have very narrow ones. The requirements for this are laid out here:

    https://www.postgresql.org/docs/10/logical-replication-security.html

    Implementing this meant:

    CREATE ROLE testreplicator WITH REPLICATION LOGIN PASSWORD 'long-password';
    \c courtlistener
    GRANT SELECT ON TABLE search_opinion TO testreplicator;

    That creates the user (aka role) that can do replication and that can do SELECT on that one table. I'll post a command later with the full list of tables, but for now we're just doing some tests.

    The next thing that's needed is to create a publication. You do that with a superuser, not with the replication user, so there's no need to give CREATE privileges to the replication user as described in the DO docs. Instead, as the superuser (postgres), do:

    CREATE PUBLICATION testpublication FOR TABLE ONLY search_opinion;

    That should create a publication named testpublication that publishes only changes to that one table.

  3. The last steps are to migrate the schema to the replica and create a subscription on the replica. I'm working on these steps, but the biggest hurdle here is that we need to open port 5432 in our firewall. We did that already, but forwarded traffic to the wrong server, so...we're in the midst of taking another swing at this.
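Since the special "replication" keyword turned out not to apply here (more on that in the next comment), the pg_hba.conf entry on the publisher ends up being a normal host line for the replication user and the courtlistener database. A sketch, with an illustrative client address:

# pg_hba.conf on the publisher
host    courtlistener    testreplicator    203.0.113.45/32    md5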

mlissner commented 5 years ago

Alright! That was very hard, but I've got a replication of search_opinion streaming to my laptop! This is awesome and it is a very good demonstration of the perils that lie ahead. If I turn off my laptop and go to bed now, as I so want to do, any changes that come in overnight will be queued until I come back. If my laptop is stolen and never replaced, we'll just pile up records on the server for a good while. I think there are configurations that'd eventually start deleting data, but then the replica would never get it until it got a fresh dump of data.

To avoid the pile up, I need to remove my subscription. That shouldn't be a huge deal. Just have to do it with:

drop subscription testsubscription; 

Some other notes about what made this so hard:

  1. You can't authenticate until you've got pg_hba.conf properly configured. Alas, pg_hba.conf has a configuration for "replication", but that config is for physical replication, not the logical replication we're doing. If you try to use it, you'll get an unhelpful error message and be stuck until somebody on IRC finally guesses the issue:

    That configuration that says 'replication'? That's for the other kind of replication.

    This is a documentation issue in postgresql and I've filed it for them to fix. I hope they will.

    In the meantime, this is an essential command for making changes to pg_hba.conf go live:

    sudo service postgresql reload
  2. Once you're authenticated, the only other hurdle is getting the schema mapped across. I was able to get the current schema for the CL database with:

    pg_dump -U postgres -Cs courtlistener > cl_schema.sql

    Then on the replica side, I needed to set up the correct credentials using:

    CREATE USER django WITH PASSWORD 'your-password' CREATEDB NOSUPERUSER;
    CREATE DATABASE courtlistener WITH OWNER django TEMPLATE template0 ENCODING 'UTF-8';
    GRANT ALL PRIVILEGES ON DATABASE courtlistener to django;

    Then I ran:

    sudo -H -u postgres psql < cl_schema.sql
  3. With the schema in place, I could create the subscription with:

    CREATE SUBSCRIPTION testsubscription CONNECTION 'host=some-domain-name.aws.com port=5432 password=xxx user=testreplicator dbname=courtlistener' PUBLICATION testpublication;

    Do not use an IP address for the connection since it will change.
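If the hostname ever does have to change, the subscription's connection string can be updated in place rather than recreating the subscription. A sketch, reusing the names from above with an illustrative new host:

ALTER SUBSCRIPTION testsubscription CONNECTION 'host=new-domain-name.aws.com port=5432 password=xxx user=testreplicator dbname=courtlistener';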

There are a couple ways to monitor the replication, but I haven't found them super helpful. On the publisher you can run:

SELECT application_name, backend_start, sent_lsn, write_lsn, state, sync_state FROM pg_stat_replication;

And on the subscriber you can run:

SELECT * FROM pg_stat_subscription;

What I did find useful was running:

sudo ncdu /var/lib/postgresql/10/main/

And watching my network monitor, which showed the data flowing in.

The only remaining things TBD are:

  1. How do you do this on an Amazon server?
  2. Is this approach fully secure? I think so and I've been cautious along the way, but I need to do some security review to figure out the risks. Worst case, we'll have to mitigate them via backups and clauses in the contracts.
mlissner commented 5 years ago

Good news: I've got this replicating for our first client. No major issues; I just had to add them to the hba file, add a new role for their user, and then a new publication for them. On their AWS server, it had to be set up for logical replication, and then a subscriber set up.

mlissner commented 5 years ago

Note that logical replication uses "replication slots". According to the docs:

https://www.postgresql.org/docs/10/warm-standby.html#STREAMING-REPLICATION-SLOTS

Replication slots provide an automated way to ensure that the master does not remove WAL segments until they have been received by all standbys, and that the master does not remove rows which could cause a recovery conflict even when the standby is disconnected.

In lieu of using replication slots, it is possible to prevent the removal of old WAL segments using wal_keep_segments, or by storing the segments in an archive using archive_command. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments known to be needed. An advantage of these methods is that they bound the space requirement for pg_wal; there is currently no way to do this using replication slots.

So we'll need monitoring of the disk usage. If a client's server goes away without telling us, WAL will pile up and fill our drive until we drop their replication slot.
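A sketch of what that monitoring could check on the publisher: how much WAL each replication slot is currently holding back (pg_current_wal_lsn and pg_wal_lsn_diff are the Postgres 10 function names):

SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;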

Here's an example of this biting somebody and how they fixed it: https://stackoverflow.com/questions/47266359/out-of-space-due-to-wal-files

mlissner commented 5 years ago

One thing to note that can be super frustrating. When you create an RDS database with a default security group, you get a default rule that your IP can connect to that server. But, alas, no other IPs can. This comes back to bite you later, I tell ya. The fix is to add specific rules for which servers can connect to the RDS server, which I've now done. The only IPs that can connect are our own and those of our clients.
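For the record, adding one of those rules from the command line looks something like this (the security group ID and client address are illustrative placeholders):

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 5432 \
  --cidr 203.0.113.45/32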

mlissner commented 5 years ago

OK, finally launching this for our first client, with this on our replication server:

CREATE PUBLICATION opendata FOR TABLE ONLY people_db_abarating, people_db_attorney, people_db_attorneyorganization, people_db_attorneyorganizationassociation, people_db_criminalcomplaint, people_db_criminalcount, people_db_education, people_db_financialdisclosure, people_db_party, people_db_partytype, people_db_person, people_db_person_race, people_db_politicalaffiliation, people_db_position, people_db_race, people_db_retentionevent, people_db_role, people_db_school, people_db_source, search_citation, search_court, search_docket, search_docket_panel, search_docket_tags, search_docketentry, search_docketentry_tags, search_opinion, search_opinion_joined_by, search_opinioncluster, search_opinioncluster_non_participating_judges, search_opinioncluster_panel, search_opinionscited, search_originatingcourtinformation, search_recapdocument, search_recapdocument_tags, search_tag;

And create the role on the RDS publisher:

CREATE ROLE user WITH LOGIN PASSWORD 'long-password';
grant rds_replication to user;
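Note that before any of that, the RDS instance's parameter group needs logical replication enabled at all, since RDS doesn't expose postgresql.conf. I believe the relevant parameter is rds.logical_replication = 1, applied with a reboot; a quick check from psql afterwards:

SHOW wal_level;   -- should report 'logical' once the parameter is applied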

And this on their subscriber:

CREATE SUBSCRIPTION opendata CONNECTION 'host=xxx port=5432 password=xxx user=xxx dbname=courtlistener' PUBLICATION opendata;
mlissner commented 5 years ago

And I also had to grant SELECT to the replication user on the publisher (our replication server):

grant select on table people_db_abarating, people_db_attorney, people_db_attorneyorganization, people_db_attorneyorganizationassociation, people_db_criminalcomplaint, people_db_criminalcount, people_db_education, people_db_financialdisclosure, people_db_party, people_db_partytype, people_db_person, people_db_person_race, people_db_politicalaffiliation, people_db_position, people_db_race, people_db_retentionevent, people_db_role, people_db_school, people_db_source, search_citation, search_court, search_docket, search_docket_panel, search_docket_tags, search_docketentry, search_docketentry_tags, search_opinion, search_opinion_joined_by, search_opinioncluster, search_opinioncluster_non_participating_judges, search_opinioncluster_panel, search_opinionscited, search_originatingcourtinformation, search_recapdocument, search_recapdocument_tags, search_tag TO XXXX;
mlissner commented 5 years ago

And if you want to see if data has come across the wire, you can also use this:

select schemaname, relname, n_live_tup from pg_stat_user_tables order by n_live_tup DESC;
mlissner commented 5 years ago

Note that https://github.com/freelawproject/courtlistener/issues/932 formalizes a lot of what's here (by design).

mlissner commented 5 years ago

The documentation for this feature is now live: https://www.courtlistener.com/api/replication/