NCVotes / voters-ingestor

Django app to create and configure Postgres DB, then fetch and process public voter registration and voter history files from N.C. Board of Elections
MIT License
10 stars 1 forks source link

Alternative approach #87

Closed vkurup closed 6 years ago

vkurup commented 6 years ago

Things were unbearably slow on production. After a lot of investigation, I think this boils down to 2 main problems:

  1. Django converts filter(data__party_cd='DEM') into WHERE data ->> 'party_cd' = 'DEM', rather than WHERE data @> "{'party_cd': 'DEM'}". The first one does not take advantage of the GIN index on data, while the second one does. So anytime we miss a materialized view, we get a very slow query.
  2. In order to efficiently add 'OR' queries, like we did for 'race_code', we needed to add a flag field for each of the permutations of flag combinations (up to 720, though almost always less). This increased the size of each row to the point that queries which were fast at the beginning of the migration, became 1-2 orders of magnitude slower by the end (even after vacuuming/analyzing).

In order to address both of these problems, I ran some trials of an approach (inspired by some of my discussions with Calvin), to use a single materialized view, optimized as much as possible for querying. So no JSON fields, and use the smallest fields possible ('A' instead of 'ACTIVE', and county_id instead of county_desc, for example). This view is used only for counts and to generate a list of IDs which can be used for the 'sample', at which point we join into the main table to get display data about each voter.

From this view, the longest queries now took 2-5 seconds, which was a huge improvement. On top of that, I've added a Query Cache table that caches the count for the various filters that we encounter which brings things down to 1-2 ms per query, and which is easy to refresh when we refresh the materialized view after each import.

This PR does the minimum to get that in place (using a feature flag: 'USE_SINGLE_QUERY_VIEW') without disrupting the other approach. I've deployed this to production. I'll work on a follow-up PR to add tests and remove the old approach, which will clear up a lot of data on production, possibly making it possible to downsize our RDS instance.