
MN Election Results

Flask-based scraper for Minnesota elections with an API that returns JSON data for display on our election result dashboard. Structurally, this application is based on this example, which itself is a restructuring of this example and its accompanying article.

Data structure

Data sources

Boundary data, for drawing maps and plotting locations, comes from Represent Minnesota. By default, it assumes we're using https://represent-minnesota.herokuapp.com but this is configurable by a .env value, BOUNDARY_SERVICE_URL.
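For example, to point at the default boundary service explicitly, the .env entry would look like this:

BOUNDARY_SERVICE_URL=https://represent-minnesota.herokuapp.com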

Adding an election

Metadata about each election is managed in scraper_sources.json. Each election usually has a set of similarly named files, one per group of races, though some files can be named inconsistently.

Add a new object keyed by the date of the election, in YYYYMMDD format. It should contain objects for results and other supplemental tables, with one entry per file that needs to be processed.

"20140812": {
  "meta": {
    "date": "2014-08-12",
    "files_url": "ftp://media:results@ftp.sos.state.mn.us/20140812/",
    "primary": true
  },
  "us_house_results": {
    "url": "ftp://media:results@ftp.sos.state.mn.us/20140812/ushouse.txt",
    "table": "results",
    "type": "results",
    "results_scope": "us_house"
  },
  [additional entries, one per file]
}

In theory this should be it, assuming the scraper can reconcile everything. There is a good chance, though, that formatting changes could break the scraper, or that the scraper does not know how to fully process some results.

The current version of scraper_sources.json only works with this application as far back as the 20200303 key; older elections run into scrape errors. Elections older than 2020 are also likely using incorrect boundary sets due to redistricting.

Manual data

Both manual results and contest question text can be managed in Google Spreadsheets.

A good example of an election's JSON entry with manual data stored in a spreadsheet is:

"20211102": {
  "meta": {
    "base_url": "https://electionresultsfiles.sos.state.mn.us/20211102/",
    "date": "2021-11-02",
    "primary": false
  },
  [the standard entries],
  "raw_csv_supplemental_results": {
    "url": "https://s3.amazonaws.com/data.minnpost/projects/minnpost-mn-election-supplements/2021/Election+Results+Supplement+2021-11-02+-+Results.csv",
    "type": "raw_csv"
  },
  "raw_csv_supplemental_contests": {
    "url": "https://s3.amazonaws.com/data.minnpost/projects/minnpost-mn-election-supplements/2021/Election+Results+Supplement+2021-11-02+-+Contests.csv",
    "type": "raw_csv"
  },
  "supplemental_contests": {
    "spreadsheet_id": "1Jkt6UzHh-3h_sT_9VQ2GWu4It9Q96bQyL00j5_R0bqg",
    "worksheet_id": 0,
    "notes": "Worksheet ID is the zero-based ID from the order of workssheets and is used to find the actual ID."
  },
  "supplemental_results": {
    "spreadsheet_id": "1Jkt6UzHh-3h_sT_9VQ2GWu4It9Q96bQyL00j5_R0bqg",
    "worksheet_id": 1
  }
}

Google Sheets to JSON API setup

For both local and remote environments, you'll need to have access to an instance of the Google Sheets to JSON API that itself has access to the Google Sheet(s) that you want to process. If you don't already have access to a working instance of that API, set it up and ensure it's working first.

Credentials

To access the Google Sheets to JSON API you'll need to have two configuration values in your .env or in your Heroku settings.

Configuration

Use the following additional fields in your .env or in your Heroku settings.

Local setup and development

  1. Install git
  2. Get the code: git clone https://github.com/MinnPost/minnpost-scraper-mn-election-results.git
  3. Change the directory: cd minnpost-scraper-mn-election-results
  4. Create a .env file based on the repository's .env-example file in the root of your project.
  5. Run pipenv install.
  6. Open up three command line tabs if you need to run the scheduled scraping tasks as well as the API. In each tab, run pipenv shell. Check the Procfile in this repository for the commands that should be run.
  7. For the worker command, optionally include the -E flag to monitor the task events that the worker receives.
  8. In the tab where you want to run the Flask-based API, instead of using gunicorn for local development, run flask run --host=0.0.0.0. This creates a basic endpoint server at http://0.0.0.0:5000.
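As a rough sketch of what those tabs might look like, assuming a typical Celery worker and beat setup (the actual invocations and the <celery app> module name are placeholders; use whatever the Procfile specifies):

# Tab 1: the Flask-based API, using the development server instead of gunicorn
pipenv shell
flask run --host=0.0.0.0

# Tab 2: the Celery worker; -E monitors task events the worker receives
pipenv shell
celery -A <celery app> worker -E

# Tab 3: the Celery beat scheduler for the scheduled scraping tasks
pipenv shell
celery -A <celery app> beat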

Local setup for Postgres

This documentation describes how to install Postgres with Homebrew.

  1. Run brew install postgresql to install Postgres.
  2. Start the Postgres server (for example, by running brew services start postgresql) and connect to it by running psql postgres.
  3. A free, Mac-based graphical manager for Postgres is Postbird.
  4. Create a database. For this example, call it election-scraper.
  5. Installing with Homebrew creates a user with no password. The connection string will be "postgresql://username:@localhost/election-scraper". Enter this connection string as the DATABASE_URL value in the .env file.
  6. To set up the database tables and columns without any data, run flask db upgrade in a command line.
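Putting those steps together, a minimal local bootstrap might look like the following sketch; the username in the connection string is a placeholder for your own macOS user:

brew install postgresql
brew services start postgresql        # start the server in the background
createdb election-scraper             # the database name used in this example
# in .env: DATABASE_URL="postgresql://username:@localhost/election-scraper"
pipenv run flask db upgrade           # create the tables and columns without any data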

To get the data for the database, you can also export it from Heroku.
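One way to do that with the Heroku CLI is pg:pull, which copies the remote database into a new local one (a sketch; the app name is a placeholder, and the local database must not already exist):

heroku pg:pull DATABASE_URL election-scraper --app <heroku app name>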

Note: when the SQL structure changes, run flask db migrate and add any changes to the migrations folder to the Git repository.

See the scraper section below for commands to run after local setup is finished.

Local setup for Celery

This documentation describes how to install our Celery requirements with Homebrew.

  1. Run brew install redis to install Redis.
  2. By default, the Redis connection string is redis://127.0.0.1:6379/0. Replace 0 with another number if you are already using Redis for other purposes and would like to keep the databases separate. Whatever value you use, put it into the REDIS_URL value of your .env file.
  3. By default, this application uses Redis for the application cache and for the Celery backend. If you'd like to use something else for the Celery backend, add a different value to RESULT_BACKEND in your .env file.
  4. Run brew install rabbitmq to install RabbitMQ.
  5. By default, the RabbitMQ connection string is amqp://guest:guest@127.0.0.1:5672. We store it in the CLOUDAMQP_URL .env value, as this matches Heroku's use of the CloudAMQP add-on.
  6. By default, this application uses CloudAMQP as the Celery broker. If you'd like to use something else, add a different value to the CELERY_BROKER_URL value.
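Taken together, a local .env might contain something like this; the commented-out overrides are only needed if you want to deviate from the defaults described above, and their values here are illustrative:

REDIS_URL=redis://127.0.0.1:6379/0
CLOUDAMQP_URL=amqp://guest:guest@127.0.0.1:5672
# Optional overrides for the Celery backend and broker
# RESULT_BACKEND=redis://127.0.0.1:6379/1
# CELERY_BROKER_URL=amqp://guest:guest@127.0.0.1:5672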

Note: in a local environment, it tends to be fine to use Redis in place of RabbitMQ, but this does not work with Heroku's free Redis plan.

Note: if the application changes its task structure and Celery tries to run old tasks, run the celery purge command from within the application's virtualenv.

Production setup and deployment

Code, Libraries and prerequisites

This application should be deployed to Heroku. If you are creating a new Heroku application, clone this repository with git clone https://github.com/MinnPost/minnpost-scraper-mn-election-results.git and follow Heroku's instructions to create a Heroku remote.

Production setup for Postgres

Add the Heroku Postgres add-on to the Heroku application. The amount of data that this scraper uses will require at least the Hobby Basic plan. Heroku allows two applications to share the same database. They provide instructions for this.

To get the data into the database, you can import it into Heroku, either from the included election-scraper-structure.sql file or from your local database once it has data in it.
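Two hedged sketches of how that import could be done with the Heroku CLI (the app name is a placeholder):

# Load the included structure file into the Heroku database
heroku pg:psql --app <heroku app name> < election-scraper-structure.sql

# Or push a local database that already has data in it
heroku pg:push election-scraper DATABASE_URL --app <heroku app name>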

If you want to create an empty installation of the Flask database structure, or if the database structure changes and the changes need to be added to Heroku, run heroku run flask db upgrade. Flask's migration system will create all of the tables and relationships.

Run the scraper commands from the section below by following Heroku's instructions for running Python commands. Generally, run commands on Heroku by adding heroku run before the rest of the command listed below.

Production setup for Celery

Once the application is deployed to Heroku, Celery will be ready to run. To enable it, run the command heroku ps:scale worker=1. See Heroku's Celery deployment documentation. To run the worker dyno as well, the Heroku application needs to be on a paid plan.

Note: if the application changes its task structure and Celery tries to run old tasks, run the celery purge command from within the application's virtualenv.

Production setup for Redis and RabbitMQ

In the resources section of the Heroku application, add the Heroku Data for Redis and CloudAMQP add-ons. Unless we learn otherwise, CloudAMQP should be able to use the free plan, while Heroku Data for Redis should be able to use the cheapest paid plan.

Redis is used for caching data for the front end, and as the backend for Celery tasks. RabbitMQ is used as the broker for Celery tasks.

Scraping data

This application runs several tasks to scrape data from all of the data sources in the background. Whenever a scraper task runs, it will clear any cached data related to that task. In other words, the result scraper will clear any cached result queries. This is designed to keep the application from displaying cached data that is older than the newest scraped data.

On a Schedule

While the scraper's tasks can be run manually, they are designed primarily to run automatically at intervals, which are configurable within the application's settings.

The default scrape behavior is to run these scraper tasks based on the DEFAULT_SCRAPE_FREQUENCY configuration value (which is stored in seconds and defaults to 86400 seconds, or one day):

The default behavior is primarily designed to structure the data before an election occurs, although it may also catch changes when results are finalized.

There are multiple ways that the application can run the results task much more frequently. This is designed to track the status of contests as results come in, for example on election night, whether or not all the results are in yet.

Set the start and end window as configuration values

To set an election return window by configuration values, use the ELECTION_DAY_RESULT_HOURS_START and ELECTION_DAY_RESULT_HOURS_END settings. Both of these values should be stored as full datetime strings, such as "2022-08-23T00:00:00-0600".

If the application detects that the current time is between these start and end values, it will run the results task based on the ELECTION_DAY_RESULT_SCRAPE_FREQUENCY configuration value, which is stored in seconds. See the .env-example and config.py files for how this value is set.
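For example, a 24-hour window on election day 2022 might look like this in .env (the end value here is illustrative; 180 seconds matches the default election-day frequency):

ELECTION_DAY_RESULT_HOURS_START="2022-08-23T00:00:00-0600"
ELECTION_DAY_RESULT_HOURS_END="2022-08-24T00:00:00-0600"
ELECTION_DAY_RESULT_SCRAPE_FREQUENCY=180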

Use the election date from the scraper sources

If the ELECTION_DAY_RESULT_HOURS_START and ELECTION_DAY_RESULT_HOURS_END settings are not filled out, the application will look at the election data in the scraper_sources.json file. Each entry should have a date value, and the application will assume that date is the election date. From there, the application will use the ELECTION_DAY_RESULT_DEFAULT_START_TIME (midnight by default) and ELECTION_DAY_RESULT_DEFAULT_DURATION_HOURS (48 hours by default) values to determine start and end values for election day behavior.

If the application detects that the current time is between these start and end values (for example, between 8pm on election day and 8pm the following day), it will run the results task based on the ELECTION_DAY_RESULT_SCRAPE_FREQUENCY configuration value, which is stored in seconds. It defaults to run every 180 seconds, which is three minutes.

Use the override configuration value

This window detection behavior can be overridden by setting the ELECTION_RESULT_DATETIME_OVERRIDDEN configuration value. If it is set to "true", the results task will run according to the ELECTION_DAY_RESULT_SCRAPE_FREQUENCY value, regardless of what day it is. If it is set to "false", the results task will run according to the DEFAULT_SCRAPE_FREQUENCY value, regardless of what day it is. Don't use either value in ELECTION_RESULT_DATETIME_OVERRIDDEN unless the current behavior specifically needs to be overridden; remove the setting after the override is no longer necessary.
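For example, to force the election-day frequency regardless of the date, the .env line would be:

ELECTION_RESULT_DATETIME_OVERRIDDEN=true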

In a Browser

To run the scraper in a browser, use the following URLs:

Note: ELECTION_DATE_OVERRIDE is an optional override configuration value that can be added to .env. The newest election will be used if not provided. If an override is necessary, the value should be the key of the object in the scraper_sources.json file; for instance 20140812.
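For example, to pin the scraper to the August 2014 primary defined earlier in scraper_sources.json:

ELECTION_DATE_OVERRIDE=20140812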

The scraper endpoints can receive parameters that limit what is scraped. Each endpoint, unless otherwise noted, can receive data in GET, POST, and JSON formats. Unless otherwise noted, all scraper endpoints accept an optional election_id parameter, for example https://minnpost-mn-election-results.herokuapp.com/scraper/areas/?election_id=id-20211102.
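As a sketch, the same areas scrape could be triggered with a query-string parameter or, assuming the JSON body is parsed the same way, with a POST request:

# GET with a query-string parameter
curl "https://minnpost-mn-election-results.herokuapp.com/scraper/areas/?election_id=id-20211102"

# POST with an equivalent JSON body
curl -X POST -H "Content-Type: application/json" \
  -d '{"election_id": "id-20211102"}' \
  "https://minnpost-mn-election-results.herokuapp.com/scraper/areas/"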

Command line

this part is not done

Ideally, there would be command line equivalents of the scraper URLs. Previously, these commands were used:

  1. python code/scraper.py scrape areas <ELECTION_DATE>
  2. python code/scraper.py scrape questions <ELECTION_DATE>
  3. python code/scraper.py scrape match_contests <ELECTION_DATE>
  4. python code/scraper.py scrape results <ELECTION_DATE>

Accessing the API

The application's API returns the most recent data, in JSON format, that has been stored by the scraper tasks. Once an API endpoint has been requested, data is cached based on the API settings, and it is returned by the application until either the relevant scraper task runs again, or until the cache expires. The cache's default expiration is stored in seconds in the CACHE_DEFAULT_TIMEOUT configuration value. There is a separate value for the Google Sheet API's timeout, which is stored (also in seconds) in the PARSER_API_CACHE_TIMEOUT configuration value.

To access the scraper's data, use the following URLs. These URLs will return all of the contents of the respective models:

The API endpoints can receive parameters that limit what is returned. Each endpoint, unless otherwise noted, can receive data in GET, POST, and JSON formats.

For all endpoints

Unless otherwise noted, all API endpoints can receive parameters with a "true" or "false" value to control cache behavior: bypass_cache, delete_cache, and cache_data.
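For example, to bypass the cache on the SQL query endpoint described below (a sketch; the cache parameters should combine with an endpoint's own parameters):

curl "https://minnpost-mn-election-results.herokuapp.com/api/query/?q=select%20*%20from%20meta&bypass_cache=true"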

Default values

SQL query

This endpoint returns the result of a valid select SQL query. For example, to run the query select * from meta, use the URL https://minnpost-mn-election-results.herokuapp.com/api/query/?q=select%20*%20from%20meta. This endpoint currently powers the legacy election dashboard on MinnPost, although ideally we will be able to replace it with proper calls to the SQLAlchemy models.

This endpoint also accepts a callback parameter. If it is present, it returns the data as JavaScript instead of JSON, for use as JSONP. This is needed for the legacy election dashboard on MinnPost.
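A sketch of both forms, where handleResults is a hypothetical callback name:

# Plain JSON
curl "https://minnpost-mn-election-results.herokuapp.com/api/query/?q=select%20*%20from%20meta"

# JSONP: the response is JavaScript that calls handleResults(...) with the data
curl "https://minnpost-mn-election-results.herokuapp.com/api/query/?q=select%20*%20from%20meta&callback=handleResults"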

Areas

The Areas endpoint can receive area_id, area_group, and election_id parameters.

Contests and Contest Boundaries

The Contests and Contest Boundaries endpoints can both receive title, contest_id, contest_ids (for multiple contests), election_id, and address parameters.

Note: for address to work, there needs to be a valid MapQuest API key in the GEOCODER_MAPQUEST_KEY configuration value, as shown in .env-example.

Elections

The Elections endpoint can receive election_id and election_date parameters.

Questions

The Questions endpoint can receive question_id, contest_id, and election_id parameters.

Results

The Results endpoint can receive result_id, contest_id, and election_id parameters.