Flask-based scraper for Minnesota elections with an API that returns JSON data for display on our election result dashboard. Structurally, this application is based on this example, which itself is a restructuring of this example and its accompanying article.
See `src/scraper/models.py` to see what is assumed.

Boundary data, for drawing maps and plotting locations, comes from Represent Minnesota. By default, the application assumes we're using https://represent-minnesota.herokuapp.com, but this is configurable with the `BOUNDARY_SERVICE_URL` value in `.env`.
Metadata about each election is managed in `scraper_sources.json`. The source files for a given election are often named similarly, but there are usually separate files for each group of races, and some can be named inconsistently.

Add a new object keyed by the date of the election, in `YYYYMMDD` format. This should contain objects for results and other supplemental tables; there should be one entry per file that needs to be processed.
"20140812": {
"meta": {
"date": "2014-08-12",
"files_url": "ftp://media:results@ftp.sos.state.mn.us/20140812/",
"primary": true
},
"us_house_results": {
"url": "ftp://media:results@ftp.sos.state.mn.us/20140812/ushouse.txt",
"table": "results",
"type": "results",
"results_scope": "us_house"
},
In theory this should be it, assuming the scraper can reconcile everything. There is a good chance, though, that formatting changes could break the scraper, or that the scraper does not know how to fully process some results.
The current version of `scraper_sources.json` only works with this application as far back as the `20200303` key; older elections run into scrape errors. Elections older than 2020 are also likely to use incorrect boundary sets due to redistricting.
Both manual results and contest question text can be managed in Google Spreadsheets.
A good example of an election's JSON entry with manual data stored in a spreadsheet is:
"20211102": {
"meta": {
"base_url": "https://electionresultsfiles.sos.state.mn.us/20211102/",
"date": "2021-11-02",
"primary": false
},
[the standard entries],
"raw_csv_supplemental_results": {
"url": "https://s3.amazonaws.com/data.minnpost/projects/minnpost-mn-election-supplements/2021/Election+Results+Supplement+2021-11-02+-+Results.csv",
"type": "raw_csv"
},
"raw_csv_supplemental_contests": {
"url": "https://s3.amazonaws.com/data.minnpost/projects/minnpost-mn-election-supplements/2021/Election+Results+Supplement+2021-11-02+-+Contests.csv",
"type": "raw_csv"
},
"supplemental_contests": {
"spreadsheet_id": "1Jkt6UzHh-3h_sT_9VQ2GWu4It9Q96bQyL00j5_R0bqg",
"worksheet_id": 0,
"notes": "Worksheet ID is the zero-based ID from the order of workssheets and is used to find the actual ID."
},
"supplemental_results": {
"spreadsheet_id": "1Jkt6UzHh-3h_sT_9VQ2GWu4It9Q96bQyL00j5_R0bqg",
"worksheet_id": 1
}
}
For both local and remote environments, you'll need to have access to an instance of the Google Sheets to JSON API that itself has access to the Google Sheet(s) that you want to process. If you don't already have access to a working instance of that API, set it up and ensure it's working first.
To access the Google Sheets to JSON API, you'll need two configuration values in your `.env` file or in your Heroku settings:

- `AUTHORIZE_API_URL = "http://0.0.0.0:5000/authorize/"` (wherever the API is running, it uses an `authorize` endpoint)
- `PARSER_API_KEY = ""` (a valid API key that is accepted by the installation of the API that you're accessing)

Use the following additional fields in your `.env` file or in your Heroku settings:

- `PARSER_API_URL = "http://0.0.0.0:5000/parser/"` (wherever the API is running, it uses a `parser` endpoint)
- `OVERWRITE_API_URL = "http://0.0.0.0:5000/parser/custom-overwrite/"` (wherever the API is running, it uses a `parser/custom-overwrite` endpoint)
- `PARSER_API_CACHE_TIMEOUT = "500"` (how many seconds the customized cache should last; `0` means it won't expire)
- `PARSER_STORE_IN_S3` (provide a "true" or "false" value to set whether the API should send the JSON to S3; if you leave this blank, it will follow the API's settings)
To run the application locally:

1. Use `git` to clone the repository: `git clone https://github.com/MinnPost/minnpost-scraper-mn-election-results.git`.
2. `cd minnpost-scraper-mn-election-results`.
3. Create a `.env` file based on the repository's `.env-example` file in the root of your project.
4. Run `pipenv install`.
5. Run `pipenv shell`.
6. Start the Celery worker; check the Procfile in this repository for the commands that should be run, and use the `-E` flag to monitor task events that the worker receives.
7. Run `flask run --host=0.0.0.0`. This creates a basic endpoint server at http://0.0.0.0:5000.
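As a consolidated sketch of the install-and-run steps above, a local session might look like this; the Celery worker line is only a placeholder, since the exact command comes from the Procfile and the application module name isn't shown here:

```bash
pipenv install
pipenv shell
# Start the Celery worker in a separate shell.
# Copy the exact command from the Procfile and add -E to monitor task events,
# e.g. (placeholder module name): celery -A <app-module> worker -E
flask run --host=0.0.0.0   # serves the endpoints at http://0.0.0.0:5000
```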
This documentation describes how to install Postgres with Homebrew.

1. Run `brew install postgresql` to install Postgres.
2. Run `psql postgres` to start the server and log in to it.
3. Create a database called `election-scraper`.
4. The connection string will be `postgresql://username:@localhost/election-scraper`. Enter this connection string as the `DATABASE_URL` value of the `.env` file.
5. Run `flask db upgrade` in a command line.
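A rough shell sketch of those steps (assuming the Postgres service is running; `createdb` is just one way to create the database, and you can instead create it from inside `psql postgres`):

```bash
brew install postgresql
createdb election-scraper
# Put the connection string in .env:
#   DATABASE_URL="postgresql://username:@localhost/election-scraper"
flask db upgrade   # creates the tables and relationships via Flask's migrations
```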
To get the data for the database, you can also export it from Heroku.

Note: when the SQL structure changes, run `flask db migrate` and add any changes in the `migrations` folder to the Git repository.
See the scraper section below for commands to run after local setup is finished.
This documentation describes how to install our Celery requirements with Homebrew.
1. Run `brew install redis` to install Redis.
2. The Redis URL will be `redis://127.0.0.1:6379/0`. Replace `0` with another number if you are already using Redis for other purposes and would like to keep the databases separate. Whatever value you use, put it into the `REDIS_URL` value of your `.env` file.
3. Use the same value for `RESULT_BACKEND` in your `.env` file.
4. Run `brew install rabbitmq` to install RabbitMQ.
5. The RabbitMQ URL will be `amqp://guest:guest@127.0.0.1:5672`. We store it in the `CLOUDAMQP_URL` `.env` value, as this matches Heroku's use of the CloudAMQP add-on.
6. The application uses CloudAMQP as the Celery broker. If you'd like to use something else, add a different value to the `CELERY_BROKER_URL` value.
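As a sketch, the relevant local `.env` values from these steps might look like the following, using the default URLs above (adjust the Redis database number if needed; `CELERY_BROKER_URL` is commented out because the CloudAMQP value is the default broker):

```
REDIS_URL="redis://127.0.0.1:6379/0"
RESULT_BACKEND="redis://127.0.0.1:6379/0"
CLOUDAMQP_URL="amqp://guest:guest@127.0.0.1:5672"
# CELERY_BROKER_URL="..."   # only if you want a broker other than the CloudAMQP URL
```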
Note: in a local environment, it tends to be fine to use Redis in place of RabbitMQ, but this does not work with Heroku's free Redis plan.

Note: if the application changes its task structure and Celery tries to run old tasks, run the `celery purge` command from within the application's virtualenv.
This application should be deployed to Heroku. If you are creating a new Heroku application, clone this repository with `git clone https://github.com/MinnPost/minnpost-scraper-mn-election-results.git` and follow Heroku's instructions to create a Heroku remote.
Add the Heroku Postgres add-on to the Heroku application. The amount of data that this scraper uses will require at least the Hobby Basic plan. Heroku allows two applications to share the same database; they provide instructions for this.
To get the data into the database, you can import it into Heroku, either from the included `election-scraper-structure.sql` file or from your local database once it has data in it.
If you want to create an empty installation of the Flask database structure, or if the database structure changes and the changes need to be added to Heroku, run `heroku run flask db upgrade`. Flask's migration system will create all of the tables and relationships.

Run the scraper commands from the section below by following Heroku's instructions for running Python commands. Generally, run commands on Heroku by adding `heroku run` before the rest of the command listed below.
Once the application is deployed to Heroku, Celery will be ready to run. To enable it, run the command `heroku ps:scale worker=1`. See Heroku's Celery deployment documentation. To run the worker dyno as well, Heroku needs to be on a non-free plan.
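For example, a typical post-deploy sequence might look like the following sketch; `<scraper command>` is a placeholder for whichever command from the scraper section you want to run:

```bash
heroku run flask db upgrade      # create or update the database structure
heroku ps:scale worker=1         # enable the Celery worker dyno
heroku run <scraper command>     # prefix scraper commands with `heroku run`
```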
Note: if the application changes its task structure and Celery tries to run old tasks, run the `celery purge` command from within the application's virtualenv.
In the resources section of the Heroku application, add the Heroku Data for Redis and CloudAMQP add-ons. Unless we learn otherwise, CloudAMQP should be able to use the free plan, while Heroku Data for Redis should be able to use the cheapest non-free plan.
Redis is used for caching data for the front end, and as the backend for Celery tasks. RabbitMQ is used as the broker for Celery tasks.
This application runs several tasks to scrape data from all of the data sources in the background. Whenever a scraper task runs, it will clear any cached data related to that task; for example, the results scraper will clear any cached result queries. This is designed to keep the application from displaying cached data that is older than the newest scraped data.
While the scraper's tasks can be run manually, they are designed primarily to run automatically at intervals, which are configurable within the application's settings.
The default scrape behavior is to run these scraper tasks based on the `DEFAULT_SCRAPE_FREQUENCY` configuration value (which is stored in seconds and defaults to `86400` seconds, or one day):
- `areas`: the areas for which elections happen: counties, wards, precincts, school board districts, etc.
- `elections`: the distinct election periods. For example, the 2022 primary election.
- `contests`: the distinct electoral contests. For example, the 2022 governor's race.
- `questions`: ballot questions.
- `results`: the results of an election that has occurred.

The default behavior is primarily designed to structure the data before an election occurs, although it may also catch changes when results are finalized.
There are multiple ways that the application can run the `results` task much more frequently. This is designed to detect the status of contests as results come in, for example on election night, whether all the results are in or not.
To set an election return window by configuration values, use the `ELECTION_DAY_RESULT_HOURS_START` and `ELECTION_DAY_RESULT_HOURS_END` settings. Both of these values should be stored as a full datetime string such as `"2022-08-23T00:00:00-0600"`.
If the application detects that the current time is between these start and end values, it will run the `results` task based on the `ELECTION_DAY_RESULT_SCRAPE_FREQUENCY` configuration value, which is stored in seconds. See the `.env-example` and `config.py` files for how this value is set.
If the `ELECTION_DAY_RESULT_HOURS_START` and `ELECTION_DAY_RESULT_HOURS_END` settings are not filled out, the application will look to the election data in the `scraper_sources.json` file. Each entry should have a `date` value, and the application will assume that date is the election date. From there, the application will use the `ELECTION_DAY_RESULT_DEFAULT_START_TIME` (midnight by default) and `ELECTION_DAY_RESULT_DEFAULT_DURATION_HOURS` (48 hours by default) values to determine start and end values for election day behavior.
If the application detects that the current time is between these start and end values (for example, between 8pm on election day and 8pm the following day), it will run the `results` task based on the `ELECTION_DAY_RESULT_SCRAPE_FREQUENCY` configuration value, which is stored in seconds. It defaults to running every `180` seconds, or three minutes.
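As a sketch, an explicit election-night window in `.env` might look like the following; the start value is the example datetime shown above, the end value is simply 48 hours later to mirror the default duration, and the two frequencies are the defaults described in this section:

```
DEFAULT_SCRAPE_FREQUENCY="86400"
ELECTION_DAY_RESULT_HOURS_START="2022-08-23T00:00:00-0600"
ELECTION_DAY_RESULT_HOURS_END="2022-08-25T00:00:00-0600"
ELECTION_DAY_RESULT_SCRAPE_FREQUENCY="180"
```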
This window detection behavior can be overridden by setting the `ELECTION_RESULT_DATETIME_OVERRIDDEN` configuration value. If it is set to `"true"`, the `results` task will run according to the `ELECTION_DAY_RESULT_SCRAPE_FREQUENCY` value, regardless of what day it is. If it is set to `"false"`, the `results` task will run according to the `DEFAULT_SCRAPE_FREQUENCY` value, regardless of what day it is. Don't use either value in `ELECTION_RESULT_DATETIME_OVERRIDDEN` unless the current behavior specifically needs to be overridden; remove the setting after the override is no longer necessary.
To run the scraper in a browser, use the following URLs:
Note: `ELECTION_DATE_OVERRIDE` is an optional override configuration value that can be added to `.env`. The newest election will be used if it is not provided. If an override is necessary, the value should be the key of the object in the `scraper_sources.json` file; for instance, `20140812`.
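For instance, to force the scraper to use the 2014 primary entry shown earlier, the `.env` override would be:

```
ELECTION_DATE_OVERRIDE="20140812"
```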
The scraper URLs can receive parameters that limit what the various endpoints scrape. Each endpoint, unless otherwise noted, can receive data in `GET`, `POST`, and JSON formats. Unless otherwise noted, all scraper endpoints accept an optional `election_id` parameter. For example: https://minnpost-mn-election-results.herokuapp.com/scraper/areas/?election_id=id-20211102.
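As a sketch of the three accepted formats, the same areas request could be made from the command line like this, using the example URL above:

```bash
# GET parameter
curl "https://minnpost-mn-election-results.herokuapp.com/scraper/areas/?election_id=id-20211102"

# POST form data
curl -X POST -d "election_id=id-20211102" \
  "https://minnpost-mn-election-results.herokuapp.com/scraper/areas/"

# JSON body
curl -X POST -H "Content-Type: application/json" \
  -d '{"election_id": "id-20211102"}' \
  "https://minnpost-mn-election-results.herokuapp.com/scraper/areas/"
```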
This part is not done. Ideally, there would be command line equivalents of the scraper URLs. Previously, these commands were called:

    python code/scraper.py scrape areas <ELECTION_DATE>
    python code/scraper.py scrape questions <ELECTION_DATE>
    python code/scraper.py scrape match_contests <ELECTION_DATE>
    python code/scraper.py scrape results <ELECTION_DATE>
The application's API returns the most recent data, in JSON format, that has been stored by the scraper tasks. Once an API endpoint has been requested, its data is cached based on the API settings and is returned by the application until either the relevant scraper task runs again or the cache expires. The cache's default expiration is stored, in seconds, in the `CACHE_DEFAULT_TIMEOUT` configuration value. There is a separate value for the Google Sheets to JSON API's timeout, which is stored (also in seconds) in the `PARSER_API_CACHE_TIMEOUT` configuration value.
To access the scraper's data, use the following URLs. These URLs will return all of the contents of the respective models:
The API endpoints can receive parameters that limit what is returned. Each endpoint, unless otherwise noted, can receive data in `GET`, `POST`, and JSON formats.

Unless otherwise noted, all API endpoints can receive parameters with a "true" or "false" value to control cache behavior: `bypass_cache`, `delete_cache`, and `cache_data`.
- `bypass_cache`: whether to bypass the cache and load fresh data. Defaults to "false".
- `delete_cache`: whether to delete existing cached data for this request. Defaults to "false".
- `cache_data`: whether to cache this request's response. Defaults to "true".

The query endpoint returns the result of a valid `select` SQL query. For example, to run the query `select * from meta`, use the URL https://minnpost-mn-election-results.herokuapp.com/api/query/?q=select%20*%20from%20meta. This endpoint currently runs the legacy election dashboard on MinnPost, although ideally we will be able to replace it with proper calls to the SQLAlchemy models.
This endpoint also accepts a `callback` parameter. If it is present, the endpoint returns the data as JavaScript instead of JSON, for use as JSONP. This is needed for the legacy election dashboard on MinnPost.
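As a sketch, both forms of the query request might look like this; `handleMeta` is just a hypothetical name for the JSONP callback function, and `bypass_cache` is one of the optional cache parameters described above:

```bash
# plain JSON response
curl "https://minnpost-mn-election-results.herokuapp.com/api/query/?q=select%20*%20from%20meta"

# JSONP response, skipping the cache
curl "https://minnpost-mn-election-results.herokuapp.com/api/query/?q=select%20*%20from%20meta&callback=handleMeta&bypass_cache=true"
```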
The Areas endpoint can receive `area_id`, `area_group`, and `election_id` parameters.
The Contests and Contest Boundaries endpoints can both receive `title`, `contest_id`, `contest_ids` (for multiple contests), `election_id`, and `address` parameters.
Note: for `address` to work, there needs to be a valid MapQuest API key in the `GEOCODER_MAPQUEST_KEY` configuration value, as shown in `.env-example`.
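As a loose sketch of an address lookup: the Contests endpoint path is not listed in this document, so the `/api/contests/` path below is an assumption, and the address is only a placeholder you'd replace with a URL-encoded address:

```bash
# Assumed path /api/contests/ (not confirmed above); requires GEOCODER_MAPQUEST_KEY for geocoding
curl "https://minnpost-mn-election-results.herokuapp.com/api/contests/?address=<url-encoded-address>"
```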
The Elections endpoint can receive `election_id` and `election_date` parameters.
The Questions endpoint can receive `question_id`, `contest_id`, and `election_id` parameters.
The Results endpoint can receive `result_id`, `contest_id`, and `election_id` parameters.