developer-docs: https://act-now-coalition.github.io/can-scrapers/index.html
To set up a local development environment:

1. Install `conda` (either Anaconda or Miniconda)
2. Create a conda environment: `conda create -n can-scrapers python=3.10`
3. Activate the environment: `conda activate can-scrapers`
4. Move into the `can-scrapers` directory
5. Install fiona: `conda install fiona`
6. Install the package in editable mode: `pip install -e .`
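After the editable install, a quick way to confirm the environment is wired up is to import the package (this assumes nothing beyond the steps above):

```sh
python -c "import can_tools; print(can_tools.__file__)"
```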
Our production database is an instance of PostgreSQL on Google Cloud SQL. All of our SQL setup and interactions happen through SQLAlchemy, which is (mostly) database-engine agnostic.
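Because everything goes through SQLAlchemy, the same code can be pointed at SQLite for local work or at PostgreSQL in production just by changing the connection URI. The URIs below are placeholders, not real credentials:

```python
from sqlalchemy import create_engine

# local, throwaway database (no server required)
sqlite_engine = create_engine("sqlite:///:memory:")

# production-style PostgreSQL connection (placeholder URI; requires a
# driver such as psycopg2 to be installed)
postgres_engine = create_engine("postgresql://user:password@localhost:5432/can")
```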
For running integration tests locally there are two options:

1. Run `pytest` from the local directory (see `training/training.org` for more info on pytest options).
2. Set the environment variable `CAN_PG_CONN_STR` to a proper PostgreSQL connection URI before running `pytest`. Again, see `training/training.org` for more info.
If you would like to work interactively in an IPython session or Jupyter notebook, you can use the function `can_tools.models.create_dev_engine` to set up an in-memory SQLite instance. Below is a code snippet that sets this up and then runs the Florida scraper, inserting its data into the database:
```python
from can_tools.models import create_dev_engine
from can_tools.scrapers import Florida

# set up an in-memory SQLite database
engine, Session = create_dev_engine()

# fetch, normalize, and insert the Florida data
scraper = Florida()
df = scraper.normalize(scraper.fetch())
scraper.put(engine, df)
```
Note that by default the `create_dev_engine` routine will construct the database in a verbose mode where all SQL commands are echoed to the console. We find that this is helpful while debugging and developing. This can be disabled by passing `verbose=False` when calling `create_dev_engine`.
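For a quieter session:

```python
from can_tools.models import create_dev_engine

# same in-memory setup, but without echoing SQL to the console
engine, Session = create_dev_engine(verbose=False)
```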
Steps to set up VS Code:

1. Install the `python` and `pylance` Visual Studio Code extensions
2. Select the `can-scrapers` conda environment as the workspace interpreter

Please do not push any changes made to the `.vscode` directory. It contains some shared settings, but it will also be overwritten with the absolute path to the conda environment on your machine, and that path is unlikely to match the path for any other team member.
The scrapers in this repository are organized in the `can_tools` Python package. All scrapers are written in the `can_tools/scrapers` directory. If the resource to be scraped comes from an official source (like a government web page or health department), then the scraper goes into `can_tools/scrapers/official`. Inside the `official` sub-directory there are many folders, each named with the two-letter abbreviation for a state. For example, scrapers that extract data from the North Carolina Department of Health are in `can_tools/scrapers/official/NC`.
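Putting that together, the relevant part of the tree looks roughly like this (only two state folders are shown; the real directory has many more):

```
can_tools/
└── scrapers/
    ├── official/
    │   ├── FL/
    │   ├── NC/
    │   └── ...
    └── ...
```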
Behind the scenes of every scraper written in `can-tools` are abstract base classes (ABCs). These ABCs define the abstract methods `fetch`, `normalize`, and `put`, which must be implemented in order to create a scraper.

- The `fetch` method is responsible for making network requests. It should request the remote resource and do as little else as possible. When the resource is a csv or json file, it is OK to use `pd.read_XXX` as the body of the fetch method. In other cases the output of `fetch` might be a `requests.Response` object, or something else entirely.
- The `normalize` method should transform the output of `fetch` into scraped data and return a DataFrame with columns `(vintage, dt, location, category, measurement, unit, age, race, ethnicity, sex, value)`. See existing methods for examples.
- The `put` method takes a SQL connection and a DataFrame, and puts the DataFrame into the SQL database. This is taken care of by parent classes and does not need to be written manually; most scrapers will not need a custom `put` method because the generic implementation can dump the data into the database.
All scrapers must inherit from `DatasetBase`, but this typically happens by subclassing a resource-specific parent class like `TableauDashboard` or `ArcGIS`.
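As a rough sketch of how those pieces fit together, a new scraper might look like the skeleton below. The class name, source URL, and the exact import path for `DatasetBase` are illustrative assumptions, not part of the real codebase; in practice you would subclass one of the resource-specific parents and copy patterns from the existing scrapers in `can_tools/scrapers/official/`.

```python
import pandas as pd

from can_tools.scrapers.base import DatasetBase  # exact import path is an assumption


class ExampleState(DatasetBase):
    """Hypothetical scraper for an official state CSV feed."""

    source = "https://example.gov/covid-data.csv"  # placeholder URL

    def fetch(self) -> pd.DataFrame:
        # Request the remote resource and do as little else as possible.
        # For a csv resource, pd.read_csv can be the whole body.
        return pd.read_csv(self.source)

    def normalize(self, data: pd.DataFrame) -> pd.DataFrame:
        # Reshape the raw data into the long format described above, with
        # columns (vintage, dt, location, category, measurement, unit,
        # age, race, ethnicity, sex, value). Details omitted here.
        raise NotImplementedError("see existing scrapers for real examples")

    # `put` is provided by the parent classes and normally does not need
    # to be overridden.
```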
When a pull request is created, various tests are run against the code. Occasionally, you might need to trigger a re-test without actually changing the code in your scraper. To achieve that, you can make an empty commit and push it, causing the `on: pull_request` checks to run again.
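One way to do that from the command line (the commit message is just an example):

```sh
# create an empty commit and push it to re-run the pull request checks
git commit --allow-empty -m "re-run CI checks"
git push
```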