Metro-Records / la-metro-dashboard

An Airflow-based dashboard for LA Metro

Refactor scripts to update both staging and production #23

Closed · jeancochrane closed this issue 4 years ago

jeancochrane commented 4 years ago

To update the staging and production databases at the same time, we'll need to adjust the scripts that power the scraping steps. In addition, our processing DAGs assume that the DjangoOperator configures only one database; we'll need a way to point those tasks at both databases.
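
Roughly, I'm imagining something like the following on the DAG side. This is only a sketch: it uses a plain BashOperator as a stand-in for our DjangoOperator, and the connection strings, DAG name, and refresh_guid command are placeholders. The idea is just to build one task per target database, identical except for the DATABASE_URL the Django command sees.

import os
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Placeholder connection strings; in practice these would come from the
# deployment's configuration rather than being hard-coded in the DAG file.
DATABASES = {
    'production': 'postgis:///lametro',
    'staging': 'postgis:///lametro_staging',
}

dag = DAG(
    'hourly_processing',
    start_date=datetime(2020, 1, 1),
    schedule_interval=timedelta(hours=1),
)

for name, database_url in DATABASES.items():
    # One task per target database; the tasks differ only in which
    # DATABASE_URL the management command runs against.
    BashOperator(
        task_id=f'refresh_guid_{name}',
        bash_command='python manage.py refresh_guid',
        env={**os.environ, 'DATABASE_URL': database_url},
        dag=dag,
    )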

hancush commented 4 years ago

Thinking more about this, I don't think I want to do this for all of our scraping DAGs. Once we have a production Airflow instance, we only need to scrape/populate the staging database once: during the nightly scrape.

The main reason I wanted to populate both databases at once was to avoid duplicating the full scrape, which makes thousands of calls.

IMO, the easiest way to achieve this would be to turn off all scraping DAGs on the staging Airflow instance and add a line to the full scrape script that populates the staging database, so that staging is updated by the production Airflow instance. We could even make the staging import conditional on being in the production environment, so that running the full scrape on the staging site doesn't touch the production database.

# on production, DATABASE_URL is postgis:///...lametro
# on staging, DATABASE_URL is postgis:///...lametro_staging
#
# this approach would update the staging db and skip the second import on staging,
# and import to both databases in production
# ('lametro' stands in for the scraper module; pupa imports into whatever
# database the DATABASE_URL environment variable points to)
pupa update lametro --scrape
pupa update lametro --import

# on production only, run a second import pointed at the staging database
# (ENVIRONMENT stands in for however the script detects the deployment)
if [ "$ENVIRONMENT" = "production" ]; then
    DATABASE_URL="$DATABASE_URL_STAGING" pupa update lametro --import
fi