The problem is that the seed scripts are frozen in time: they reflect how the site was first created and how the archival data was migrated in. They haven't really kept up with the site's evolution, and definitely not with the site's production data.
Really what we need is a (nightly?) job that creates a db dump from production, strips it of non-public data, and publishes it to an S3 bucket. Then the seed script could download the latest version of that dump and import it.
- [ ] Create a nightly cron job on Heroku to create a db dump of the production db
- [ ] Save that db dump to an S3 bucket on our S3 account, with the permissions set to world: read.
- [ ] Create a nightly cron job on Heroku (a couple of hours later) to import the production db dump into the staging server's db.
- [ ] Delete all users from the db dump
- [ ] Create a test user in the db dump with a username of `test` and a password of `test` (will have to `validate: false` when saving that user).
- [ ] Delete everything whose publication status is anything but published, i.e. drafts. (Articles, Pages, Zines, etc… anything with a `publication_status` column in its table.)
- [ ] Update the seed script to download and import that published db dump, instead of loading from the files in `db/seeds`
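Here's a rough sketch of what the scrub-and-publish job could look like as a rake task that Heroku Scheduler runs nightly. All of it is illustrative: the `SOURCE_DATABASE_URL` / `SCRATCH_DATABASE_URL` / `SEED_DUMP_BUCKET` env vars, the model list, and the `username` / `password` attributes on `User` are placeholders, not necessarily how our schema actually looks.

```ruby
# lib/tasks/seed_dump.rake — sketch only; env var names, models, and attributes are placeholders.
require "aws-sdk-s3"

namespace :seed_dump do
  desc "Dump production, scrub non-public data, publish the result to S3"
  task publish: :environment do
    source  = ENV.fetch("SOURCE_DATABASE_URL")   # production (or a follower)
    scratch = ENV.fetch("SCRATCH_DATABASE_URL")  # throwaway db we're free to overwrite
    bucket  = ENV.fetch("SEED_DUMP_BUCKET")
    dump    = "/tmp/seed.dump"

    # 1. Copy production into a scratch database so we never mutate production itself.
    system("pg_dump --format=custom --no-owner --no-acl -f #{dump} #{source}") or abort("pg_dump failed")
    system("pg_restore --clean --if-exists --no-owner --no-acl -d #{scratch} #{dump}") or abort("pg_restore failed")

    # 2. Scrub the copy using our own models, pointed at the scratch db.
    ActiveRecord::Base.establish_connection(scratch)
    User.delete_all
    [Article, Page, Zine].each do |model|  # any model with a publication_status column
      model.where.not(publication_status: "published").delete_all
    end
    User.new(username: "test", password: "test").save!(validate: false)

    # 3. Re-dump the scrubbed copy and publish it world-readable.
    system("pg_dump --format=custom --no-owner --no-acl -f #{dump} #{scratch}") or abort("pg_dump failed")
    Aws::S3::Resource.new.bucket(bucket).object("seeds/latest.dump")
      .upload_file(dump, acl: "public-read")
  end
end
```

Heroku Scheduler could then run `rake seed_dump:publish` nightly, and a second scheduled job on the staging app could restore the published dump a couple of hours later.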
If it's easier / better to export the db data to a big JSON file (or a few JSON files, e.g. one per model), instead of a proper pg dump, that's ok too. Though I think if the pg dump can work, there are a bunch of efficiencies that we gain for free, namely Heroku db imports on new apps and on staging.
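Assuming the pg dump route works out, the import side could stay pretty small: the seed script just downloads the latest dump and shells out to pg_restore. Again, just a sketch; `SEED_DUMP_URL` and the bucket URL below are placeholders.

```ruby
# db/seeds.rb — sketch only; SEED_DUMP_URL is a placeholder for the public S3 URL.
require "open-uri"

dump_url  = ENV.fetch("SEED_DUMP_URL", "https://our-bucket.s3.amazonaws.com/seeds/latest.dump")
dump_path = Rails.root.join("tmp", "seed.dump").to_s

# Download the latest scrubbed dump published by the nightly job.
IO.copy_stream(URI.open(dump_url), dump_path)

# Restore it over the current database. DATABASE_URL is set on Heroku;
# locally it would need to point at the development database.
system("pg_restore --clean --if-exists --no-owner --no-acl -d #{ENV.fetch('DATABASE_URL')} #{dump_path}") or
  abort("pg_restore failed")
```

That would keep `rails db:seed` as the single entry point for developers, review apps, and staging alike.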
Use cases for this seed data:
- [ ] Developers, who want local seed data that matches production data
- [ ] Heroku CI, when a review app is created for a new pull request on GitHub
- [ ] Staging: if this existed, we could create a nightly cron job to refresh staging's data to match a scrubbed version of production