edgi-govdata-archiving / web-monitoring-db

An HTTP API for tracking and annotating changes to a set of web pages.
https://api.monitoring.envirodatagov.org/
GNU General Public License v3.0

Add support for archiving DB to SQLite #1104

Open Mr0grog opened 1 year ago

Mr0grog commented 1 year ago

⚠️ Work in progress! ⚠️

This adds a rake command that exports the contents of the DB to a SQLite file for public archiving. It's mostly a straightforward copy of every table and row, with three exceptions: we skip tables that are irrelevant to a public data set (administrative things like GoodJob tables, users, imports, etc.), drop columns that contain user data, and do some basic conversions.
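To make the shape of the export concrete, here is a rough sketch of the skip-tables / drop-columns copy described above. It is not the rake task from this PR: it is written in Python, it uses a SQLite connection as a stand-in for the real source database, and `SKIP_TABLES`, `DROP_COLUMNS`, `export_public_copy`, and the table/column names in them are all hypothetical, chosen only for illustration. It also omits the conversions step entirely.

```python
import sqlite3

# Hypothetical names for illustration only; the real lists live in the rake task.
SKIP_TABLES = {"users", "imports", "good_jobs"}
DROP_COLUMNS = {"annotations": {"author_id"}}


def export_public_copy(source, dest):
    """Copy every non-skipped table from `source` to `dest`, minus dropped columns.

    Both arguments are sqlite3 connections. Using SQLite as the source is an
    assumption made so the sketch runs on its own.
    """
    tables = [
        row[0]
        for row in source.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    copied = []
    for table in tables:
        if table in SKIP_TABLES:
            continue  # Administrative table: not part of the public data set.
        dropped = DROP_COLUMNS.get(table, set())
        # Column 1 of each PRAGMA table_info row is the column name.
        columns = [
            info[1]
            for info in source.execute(f'PRAGMA table_info("{table}")')
            if info[1] not in dropped
        ]
        column_list = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        # Untyped columns are valid in SQLite; a real export would carry types over.
        dest.execute(f'CREATE TABLE "{table}" ({column_list})')
        rows = source.execute(f'SELECT {column_list} FROM "{table}"')
        dest.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
        copied.append(table)
    dest.commit()
    return copied
```

Keeping the skip and drop lists as plain data makes it easy to review exactly what is left out of the public archive.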

Part of edgi-govdata-archiving/web-monitoring#170

For changes/annotations, we probably want to export only the relevant annotations, like the ones for important changes, along with the changes they apply to. Before doing that, we need to make sure all of those annotations are actually in the DB (see https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/main/web_monitoring/cli/annotations_import.py).
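One possible shape for that selection, again as a Python sketch against a stand-in SQLite connection rather than the real DB. How "relevant" gets decided is still open here, so the `significance` column, the `min_significance` threshold, the function name `select_relevant`, and the table layout are all assumptions for illustration.

```python
import sqlite3


def select_relevant(source, min_significance=0.5):
    """Return (annotations, changes) for annotations at or above a threshold.

    The `significance` column and the threshold are hypothetical stand-ins
    for whatever ends up defining an "important" change.
    """
    annotations = source.execute(
        "SELECT uuid, change_uuid, significance FROM annotations "
        "WHERE significance >= ? ORDER BY uuid",
        (min_significance,),
    ).fetchall()
    # Only the changes that the selected annotations point at get exported.
    change_ids = sorted({row[1] for row in annotations})
    if not change_ids:
        return annotations, []
    # Fine for a sketch; a very large ID list would need batching because
    # SQLite caps the number of bound parameters per statement.
    placeholders = ", ".join("?" for _ in change_ids)
    changes = source.execute(
        f"SELECT uuid, uuid_from, uuid_to FROM changes "
        f"WHERE uuid IN ({placeholders}) ORDER BY uuid",
        change_ids,
    ).fetchall()
    return annotations, changes
```

Selecting annotations first and deriving the change list from them means no change ends up in the archive without at least one annotation explaining why it is there.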