Mr0grog opened this issue 1 year ago
Quick updates:

- Some version records in the database have no archived response bodies from Versionista (these are from the very first stages of the project, when it was only meant to be a queryable index into Versionista, not an archive or backup).
- Re: combining content-addressed data into larger files, here are some stats on grouping by different length prefixes:
| prefix length | groups | files per group (min) | files per group (avg) | files per group (max) | group size (min) | group size (avg) | group size (max) |
|---|---|---|---|---|---|---|---|
| 2 | 256 | 52,316 | 52,816 | 53,399 | 3,409,539.55 kB | 3,484,415.56 kB | 3,629,627.39 kB |
| 3 | 4,096 | 3,102 | 3,301 | 3,487 | 198,391.51 kB | 217,775.97 kB | 404,263.71 kB |
| 4 | 65,536 | 142 | 206 | 267 | 8,900.81 kB | 13,611.00 kB | 192,541.24 kB |
| 5 | 1,048,576 | 1 | 13 | 34 | 0.92 kB | 850.69 kB | 178,881.65 kB |
Note that this doesn’t account for how big the files will be after compression (a conservative guess is 25%-50% of the bytes listed in the table).

I think that makes 3 a good prefix length (large but manageable-size files, and not too many of them, though still a lot). 2 might also be reasonable, depending on what we see for typical compression ratios (I think we should avoid files > 1 GB).
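For anyone curious, the table above amounts to grouping files by the first few characters of their content hash and summing counts and sizes per group. Here's a minimal sketch of that calculation, not the actual tooling: it assumes the content-addressed response bodies have been synced to a local directory and that each filename starts with the content hash (both assumptions for illustration; the real data lives in S3), and `prefix_group_stats` is a hypothetical helper name.

```python
# Sketch: per-prefix-group stats over a directory of content-addressed files.
# Assumes filenames start with the content hash -- illustration only.
import os
import statistics
from collections import defaultdict

def prefix_group_stats(directory: str, prefix_length: int = 3) -> dict:
    counts = defaultdict(int)  # prefix -> number of files in that group
    sizes = defaultdict(int)   # prefix -> total bytes in that group

    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            prefix = entry.name[:prefix_length]
            counts[prefix] += 1
            sizes[prefix] += entry.stat().st_size

    group_counts = list(counts.values())
    group_sizes = list(sizes.values())
    return {
        "groups": len(counts),
        "count_min": min(group_counts),
        "count_avg": statistics.mean(group_counts),
        "count_max": max(group_counts),
        "bytes_min": min(group_sizes),
        "bytes_avg": statistics.mean(group_sizes),
        "bytes_max": max(group_sizes),
    }

if __name__ == "__main__":
    print(prefix_group_stats("./response-bodies", prefix_length=3))
```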
Viable formats:

- `.tar.gz`. Good-ish compression.

Added some preliminary tooling for exporting the DB as a SQLite file at edgi-govdata-archiving/web-monitoring-db#1104. It's gonna be big (not sure how much, but my relatively puny local test DB is 46 MB raw, 5.5 MB gzipped), but this approach probably keeps it the most explorable for researchers. (Other alternatives here include gzipped NDJSON files, Parquet, Feather, or CSV [worst option IMO].)
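To make the gzipped-NDJSON alternative concrete, here's a rough sketch of what a per-table export could look like. This is not the tooling in #1104; the DSN, the `versions` table name, and the `export_table_to_ndjson_gz` helper are placeholders, not the real schema.

```python
# Sketch: stream a Postgres table out as gzipped NDJSON (one JSON object per line).
import gzip
import json

import psycopg2
from psycopg2.extras import RealDictCursor

def export_table_to_ndjson_gz(dsn: str, table: str, out_path: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        # Named cursor = server-side cursor, so rows stream in batches
        # instead of loading the whole table into memory.
        with conn.cursor(name="export", cursor_factory=RealDictCursor) as cur, \
                gzip.open(out_path, "wt", encoding="utf-8") as out:
            cur.itersize = 10_000
            # Fine for a sketch; a real export should not interpolate table names.
            cur.execute(f"SELECT * FROM {table}")
            for row in cur:
                # default=str crudely handles datetimes, UUIDs, etc.
                out.write(json.dumps(row, default=str) + "\n")
    finally:
        conn.close()

if __name__ == "__main__":
    export_table_to_ndjson_gz(
        "postgresql://localhost/web_monitoring_db",  # placeholder DSN
        "versions",                                  # placeholder table name
        "versions.ndjson.gz",
    )
```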
In #168, we ramped down to barebones maintenance and minimized what services we were running in production. That’s served the project well for the first half of 2023, but funding is drying up and it’s now time to shut things down entirely.
This does not apply to two subprojects that are actively used outside EDGI:
To Do:
[x] Stop the daily IA import cron job.
[x] Stop the daily IA healthcheck cron job (that checks whether our capturer over at IA is still running and capturing the URLs we care about) since it is no longer relevant.
[x] Make DB API read-only, shut down import worker.
[ ] Investigate methods for archiving existing data. We have metadata about pages & versions (archived snapshots of a URL) in a Postgres database, raw response bodies in S3, and analyst reviews of changes in Google Sheets (not sure if we want to archive these or not). A rough sketch of one option for the S3 response bodies is at the end of this list.
[ ] Archive the data somewhere.
[ ] Replace https://monitoring.envirodatagov.org/ and https://api.monitoring.envirodatagov.org/ with a tombstone page describing the project and its current status, where to find archives if publicly available, etc.
[ ] Shut down all running services and resources in AWS.
[ ] Clean up dangling, irrelevant issues and PRs in all repos. PRs should generally be closed. I like to keep open any issues that someone forking the project might want to address, but close others that would not be relevant in that context.
[ ] Update maintenance status notices if needed on repo READMEs.
[ ] Archive all relevant repos.
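Re: the data-archiving items above, here's a rough sketch (assumptions only, not the actual archive tooling) of one way to pull the raw response bodies out of S3 and pack them into per-prefix `.tar.gz` bundles, matching the 3-character-prefix grouping discussed earlier. The bucket name, the assumption that object keys are the content hashes, and the `bundle_prefix` helper are all placeholders.

```python
# Sketch: bundle all S3 objects under a given key prefix into one .tar.gz file.
import io
import tarfile

import boto3

BUCKET = "example-response-bodies"  # placeholder bucket name
PREFIX_LENGTH = 3

def bundle_prefix(s3, prefix: str, out_path: str) -> None:
    """Download every object whose key starts with `prefix` into one .tar.gz."""
    paginator = s3.get_paginator("list_objects_v2")
    with tarfile.open(out_path, "w:gz") as tar:
        for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
                info = tarfile.TarInfo(name=obj["Key"])
                info.size = len(body)
                tar.addfile(info, io.BytesIO(body))

if __name__ == "__main__":
    s3 = boto3.client("s3")
    # e.g. bundle everything whose hash starts with "a1b" into a1b.tar.gz
    bundle_prefix(s3, "a1b", "a1b.tar.gz")
```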