Mr0grog opened this issue 1 year ago
Quick updates:

- Some version records in the database have no archived response bodies from Versionista (these are from the very first stages of the project, when it was only meant to be a queryable index into Versionista, not an archive or backup).
- Re: combining content-addressed data into larger files, here are some stats on grouping by different length prefixes:
| prefix length | groups | files per group (min) | files per group (avg) | files per group (max) | group size (min) | group size (avg) | group size (max) |
|---|---|---|---|---|---|---|---|
| 2 | 256 | 52,316 | 52,816 | 53,399 | 3,409,539.55 kB | 3,484,415.56 kB | 3,629,627.39 kB |
| 3 | 4,096 | 3,102 | 3,301 | 3,487 | 198,391.51 kB | 217,775.97 kB | 404,263.71 kB |
| 4 | 65,536 | 142 | 206 | 267 | 8,900.81 kB | 13,611.00 kB | 192,541.24 kB |
| 5 | 1,048,576 | 1 | 13 | 34 | 0.92 kB | 850.69 kB | 178,881.65 kB |
Note that this doesn’t account for how big the files will be after compression (a conservative guess is 25%-50% of the bytes listed in the table).

I think that makes 3 a good prefix length (large but manageable-size files, and not too many of them, though still a lot). 2 might also be reasonable, depending on what we see for typical compression ratios (I think we should avoid files > 1 GB).
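For anyone curious, the table above amounts to grouping files by the first few characters of their content hash and summing counts and sizes per group. Here's a minimal sketch of that calculation, not the actual tooling: it assumes the content-addressed response bodies have been synced to a local directory and that each filename starts with the content hash (both assumptions for illustration; the real data lives in S3), and `prefix_group_stats` is a hypothetical helper name.

```python
# Sketch: per-prefix-group stats over a directory of content-addressed files.
# Assumes filenames start with the content hash -- illustration only.
import os
import statistics
from collections import defaultdict

def prefix_group_stats(directory: str, prefix_length: int = 3) -> dict:
    counts = defaultdict(int)  # prefix -> number of files in that group
    sizes = defaultdict(int)   # prefix -> total bytes in that group

    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            prefix = entry.name[:prefix_length]
            counts[prefix] += 1
            sizes[prefix] += entry.stat().st_size

    group_counts = list(counts.values())
    group_sizes = list(sizes.values())
    return {
        "groups": len(counts),
        "count_min": min(group_counts),
        "count_avg": statistics.mean(group_counts),
        "count_max": max(group_counts),
        "bytes_min": min(group_sizes),
        "bytes_avg": statistics.mean(group_sizes),
        "bytes_max": max(group_sizes),
    }

if __name__ == "__main__":
    print(prefix_group_stats("./response-bodies", prefix_length=3))
```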
Viable formats:

- `.tar.gz`. Good-ish compression.

Added some preliminary tooling for exporting the DB as a SQLite file at edgi-govdata-archiving/web-monitoring-db#1104. It's gonna be big (not sure how much, but my relatively puny local test DB is 46 MB raw, 5.5 MB gzipped), but this approach probably keeps it the most explorable for researchers. (Other alternatives here include gzipped NDJSON files, Parquet, Feather, or CSV [worst option IMO].)
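To make the gzipped-NDJSON alternative concrete, here's a rough sketch of what a per-table export could look like. This is not the tooling in #1104; the DSN, the `versions` table name, and the `export_table_to_ndjson_gz` helper are placeholders, not the real schema.

```python
# Sketch: stream a Postgres table out as gzipped NDJSON (one JSON object per line).
import gzip
import json

import psycopg2
from psycopg2.extras import RealDictCursor

def export_table_to_ndjson_gz(dsn: str, table: str, out_path: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        # Named cursor = server-side cursor, so rows stream in batches
        # instead of loading the whole table into memory.
        with conn.cursor(name="export", cursor_factory=RealDictCursor) as cur, \
                gzip.open(out_path, "wt", encoding="utf-8") as out:
            cur.itersize = 10_000
            # Fine for a sketch; a real export should not interpolate table names.
            cur.execute(f"SELECT * FROM {table}")
            for row in cur:
                # default=str crudely handles datetimes, UUIDs, etc.
                out.write(json.dumps(row, default=str) + "\n")
    finally:
        conn.close()

if __name__ == "__main__":
    export_table_to_ndjson_gz(
        "postgresql://localhost/web_monitoring_db",  # placeholder DSN
        "versions",                                  # placeholder table name
        "versions.ndjson.gz",
    )
```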
In #168, we ramped down to barebones maintenance and minimized what services we were running in production. That’s served the project well for the first half of 2023, but funding is drying up and it’s now time to shut things down entirely.
This does not apply to two subprojects that are actively used outside EDGI:
To Do:
[x] Stop the daily IA import cron job.
[x] Stop the daily IA healthcheck cron job (that checks whether our capturer over at IA is still running and capturing the URLs we care about) since it is no longer relevant.
[x] Make DB API read-only, shut down import worker.
[ ] Investigate methods for archiving existing data. We have metadata about pages & versions (archived snapshots of a URL) in a Postgres database, raw response bodies in S3, and analyst reviews of changes in Google Sheets (not sure if we want to archive these or not). A rough sketch of one option for the S3 response bodies is at the end of this list.
[ ] Archive the data somewhere.
[ ] Replace https://monitoring.envirodatagov.org/ and https://api.monitoring.envirodatagov.org/ with a tombstone page describing the project and its current status, where to find archives if publicly available, etc.
[ ] Shut down all running services and resources in AWS.
[ ] Clean up dangling, irrelevant issues and PRs in all repos. PRs should generally be closed. I like to keep open any issues that someone forking the project might want to address, but close others that would not be relevant in that context.
[ ] Update maintenance status notices if needed on repo READMEs.
[ ] Archive all relevant repos.
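Re: the data-archiving items above, here's a rough sketch (assumptions only, not the actual archive tooling) of one way to pull the raw response bodies out of S3 and pack them into per-prefix `.tar.gz` bundles, matching the 3-character-prefix grouping discussed earlier. The bucket name, the assumption that object keys are the content hashes, and the `bundle_prefix` helper are all placeholders.

```python
# Sketch: bundle all S3 objects under a given key prefix into one .tar.gz file.
import io
import tarfile

import boto3

BUCKET = "example-response-bodies"  # placeholder bucket name
PREFIX_LENGTH = 3

def bundle_prefix(s3, prefix: str, out_path: str) -> None:
    """Download every object whose key starts with `prefix` into one .tar.gz."""
    paginator = s3.get_paginator("list_objects_v2")
    with tarfile.open(out_path, "w:gz") as tar:
        for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
                info = tarfile.TarInfo(name=obj["Key"])
                info.size = len(body)
                tar.addfile(info, io.BytesIO(body))

if __name__ == "__main__":
    s3 = boto3.client("s3")
    # e.g. bundle everything whose hash starts with "a1b" into a1b.tar.gz
    bundle_prefix(s3, "a1b", "a1b.tar.gz")
```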