edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")
Creative Commons Attribution Share Alike 4.0 International

Shut Down & Archive Web Monitoring Projects #170

Open Mr0grog opened 1 year ago

Mr0grog commented 1 year ago

In #168, we ramped down to bare-bones maintenance and minimized what services we were running in production. That’s served the project well for the first half of 2023, but funding is drying up and it’s now time to shut things down entirely.

This does not apply to two subprojects that are actively used outside EDGI:

  1. wayback
  2. web-monitoring-diff

To Do:

Mr0grog commented 1 year ago

Quick updates:

Mr0grog commented 1 year ago

Re: combining content-addressed data into larger files, here are some stats on grouping by different length prefixes:

prefix_length  groups     count min  count avg  count max  bytes min       bytes avg       bytes max
2              256        52,316     52,816     53,399     3,409,539.55kB  3,484,415.56kB  3,629,627.39kB
3              4,096      3,102      3,301      3,487      198,391.51kB    217,775.97kB    404,263.71kB
4              65,536     142        206        267        8,900.81kB      13,611.00kB     192,541.24kB
5              1,048,576  1          13         34         0.92kB          850.69kB        178,881.65kB
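
For anyone wanting to reproduce this kind of breakdown, here is a minimal sketch of the grouping. It assumes the content-addressed files are keyed by a hex digest (the group counts in the table are powers of 16, which is consistent with hex prefixes) and that you already have a mapping of hash to size in bytes. The function name `prefix_stats` and the sample data are made up for illustration; this is not the query that produced the table.

```python
import hashlib
from collections import defaultdict
from statistics import mean


def prefix_stats(sizes_by_hash, prefix_length):
    """Group {hex_hash: size_in_bytes} by the first `prefix_length` hex characters."""
    counts = defaultdict(int)
    total_bytes = defaultdict(int)
    for hex_hash, size in sizes_by_hash.items():
        prefix = hex_hash[:prefix_length]
        counts[prefix] += 1
        total_bytes[prefix] += size
    return {
        "groups": len(counts),
        "count_min": min(counts.values()),
        "count_avg": mean(counts.values()),
        "count_max": max(counts.values()),
        "bytes_min": min(total_bytes.values()),
        "bytes_avg": mean(total_bytes.values()),
        "bytes_max": max(total_bytes.values()),
    }


# Made-up sample data: 1,000 small blobs keyed by their SHA-256 hex digest.
# (SHA-256 is an assumption here; any hex digest groups the same way.)
sample = {}
for i in range(1000):
    body = f"example response body {i}".encode()
    sample[hashlib.sha256(body).hexdigest()] = len(body)

for n in (1, 2):
    print(n, prefix_stats(sample, n))
```

Each extra hex character multiplies the number of possible groups by 16, which is why the per-group counts and sizes drop so quickly as the prefix gets longer.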

Note this doesn’t account for how big the files will be after compression (conservative guess is 25%-50% of the bytes listed in the table).

I think that puts 3 as a good prefix length (large but manageably sized files, and not too many of them, though still a lot). 2 might also be reasonable, depending on what we see for typical compression ratios: at 25%-50%, the roughly 3.5 GB groups would land somewhere around 0.9-1.7 GB each (I think we should avoid files > 1 GB).
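
To make that size reasoning concrete, here is a rough sketch of the arithmetic. It applies the guessed 25%-50% compression range to the per-group averages and maximums from the table above and flags anything that could end up over 1 GB. The ratios are only the guess stated above, not measurements, and treating 1 GB as 1,000,000 kB is an assumption.

```python
# (prefix_length, avg kB per group, max kB per group), copied from the table above.
RAW_KB = [
    (2, 3_484_415.56, 3_629_627.39),
    (3, 217_775.97, 404_263.71),
    (4, 13_611.00, 192_541.24),
    (5, 850.69, 178_881.65),
]
LIMIT_KB = 1_000_000  # 1 GB, assuming 1 GB = 1,000,000 kB
RATIOS = (0.25, 0.50)  # guessed compressed size as a fraction of raw size


def estimate(raw_kb, ratios=RATIOS):
    """Return the (low, high) compressed size estimates in kB for one raw size."""
    return tuple(raw_kb * ratio for ratio in ratios)


for prefix_length, avg_kb, max_kb in RAW_KB:
    avg_low, avg_high = estimate(avg_kb)
    max_low, max_high = estimate(max_kb)
    verdict = "may exceed 1 GB" if max_high > LIMIT_KB else "stays under 1 GB"
    print(
        f"prefix {prefix_length}: "
        f"avg ~{avg_low / 1e6:.2f}-{avg_high / 1e6:.2f} GB, "
        f"largest ~{max_low / 1e6:.2f}-{max_high / 1e6:.2f} GB ({verdict})"
    )
```

Once real compression ratios are known, swapping them into `RATIOS` should show whether a prefix length of 2 is workable.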

Viable formats:

  1. Zip. Widely supported, straightforward, and supports random access (unlike .tar.gz). Good-ish compression.
  2. SQLite Archive. Not as widely supported, but SQLite databases in general are (and this is just a particular database structure). Supports not just random access but all manner of fancy querying (could feasibly work with Datasette + the datasette-media plugin). Definitely more complex than zips, though.
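
As a rough illustration of what random access looks like in each format, here is a minimal sketch using only the Python standard library. The blob names and contents are made up. The `sqlar` table layout follows the SQLite Archive convention as I understand it (zlib-compressed `data`, stored uncompressed when compression does not help, with `sz` holding the original size); the helper `read_sqlar` is hypothetical, not part of any existing tooling.

```python
import io
import sqlite3
import zipfile
import zlib

# Made-up blobs standing in for content-addressed response bodies.
blobs = {
    "abc123": b"<html>first</html>" * 50,
    "abd456": b"<html>second</html>" * 50,
}

# Option 1: zip. The central directory lets us read one member
# without decompressing the others.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name, data in blobs.items():
        archive.writestr(name, data)

with zipfile.ZipFile(zip_buffer) as archive:
    from_zip = archive.read("abd456")

# Option 2: SQLite Archive. An ordinary SQLite database with one `sqlar` table.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE sqlar(name TEXT PRIMARY KEY, mode INT, mtime INT, sz INT, data BLOB)"
)
for name, data in blobs.items():
    packed = zlib.compress(data)
    # Keep the raw bytes when compression does not make them smaller.
    stored = packed if len(packed) < len(data) else data
    db.execute(
        "INSERT INTO sqlar VALUES (?, ?, ?, ?, ?)",
        (name, 0o100644, 0, len(data), stored),
    )
db.commit()


def read_sqlar(connection, name):
    """Fetch one entry by name; decompress only if it was stored compressed."""
    sz, data = connection.execute(
        "SELECT sz, data FROM sqlar WHERE name = ?", (name,)
    ).fetchone()
    return data if len(data) == sz else zlib.decompress(data)


from_sqlar = read_sqlar(db, "abc123")
print(len(from_zip), len(from_sqlar))
```

The zip route needs nothing beyond a member name, while the SQLite route costs a bit more setup but allows arbitrary SQL over the archive's contents.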

Mr0grog commented 1 year ago

Added some preliminary tooling for exporting the DB as a SQLite file at edgi-govdata-archiving/web-monitoring-db#1104. It's gonna be big (not sure how much, but my relatively puny local test DB is 46 MB raw, 5.5 MB gzipped), but this approach probably keeps it the most explorable for researchers. (Other alternatives here include gzipped NDJSON files, Parquet, Feather, or CSV [worst option IMO].)
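
For comparison, here is a minimal sketch of the gzipped NDJSON alternative: one JSON object per line, written through a gzip stream. The record fields and file name are hypothetical and do not reflect the actual web-monitoring-db schema or the export tooling in the linked PR.

```python
import gzip
import json
import os
import tempfile


def write_ndjson_gz(path, records):
    """Write an iterable of dicts as gzipped newline-delimited JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as output:
        for record in records:
            output.write(json.dumps(record) + "\n")


def read_ndjson_gz(path):
    """Yield records back one line at a time, so large files can be streamed."""
    with gzip.open(path, "rt", encoding="utf-8") as source:
        for line in source:
            yield json.loads(line)


# Hypothetical rows; the real tables have different columns.
records = [
    {"url": "https://example.gov/", "capture_time": "2023-06-01T00:00:00Z", "body_hash": "abc123"},
    {"url": "https://example.gov/about", "capture_time": "2023-06-02T00:00:00Z", "body_hash": "abd456"},
]

path = os.path.join(tempfile.mkdtemp(), "versions.ndjson.gz")
write_ndjson_gz(path, records)
restored = list(read_ndjson_gz(path))
print(len(restored), "records restored")
```

This streams well and is easy to parse in any language, but unlike a SQLite file it cannot be queried without loading it into something else first.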