edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")
Creative Commons Attribution Share Alike 4.0 International

Remove Versionista metadata from Versions 2017-11-01 through 2017-12-15 #95

Closed: Mr0grog closed this issue 6 years ago

Mr0grog commented 6 years ago

Near the end of last year, Versionista changed the formatting of its metadata, and our scraper started failing to remove it from the raw data we store (edgi-govdata-archiving/web-monitoring-versionista-scraper#56). We fixed the scraper, but it looks like I never went back and fixed up the bad data.

For every version we captured from Versionista between 2017-11-01 and 2017-12-15, we need to:

  1. Remove metadata from S3 content and re-upload it to S3
  2. Update the version_hash field in the DB
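
As a rough sketch of the two steps above (not the script that was actually run), the per-version work could look something like the following. Several things here are assumptions: that `version_hash` is the hex-encoded SHA-256 of the stored body, that the date range is inclusive and in UTC, and that a version record looks like `{capture_time, body}`. The `clean` argument is a hypothetical stand-in for whatever function strips the metadata, and the S3 re-upload and DB update are deliberately left to the caller.

```javascript
const crypto = require('crypto');

// Assumption: version_hash is the hex-encoded SHA-256 of the stored body.
function sha256Hex(body) {
  return crypto.createHash('sha256').update(body).digest('hex');
}

// Date window from this issue, treated as inclusive and UTC (an assumption).
const WINDOW_START = Date.parse('2017-11-01T00:00:00Z');
const WINDOW_END = Date.parse('2017-12-15T23:59:59.999Z');

function inWindow(captureTime) {
  const time = new Date(captureTime).getTime();
  return time >= WINDOW_START && time <= WINDOW_END;
}

// Hypothetical record shape: {capture_time, body}.
// `clean` is any function that takes the raw body and returns it without
// the Versionista metadata.
// Returns null if the version needs no changes; otherwise returns the new
// body (to re-upload) and the new version_hash (to write to the DB).
function planFix(version, clean) {
  if (!inWindow(version.capture_time)) return null;

  const cleaned = clean(version.body);
  if (cleaned === version.body) return null;

  return { body: cleaned, version_hash: sha256Hex(cleaned) };
}
```

Returning a plan instead of writing directly makes it easy to do a dry run first and to skip versions that were already clean.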

This is a one-off task.

Mr0grog commented 6 years ago

See the issue on -versionista-scraper for details on removing the metadata. It is basically removing content that matches this regex:

/\n?<!--\s*Versionista general\s*-->[^]*?<!--\s*End Versionista general\s*-->\n?/i
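
For illustration, here is a minimal sketch of applying that regex in JavaScript. The function name and the sample HTML are made up for this example; only the regex itself comes from this issue.

```javascript
// The regex from this issue. `[^]` is JavaScript's "any character,
// including newlines" class, so the lazy `[^]*?` can span multiple lines.
const VERSIONISTA_METADATA =
  /\n?<!--\s*Versionista general\s*-->[^]*?<!--\s*End Versionista general\s*-->\n?/i;

// Hypothetical helper name. There is no `g` flag, so only the first
// metadata block is removed, which is how the regex above is written.
function removeVersionistaMetadata(html) {
  return html.replace(VERSIONISTA_METADATA, '');
}

// Made-up sample: a page body followed by an injected metadata block.
const sample = [
  '<html><body>Hello</body></html>',
  '<!-- Versionista general -->',
  '<script>/* injected metadata */</script>',
  '<!-- End Versionista general -->',
  ''
].join('\n');

console.log(JSON.stringify(removeVersionistaMetadata(sample)));
```

The optional `\n?` on each side means the newlines directly around the block are removed along with it, so the cleaned content does not end up with a blank line where the metadata used to be.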

Mr0grog commented 6 years ago

This was accomplished at the same time as edgi-govdata-archiving/web-monitoring-db#205. The quick-n-dirty script I used is in a gist for reference: https://gist.github.com/Mr0grog/2499576e144ec7e0e9256737d2f65a6e