Closed Mr0grog closed 6 years ago
See the issue on -versionista-scraper for details on removing the metadata. It is basically removing content that matches this regex:
/\n?<!--\s*Versionista general\s*-->[^]*?<!--\s*End Versionista general\s*-->\n?/i
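
As a rough illustration of what that regex does, here is a minimal sketch. The regex is copied from this issue; the function name `stripVersionistaMetadata` and the sample markup are made up for the example and are not taken from the scraper's actual code.

```javascript
// Pattern copied from the issue text. `[^]` is a JavaScript-specific way to
// match any character, including newlines, so the match can span lines.
const VERSIONISTA_METADATA =
  /\n?<!--\s*Versionista general\s*-->[^]*?<!--\s*End Versionista general\s*-->\n?/i;

// Hypothetical helper: returns the content with the metadata block removed.
// There is no `g` flag, so only the first matching block is stripped.
function stripVersionistaMetadata(rawContent) {
  return rawContent.replace(VERSIONISTA_METADATA, '');
}

// Made-up sample page with an injected metadata block.
const sample = [
  '<html>',
  '<!-- Versionista general -->',
  '<script>/* injected by Versionista */</script>',
  '<!-- End Versionista general -->',
  '<body>Hello</body></html>'
].join('\n');

console.log(stripVersionistaMetadata(sample));
// prints "<html><body>Hello</body></html>"
```

Note that the optional `\n?` on each side means the line breaks surrounding the block are removed along with it.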
This was accomplished at the same time as edgi-govdata-archiving/web-monitoring-db#205. The quick-n-dirty script I used is in a gist for reference: https://gist.github.com/Mr0grog/2499576e144ec7e0e9256737d2f65a6e
Near the end of last year, Versionista changed the formatting of its metadata, and we started failing to remove it from the raw data we store (edgi-govdata-archiving/web-monitoring-versionista-scraper#56). We fixed the issue, but it looks like I never went back and fixed up the bad data.
For every version we captured from Versionista between 2017-11-01 and 2017-12-15, we need to remove the leftover metadata from the stored raw data and update the version_hash field in the DB to match.

This is a one-off task.
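
For reference, the per-version fix could look roughly like the sketch below. This is not the script from the gist. The record shape (`captureTime`, `body`, `versionHash`), the function name `repairVersion`, the choice of SHA-256 as a hex string for the hash, and treating the date range as whole days in UTC are all assumptions made for illustration.

```javascript
const crypto = require('crypto');

// Regex quoted in this issue for matching the Versionista metadata block.
const METADATA =
  /\n?<!--\s*Versionista general\s*-->[^]*?<!--\s*End Versionista general\s*-->\n?/i;

// Affected window from the issue. Assumption: whole days in UTC, with
// 2017-12-15 included, hence the exclusive upper bound on 2017-12-16.
const WINDOW_START = Date.parse('2017-11-01T00:00:00Z');
const WINDOW_END = Date.parse('2017-12-16T00:00:00Z');

// Hypothetical record shape: { captureTime, body, versionHash }.
// Returns the same object when nothing needs fixing, otherwise a repaired copy.
function repairVersion(version) {
  const captured = Date.parse(version.captureTime);
  if (captured < WINDOW_START || captured >= WINDOW_END) return version;

  const cleaned = version.body.replace(METADATA, '');
  if (cleaned === version.body) return version; // no metadata found

  // Assumption: the hash is a hex-encoded SHA-256 of the stored content.
  const versionHash = crypto
    .createHash('sha256')
    .update(cleaned, 'utf8')
    .digest('hex');

  return { ...version, body: cleaned, versionHash };
}
```

Returning the original object untouched for out-of-window or already-clean versions makes it easy for a caller to skip writes for records that did not change.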