edgi-govdata-archiving / web-monitoring-db

An HTTP API for tracking and annotating changes to a set of web pages.
https://api.monitoring.envirodatagov.org/
GNU General Public License v3.0

Remove bad old Versionista titles #1064

Closed Mr0grog closed 1 year ago

Mr0grog commented 1 year ago

It turns out we have some bad page titles from back when we got data from Versionista (discovered while working on #1061). When a page was missing a title, Versionista showed human-friendly placeholder text like "None", "No title available", etc. We incorrectly read these in as titles. This updates old version records from Versionista by replacing those titles with an empty string.
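As a rough illustration of the replacement rule described above (not the actual code in this PR), here is a minimal Python sketch. The function name, the `source_type` argument, and the contents of `PLACEHOLDER_TITLES` are assumptions for illustration; only "None" and "No title available" are named in this description, and the full list of strings is not shown here.

```python
# Hypothetical placeholder list: only these two strings are named in the PR
# description; the real list used by the change is not shown here.
PLACEHOLDER_TITLES = {"None", "No title available"}


def clean_versionista_title(title, source_type):
    """Return '' for a Versionista placeholder title, else the title unchanged.

    `source_type` is an assumed field identifying where a version came from.
    """
    if source_type != "versionista" or title is None:
        # Only touch records that came from Versionista and have a title.
        return title
    if title.strip() in PLACEHOLDER_TITLES:
        # The page had no real title; Versionista's display text was stored.
        return ""
    return title
```

Restricting the rule to Versionista records matters because a page from another source could, in principle, really be titled "None".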

There may be additional strings we're missing here, but this is a big improvement (covers almost 32,000 records). I verified this by looking for titles used across many versions, then sampling a few of the stored response bodies to make sure they didn't actually have these strings as literal titles.
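The first verification step (finding titles shared by many versions) could be sketched roughly like this in Python. This is an illustration under assumptions, not the query that was actually run: `frequent_titles`, the dict-shaped version records, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter


def frequent_titles(versions, min_count=100):
    """List (title, count) pairs for titles shared by at least `min_count` versions.

    `versions` is assumed to be an iterable of dicts with a "title" key.
    Results are ordered from most to least common.
    """
    # Skip versions with a missing or empty title; they are already clean.
    counts = Counter(v["title"] for v in versions if v.get("title"))
    return [(title, n) for title, n in counts.most_common() if n >= min_count]
```

Titles that surface this way are only candidates. Each one still needs the second step, checking a sample of stored response bodies, before it is treated as a placeholder.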