edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")
Creative Commons Attribution Share Alike 4.0 International

Flag pages that are dead/removed #122

Closed Mr0grog closed 5 years ago

Mr0grog commented 6 years ago

One really useful feature for analysts would be to flag pages that have been removed. There’s no objective or even canonical way to monitor this (after all, we’ve seen plenty of servers that respond erroneously with 404s when they are under heavy load and unable to return a page that exists, as well as intermittent 404s when a page disappears for only a few days). So for now we are thinking:

A removed page is one that has responded with a 404 continuously for 14 days.

How might we measure this?

  1. A script that checks Wayback on a regular schedule for the status of every snapshot of a page over the past 14 days, similar to the generalized ETL script we’re currently working on. It probably wouldn’t run daily, since it’s a big query. This would also require adding the ability to POST updates to a page in -db.

  2. Seeing no new, different versions added to -db over a 14 day period when the most recent version was a 404. -db could easily manage its own async job to check and update this on a regular schedule.

Neither of these is perfect. (1) would be reliable, but only for URLs that we have Wayback checking for us on a regular basis; for other URLs, all bets are off. (2) would be much easier to implement and maintain, but doesn’t give us a way to distinguish between pages that are persistently 404s and pages that our sources (like Versionista, Wayback, and Walk) have stopped monitoring and whose last snapshot was a 404 (a page could always come back, but if we’ve stopped monitoring, we won’t know).
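For illustration, option (2) could look roughly like the Python sketch below. It is not how -db works today: the `Version` tuple, the `looks_removed` name, and the idea of passing in a plain list of versions are all assumptions made for the example. It treats a page as removed when the unbroken run of 404s at the end of its history started at least 14 days ago.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple, Sequence


class Version(NamedTuple):
    """Hypothetical stand-in for a version record in -db."""
    capture_time: datetime
    status: int


def looks_removed(
    versions: Sequence[Version],
    now: datetime,
    window: timedelta = timedelta(days=14),
) -> bool:
    """Return True if the trailing run of 404 versions began at least `window` ago."""
    ordered = sorted(versions, key=lambda v: v.capture_time)
    if not ordered or ordered[-1].status != 404:
        return False

    # Walk backwards to find where the current unbroken run of 404s began.
    run_start = ordered[-1].capture_time
    for version in reversed(ordered):
        if version.status != 404:
            break
        run_start = version.capture_time

    return now - run_start >= window


if __name__ == "__main__":
    utc = timezone.utc
    history = [
        Version(datetime(2019, 2, 1, tzinfo=utc), 200),
        Version(datetime(2019, 2, 10, tzinfo=utc), 404),
        Version(datetime(2019, 2, 20, tzinfo=utc), 404),
    ]
    print(looks_removed(history, now=datetime(2019, 3, 1, tzinfo=utc)))
```

Note that this has exactly the weakness described above: if our sources stop capturing a page after a 404, the run keeps "aging" even though nothing is actually checking the page.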

I’d tend towards (2). Even with its imperfections, I think it’s probably good enough, is way lower impact to measure, and way lower effort to implement. In the future we could even make it more reliable by making HEAD requests for the pages we think are removed to make sure they are still 404s. (That said, depending on how Walk works out, it might make doing this easier.)
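The HEAD-request follow-up could be sketched like this, using only the Python standard library. The function names (`head_status`, `confirm_removed`) are made up for the example, and it assumes the server answers HEAD the same way it answers GET, which is not always true.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def head_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send a HEAD request and return the HTTP status code, or None on network failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the status code is on the exception.
        return error.code
    except OSError:
        # Covers URLError, timeouts, and connection errors: no status available.
        return None


def confirm_removed(
    url: str,
    fetch_status: Callable[[str], Optional[int]] = head_status,
    gone_codes: Iterable[int] = (404,),
) -> bool:
    """Return True only if a fresh check still reports one of `gone_codes`."""
    return fetch_status(url) in tuple(gone_codes)
```

A network failure returns `None` rather than counting as a 404, so a page is only confirmed as removed when the server actually says so.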

@danielballan @jsnshrmn any thoughts here?

Mr0grog commented 6 years ago

Useful addendum from @ericnost here — some pages are also persistent 403, e.g. https://www.epa.gov/ghgreporting/subpart-w-basic-information, and that would be nice to flag, too.

Not sure on the best way to handle these needs, but some thoughts:

Mr0grog commented 5 years ago

Had a good conversation with @danielballan about this just now. We talked through a variety of solutions, but where we landed:

  1. We should add a status field to Page objects and determine it by a two-step process (calculated over a window of the last N days): (edgi-govdata-archiving/web-monitoring-db#451)

    1. Is the status code successful (i.e. < 400) more than X% of the time? If yes, mark the page’s status as 200. (X should be a configurable threshold; we think this should probably be 50%-80%, but not sure.)
    2. Otherwise, find the most common error status code (i.e. >= 400) during the time window and use that as the page’s status.
  2. It’s probably important to find a more detailed way to display that information on the page details view in the UI. In addition to the above status code, we should provide some kind of histogram, graph, red/yellow/green confidence, or something to explain whether a page is constantly flipping, or whether its status codes have otherwise varied a lot. (edgi-govdata-archiving/web-monitoring-ui#331)
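As a rough sketch of the two-step calculation in (1), here is one way it might look in Python. The `effective_status` name and the 0.75 default threshold are placeholders, not decisions. It also assumes "X% of the time" means a share of captured versions in the window rather than a share of elapsed time.

```python
from collections import Counter
from typing import Optional, Sequence


def effective_status(
    status_codes: Sequence[int],
    success_threshold: float = 0.75,
) -> Optional[int]:
    """Summarize the status codes seen in a time window as a single page status.

    `status_codes` holds one code per version captured in the window.
    """
    if not 0 <= success_threshold < 1:
        raise ValueError("success_threshold must be in the range [0, 1)")
    if not status_codes:
        return None  # No versions in the window, so there is nothing to summarize.

    # Step 1: if successes (< 400) exceed the threshold, call the page a 200.
    successes = sum(1 for code in status_codes if code < 400)
    if successes / len(status_codes) > success_threshold:
        return 200

    # Step 2: otherwise report the most common error code (>= 400).
    errors = Counter(code for code in status_codes if code >= 400)
    return errors.most_common(1)[0][0]


if __name__ == "__main__":
    print(effective_status([200, 200, 200, 200, 404]))
    print(effective_status([404, 404, 403, 200]))
```

Counting by version means a source that captures a page very frequently has more influence on the result than one that captures it rarely; weighting by time between captures would be an alternative.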

In the meantime, I need to file issues for the above and run some queries to try and get an idea of what the right success/error threshold is (1.1 above).

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Mr0grog commented 5 years ago

🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉