edgi-govdata-archiving / web-monitoring-db

An HTTP API for tracking and annotating changes to a set of web pages.
https://api.monitoring.envirodatagov.org/
GNU General Public License v3.0

Add heuristics to identify error pages with 200 status codes #468

Closed (Mr0grog closed 1 year ago)

Mr0grog commented 5 years ago

Some sites have error pages that don’t respond with actual error status codes (i.e. they have a 200 status code instead of >= 400). For example: https://www.eia.gov/tools/models/datatools.cfm

Since we are now elevating status codes to a first-class value, it might be nice to have some heuristics for identifying these sorts of pages and assigning more accurate statuses to them. (Or even just a way for analysts to inform us of the situation when they run across one.)

I think the best place for this work is here in DB as a job, but it’s possible we should be putting it in -processing and -scraper instead.

I think there are probably a few other open questions here, too:

danielballan commented 5 years ago

Heresy: This is finally a problem that machine learning might actually be the right tool for.

Mr0grog commented 5 years ago

I need to cobble together some more examples, but one good one is https://www.epa.gov/climatechange/ (see it in the API at https://api.monitoring.envirodatagov.org/api/v0/pages/486340f9-3f4c-40bc-bd4d-d33ff020d609/versions). It redirects to https://www.epa.gov/sites/production/files/signpost/cc.html, which is effectively a 404, but returns a 200 status.

Mr0grog commented 4 years ago

The full page content definitely provides more useful clues, and I think @danielballan is right that this is actually one of the problems we have that’s well-suited to ML, BUT as a simpler, shorter-term solution, examining page titles for telltale patterns might help.

I created a list of all page titles we’ve seen for versions with 200 status codes (or ones we don’t have status codes for) and posted it on Qri.cloud here: https://qri.cloud/mr0grog/web_monitoring_page_titles

A quick skim and search for interesting terms (e.g. “error”) found the following titles that represent error pages (I removed the many more that clearly aren’t errors):

403 disallowed
403 forbidden
404 - file not found | occupational safety and health administration
404 - file or directory not found.
404 error
404 not found
404 notice
404 page not found | fema.gov
404 | department of energy
access denied
access denied | bureau of safety and environmental enforcement
access denied | department of energy
access denied | science mission directorate
access denied | u.s. department of the interior
data center 404 error
eere: page not found
eia - sorry! unexpected error
error
error 404 | national centers for environmental information (ncei) formerly known as national climatic data center (ncdc)
error in assembly
error page 404 file not found | national-academies.org | where the nation turns for independent, expert advice
error | earthdata
error | iris | us epa
error: the requested url could not be retrieved
file not found | occupational safety and health administration
health & safety features archive<b>error processing ssi file</b><br>
help finding information | us epa
invalid url
ncep news [an error occurred while processing this directive]
ndbc - the page that you requested is not found
noaa esrl csd: file not found
nomads web interface: input error
not found
oracle access manager operation error
osha error message - 403 | occupational safety and health administration
page being updated | us epa
page has moved
page not found
page not found (404)
page not found - 404 | boem
page not found | bureau of land management
page not found | u.s. department of the interior
page not found | u.s. doe office of science (sc)
page not found | u.s. fish & wildlife service
page not found | u.s. fish &amp; wildlife service
page not found | us epa
page not found: error 404
page or resource not found (404 error) | national centers for environmental information (ncei)
page unavailable | whitehouse.gov
policies and regulations<b>error processing ssi file</b><br>
public health response to a changing climate<b>error processing ssi file</b><br>
query error
radar operations center - file not found
radar operations center - no access
request rejected
requested page not found (404)
resource not found | us environmental protection agency
restricted access | us epa
server error | federal highway administration
service is currently unavailable
sign up for our features<b>error processing ssi file</b><br>
temporary server error - 6
temporary server error - cmsb
western region errors page
your page was not found on the office of surface mining reclamation and enforcement website

And these might be good examples of things that look sort of error-ish but aren’t:

clean water act section 404(c) "veto authority" factsheet
state or tribal assumption of the cwa section 404 permit program | section 404 of the clean water act: permitting discharges of dredge or fill material | us epa
401 info e-mail correspondence | freedom of information act (foia) | us epa
ferc: eqr - 2nd quarter 2013 and earlier - refrences and help - error messages
osha website error report form | occupational safety and health administration
unauthorized use | bureau of land management

Then the full list would be good for testing heuristics against.
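
As a rough illustration of what a title-based heuristic could look like, here is a minimal sketch in Python. It is not code from this project: the function name `looks_like_error_title` and the pattern list are hypothetical, chosen only to separate some of the error titles above from the error-ish non-errors. It would need tuning against the full title list before anyone relied on it.

```python
import re

# Hypothetical patterns, derived only from the example titles in this issue.
# They are anchored or phrase-based so that titles like
# "clean water act section 404(c) ..." do not match.
ERROR_TITLE_PATTERNS = [
    r"^(403|404)\b",                          # "404 not found", "403 forbidden"
    r"^error\b",                              # "error | earthdata", "error in assembly"
    r"\bnot found\b",                         # "page not found | us epa"
    r"\baccess denied\b",                     # "access denied | department of energy"
    r"\b404 error\b|\berror 404\b|\(404\)",   # "data center 404 error", "page not found (404)"
    r"\bserver error\b",                      # "temporary server error - 6"
]
ERROR_TITLE_RE = re.compile("|".join(ERROR_TITLE_PATTERNS))


def looks_like_error_title(title):
    """Guess whether a page title suggests an error page served with a 200 status."""
    normalized = (title or "").strip().lower()
    return ERROR_TITLE_RE.search(normalized) is not None


if __name__ == "__main__":
    samples = [
        "404 | department of energy",
        "page not found | u.s. fish & wildlife service",
        'clean water act section 404(c) "veto authority" factsheet',
        "osha website error report form | occupational safety and health administration",
    ]
    for sample in samples:
        print(looks_like_error_title(sample), "-", sample)
```

This sketch deliberately favors avoiding false positives, so it misses several of the error titles listed above (for example “invalid url”, “request rejected”, and “eia - sorry! unexpected error”). Running it over the full list would show how many it misses and how many non-error pages it flags.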

Mr0grog commented 3 years ago

Example case where title doesn’t help (we know there are probably plenty of these, but adding for posterity and gut-check-ness):

Mr0grog commented 1 year ago

See also some possible heuristics I wrote in https://github.com/edgi-govdata-archiving/web-monitoring-db/issues/751#issuecomment-689153077