edgi-govdata-archiving / web-monitoring-processing

Tools for access, "diff"-ing, and analyzing archived web pages
https://edgi-govdata-archiving.github.io/web-monitoring-processing
GNU General Public License v3.0

“FERC Calendar of Events” page fails html_token diff (side-by-side only) #143

Closed: Mr0grog closed this issue 6 years ago

Mr0grog commented 6 years ago

The diff raises the following exception:

HTTPServerRequest(protocol='http', host='localhost:8888', method='GET', uri='/html_token?format=json&include=all&a=https%3A%2F%2Fedgi-versionista-archive.s3.amazonaws.com%2Fversionista2%2F74286-6216116%2Fversion-14162513.html&a_hash=04dd4e2995e4276a06de74c1c4152253b03c2bd87889fc1a27ca291f74183115&b=https%3A%2F%2Fedgi-versionista-archive.s3.amazonaws.com%2Fversionista2%2F74286-6216116%2Fversion-14182760.html&b_hash=cc741ce9d740891a49943427509d229a8bb69ef5763b4a2247b033c29c1612f5', version='HTTP/1.1', remote_ip='::1', headers={'Accept-Encoding': 'gzip;q=1.0,deflate;q=0.6,identity;q=0.3', 'Accept': '*/*', 'User-Agent': 'Ruby', 'Connection': 'close', 'Host': 'localhost:8888'})
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/rbrackett/.pyenv/versions/3.6.1/lib/python3.6/concurrent/futures/process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/Users/rbrackett/Dev/datarescue/web-monitoring-processing/web_monitoring/diffing_server.py", line 155, in caller
    return func(**kwargs)
  File "/Users/rbrackett/Dev/datarescue/web-monitoring-processing/web_monitoring/html_diff_render.py", line 265, in html_diff_render
    soup.head.append(change_styles)
AttributeError: 'NoneType' object has no attribute 'append'

Mr0grog commented 6 years ago

Looks like the page in question is woefully malformed. Here’s the beginning of the source:


  
<html>
<body>
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 
<head><title>
    FERC: Calendar of Events
</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><meta name="date" content="September 13, 2017 08:00:00 GMT">
...

So between that and the exception, it looks like the misplaced <head> tag (it opens inside <body>, after a stray doctype) never becomes a <head> element for Beautiful Soup to find. I thought I’d tested that scenario, but clearly not!
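
For illustration, here is a minimal sketch of one way to guard against a missing `<head>` before appending styles. It is not the project's actual fix: `ensure_head` is a hypothetical helper, the style rule is made up, and the example uses the `html.parser` backend on a document with no `<head>` at all, since the thread does not say which parser produced the failure.

```python
from bs4 import BeautifulSoup


def ensure_head(soup):
    """Return the document's <head>, creating an empty one if it is missing.

    Hypothetical helper for illustration; not part of web_monitoring.
    """
    if soup.head is not None:
        return soup.head
    head = soup.new_tag('head')
    if soup.html is not None:
        # Put the new <head> first inside <html>, ahead of <body>.
        soup.html.insert(0, head)
    else:
        # No <html> element either: fall back to the top of the document.
        soup.insert(0, head)
    return head


# A document with no <head>; html.parser does not invent missing elements,
# so soup.head is None, which is the condition behind the AttributeError.
soup = BeautifulSoup('<html><body><p>Hello</p></body></html>', 'html.parser')
assert soup.head is None

# Stand-in for the change_styles element from the traceback.
change_styles = soup.new_tag('style')
change_styles.string = 'ins { background: #d4fcbc; }'
ensure_head(soup).append(change_styles)
```

Calling `ensure_head(soup).append(...)` instead of `soup.head.append(...)` keeps the happy path unchanged while covering pages where the parser yields no `<head>`.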

Mr0grog commented 6 years ago

Ha! I totally did test it, but only before we made it possible to split the diff into separate insertion/deletion views: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/d647c53957fde542a3a4fdabc3335c2b5bd19051/web_monitoring/html_diff_render.py#L207-L210