Closed Mr0grog closed 2 years ago
Assuming the issue is the malformed header, it seems like we should just skip over the content-type header checking if the value is an invalid media type string, just like we do if the header isn’t set in the first place.
Checking validity might be as simple as looking for a / in the value, or as complex as rigorously checking the syntax according to RFC 2045 and RFC 6838.
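As a rough illustration of those two ends of the spectrum, here is a minimal sketch. It is not code from web-monitoring-diff: the function names (media_type_of, has_slash, is_valid_media_type) are hypothetical, and the stricter check only approximates the type/subtype naming rules in RFC 6838 section 4.2 rather than validating the full RFC 2045 header grammar (parameters are simply dropped).

```python
import re

# Approximation of the RFC 6838 section 4.2 "restricted-name" rule:
# an alphanumeric first character, then up to 126 more characters
# from a limited set.
_RESTRICTED_NAME = r"[A-Za-z0-9][A-Za-z0-9!#$&\-^_.+]{0,126}"
_MEDIA_TYPE = re.compile(f"{_RESTRICTED_NAME}/{_RESTRICTED_NAME}")


def media_type_of(header_value):
    """Return the bare media type from a Content-Type value (hypothetical helper).

    Everything after the first ';' (the parameters) is dropped.
    """
    return header_value.split(";", 1)[0].strip().lower()


def has_slash(header_value):
    """Loosest check: the media type portion contains a '/'."""
    return "/" in media_type_of(header_value)


def is_valid_media_type(header_value):
    """Stricter check: type and subtype both look like RFC 6838 names."""
    return _MEDIA_TYPE.fullmatch(media_type_of(header_value)) is not None
```

Either check rejects the malformed value from this issue (#<mime::nulltype:0x007f2a523499b8>; charset=utf-8), so a caller could treat that response the same way as one with no Content-Type header at all. The two checks differ on values like a/<b>, which contains a slash but is not a valid media type name.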
Some content in EDGI’s Web Monitoring database occasionally fails html_render_diff with the error “a is not an HTML document” or “b is not an HTML document” even when we know both a and b are HTML documents.
One example is diffing these two versions:
That translates to the following request to the diffing server:
I think what’s happening in this case is that the Content-Type header for one of the versions is malformed (the header is Content-Type: #<mime::nulltype:0x007f2a523499b8>; charset=utf-8) and that’s causing a problem when we try to check whether the content could be HTML in is_not_html(): https://github.com/edgi-govdata-archiving/web-monitoring-diff/blob/07b5d1e329cf387e6d2088232050d20b7f7b39d0/web_monitoring_diff/content_type.py#L45-L76
…but I haven’t checked in detail. It could also be something to do with the fact that the content is zero length.