`html_render_diff` can fail in diffing server if content has an invalid media type

Some content in EDGI’s Web Monitoring database occasionally fails `html_render_diff` with the error “a is not an HTML document” or “b is not an HTML document”, even when we know both `a` and `b` are HTML documents.

One example is diffing these two versions:

That translates to the following request to the diffing server:

/html_token?a=https://edgi-wm-archive.s3.amazonaws.com/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855&b=https://edgi-wm-archive.s3.amazonaws.com/afd8aa1476462e5fbf1a698253be9c928384e41b0e482ac35ee33b9244597d81

I think what’s happening in this case is that the Content-Type header for one of the versions is malformed (the header is `Content-Type: #<mime::nulltype:0x007f2a523499b8>; charset=utf-8`), and that causes a problem when we check whether the content could be HTML in `is_not_html()`: https://github.com/edgi-govdata-archiving/web-monitoring-diff/blob/07b5d1e329cf387e6d2088232050d20b7f7b39d0/web_monitoring_diff/content_type.py#L45-L76
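To illustrate the suspected failure mode, here is a minimal sketch of a more lenient check. It is not the code in `content_type.py`; the names `parse_media_type` and `is_definitely_not_html` are hypothetical, and the idea that a malformed media type should count as "unknown" rather than "not HTML" is an assumption about the desired behavior.

```python
import re

# Hypothetical helpers for illustration only; this is NOT the project's
# actual is_not_html() implementation.

# A syntactically plausible media type looks like "type/subtype".
MEDIA_TYPE_PATTERN = re.compile(r"^[A-Za-z0-9!#$&^_.+-]+/[A-Za-z0-9!#$&^_.+-]+$")

# Assumed set of media types that count as HTML for this sketch.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def parse_media_type(header):
    """Return the lowercased media type, or None if it is missing or malformed."""
    if not header:
        return None
    # Drop parameters such as "; charset=utf-8".
    media_type = header.split(";", 1)[0].strip().lower()
    return media_type if MEDIA_TYPE_PATTERN.match(media_type) else None


def is_definitely_not_html(header):
    """Reject content only when the header names a valid, non-HTML media type."""
    media_type = parse_media_type(header)
    if media_type is None:
        # Unknown or malformed header: don't rule out HTML on this basis alone.
        return False
    return media_type not in HTML_TYPES
```

With this approach, the malformed header above would not by itself cause the content to be rejected, and a later step (for example, sniffing the body) could still decide.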

…but I haven’t checked in detail. It could also have something to do with the content being zero length.
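One quick way to check the zero-length theory, assuming the archive's object keys are SHA-256 hashes of the response body (an assumption, not something confirmed here): the key in the `a` URL matches the well-known SHA-256 digest of an empty byte string. The helper name `looks_like_empty_body` is hypothetical.

```python
import hashlib

# SHA-256 digest of zero bytes of input.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()


def looks_like_empty_body(url):
    """Guess whether a URL points at empty content.

    Assumes the last path segment is the SHA-256 hash of the body.
    """
    key = url.rsplit("/", 1)[-1]
    return key == EMPTY_SHA256


a_url = "https://edgi-wm-archive.s3.amazonaws.com/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
b_url = "https://edgi-wm-archive.s3.amazonaws.com/afd8aa1476462e5fbf1a698253be9c928384e41b0e482ac35ee33b9244597d81"

print(looks_like_empty_body(a_url), looks_like_empty_body(b_url))
```

If the assumption about the keys holds, `a` is an empty response, so both the malformed header and the empty body are worth ruling out separately.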


`html_render_diff` can fail in diffing server if content has an invalid media type #75