edgi-govdata-archiving / web-monitoring-diff

Tools for diffing and comparing web content. Also includes a web server that makes diffs available as an HTTP service.
https://web-monitoring-diff.readthedocs.io/
GNU General Public License v3.0
11 stars 4 forks source link

`html_render_diff` can fail in diffing server if content has an invalid media type #75

Closed Mr0grog closed 2 years ago

Mr0grog commented 3 years ago

Some content in EDGI’s Web Monitoring database occasionally fails html_render_diff with the error “a is not an HTML document” or “b is not an HTML document” even when we know both a and b are HTML documents.

One example is diffing these two versions:

  1. https://api.monitoring.envirodatagov.org/api/v0/pages/c951dc75-bf76-4b3c-9b9b-4bb7294a9524/versions/1134a6dc-04d3-47a2-93b0-ba49aac28536 (content can be found at: https://edgi-wm-archive.s3.amazonaws.com/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855)
  2. https://api.monitoring.envirodatagov.org/api/v0/pages/c951dc75-bf76-4b3c-9b9b-4bb7294a9524/versions/593eab94-3baf-40c0-af70-e4e68d512312 (content can be found at: https://edgi-wm-archive.s3.amazonaws.com/afd8aa1476462e5fbf1a698253be9c928384e41b0e482ac35ee33b9244597d81)

That translates to the following request to the diffing server:

/html_token?a=https://edgi-wm-archive.s3.amazonaws.com/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855&b=https://edgi-wm-archive.s3.amazonaws.com/afd8aa1476462e5fbf1a698253be9c928384e41b0e482ac35ee33b9244597d81

I think what’s happening in this case is that the Content-Type header for one of the versions is malformed (the header is Content-Type: #<mime::nulltype:0x007f2a523499b8>; charset=utf-8) and that’s causing a problem when we try to check whether the content could be HTML in is_not_html(): https://github.com/edgi-govdata-archiving/web-monitoring-diff/blob/07b5d1e329cf387e6d2088232050d20b7f7b39d0/web_monitoring_diff/content_type.py#L45-L76

…but I haven’t checked in detail. It could also be something to do with the fact that the content is zero length.

Mr0grog commented 3 years ago

Assuming the issue is the malformed header, it seems like we should just skip over the content-type header checking if the value is an invalid media type string, just like we do if the header isn’t set in the first place.

Checking validity might be as simple as looking for a / in the value, or as complex as rigorously checking the syntax according to RFC 2045 and RFC 6838.