Re: removing `_limit_spacers()`: this has a pretty big impact on DOMs with too many spacers, but not much of an impact otherwise. I was expecting the extra iteration and instance checking, etc. to be kind of expensive on large DOMs (that don’t have too many spacers), but it actually isn’t (I guess those DOMs just aren’t large enough to matter in the first place?). BUT once you start making too many spacers, this has an extremely noticeable performance impact.
So, there’s some value from that change, but only in the most extreme cases.
On the other hand, this suggests that a future where we remove the spacers altogether is also a future with much better overall performance than I’d expected. (That said, the actual diffing still takes the majority of the running time.)
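In case anyone wants to reproduce: a minimal timing sketch, assuming this module’s `html_diff_render(a_text, b_text)` entry point (double-check the actual signature in `html_render_diff.py`; the file names are placeholders for two saved pages):

```python
# Rough benchmark: time full diffs of two saved pages.
# Assumes html_diff_render(a_text, b_text) is the module's entry point;
# check html_render_diff.py for the exact signature.
import timeit

from web_monitoring_diff.html_render_diff import html_diff_render

with open('page_a.html') as f:
    a_text = f.read()
with open('page_b.html') as f:
    b_text = f.read()

runs = 5
seconds = timeit.timeit(lambda: html_diff_render(a_text, b_text), number=runs)
print(f'{seconds / runs:.3f}s per diff')
```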
There’s a stupendous amount of cleanup work I’ve been meaning to circle back and do in `html_render_diff.py` for YEARS. This is a start.

I plan to keep this PR relatively narrowly focused on refactoring, removing/replacing vestigial code, and style fixes. I’m not going to make significant behavior changes here (e.g. potentially changing the “spacer” concept, which needs a major rethink). Changes like that need a lot more careful consideration and testing, and I need time to get my head back into this space in order to do that well.
Work in progress. Still a little more to do here, although I don’t want to bite off too much. I want to:
- [x] Look at removing `_limit_spacers()` and merging it into `_insert_spacers()` (good performance gain to be had there).
- [ ] We tokenize in two steps that could potentially be one: `flatten_el` turns a DOM tree into an iterator of `(token_type, extra_info)` tuples. Then, `fixup_chunks()` turns those tuples into token objects to be diffed. https://github.com/edgi-govdata-archiving/web-monitoring-diff/blob/cb359af8d66a7e593a6a203eb4aeb5a0535bd179/web_monitoring_diff/html_render_diff.py#L767-L770

  This was done back in the day when we used tokenization from inside lxml and added our own extras on top. We no longer do that, and it’s possible the code may be clearer if we combine these steps. It also might not be! I want to prototype this and see how it feels (see the first sketch after this list). (Important: I don’t think there’s a big performance gain to be had here, since the first step produces an iterator that we use pretty efficiently. We could probably reduce some object allocations for the tuples, though.)
- [ ] Consider fixing the double parsing we do when tokenizing… `diff_elements()` calls `_diffable_fragment()`, which turns a soup DOM into a string, and passes that to `_htmldiff()`: https://github.com/edgi-govdata-archiving/web-monitoring-diff/blob/cb359af8d66a7e593a6a203eb4aeb5a0535bd179/web_monitoring_diff/html_render_diff.py#L518-L521

  Then `_htmldiff()` calls `tokenize()` on that string: https://github.com/edgi-govdata-archiving/web-monitoring-diff/blob/cb359af8d66a7e593a6a203eb4aeb5a0535bd179/web_monitoring_diff/html_render_diff.py#L555-L556

  Then `tokenize()` calls `parse_html()` on the string to make it an etree DOM: https://github.com/edgi-govdata-archiving/web-monitoring-diff/blob/cb359af8d66a7e593a6a203eb4aeb5a0535bd179/web_monitoring_diff/html_render_diff.py#L763-L768

  One complexity here is that we start by dealing with a beautifulsoup DOM, then the lower-level code is designed to deal with an etree DOM. The structure of these two is not the same, and that may introduce issues. etree leaves spaces mostly untouched, but beautifulsoup does some space cleanup. etree elements also have a `text` and a `tail`, while beautifulsoup deals with a mix of tags and strings (and other node types); the second sketch after this list illustrates the difference. This may introduce enough complexity that it should be tackled separately. Needs examination.
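For the tokenization item, this is roughly the shape of the merge I want to prototype. It’s a toy sketch with stand-in names (`Token` and the `*_example` functions are mine, not the module’s); the real `flatten_el`/`fixup_chunks` carry much more state:

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    # Stand-in for the real token classes in html_render_diff.py.
    text: str
    pre_tags: list = field(default_factory=list)
    post_tags: list = field(default_factory=list)

# Today (roughly): step 1 yields plain (token_type, extra_info) tuples...
def flatten_example(words):
    for word in words:
        yield ('word', word)

# ...and step 2 turns those tuples into token objects.
def fixup_example(chunks):
    return [Token(text=info) for kind, info in chunks if kind == 'word']

# Merged: one pass that builds token objects directly, skipping the
# per-node tuple allocation.
def tokenize_example(words):
    return [Token(text=word) for word in words]

assert fixup_example(flatten_example(['a', 'b'])) == tokenize_example(['a', 'b'])
```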
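And for the double-parsing item, a toy comparison of the two DOM models (not code from the module; it only assumes `lxml` and `bs4` are available, which they are as dependencies here):

```python
import lxml.html
from bs4 import BeautifulSoup

html = '<p>Hello <b>big</b> world</p>'

# lxml/etree: text hangs off elements as `text` and `tail` attributes.
p = lxml.html.fromstring(html)
b = p.find('b')
print(repr(p.text), repr(b.text), repr(b.tail))
# 'Hello ' 'big' ' world'

# beautifulsoup: the same content is a flat mix of tag and string nodes.
soup = BeautifulSoup(html, 'html.parser')
print([type(node).__name__ for node in soup.p.children])
# ['NavigableString', 'Tag', 'NavigableString']
```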