edgi-govdata-archiving / web-monitoring-diff

Tools for diffing and comparing web content. Also includes a web server that makes diffs available as an HTTP service.
https://web-monitoring-diff.readthedocs.io/
GNU General Public License v3.0

Light cleanup on html_render_diff.py #145

Open Mr0grog opened 1 year ago

Mr0grog commented 1 year ago

There’s a stupendous amount of cleanup work I’ve been meaning to circle back and do in html_render_diff.py for YEARS. This is a start.

I plan to keep this PR relatively narrowly focused on refactoring, removing/replacing vestigial code, and style fixes. I’m not going to make significant behavior changes here (e.g. potentially changing the “spacer” concept, which needs a major rethink). Changes like that need a lot more careful consideration and testing, and I need time to get my head back into this space in order to do that well.

Work in progress. Still a little more to do here, although I don’t want to bite off too much. I want to:

Mr0grog commented 1 year ago

Re: removing _limit_spacers(). This has a pretty big impact on DOMs with too many spacers, but not much of an impact otherwise. I was expecting the extra iteration, instance checking, etc. to be fairly expensive on large DOMs that don’t have too many spacers, but it actually isn’t (I guess those DOMs just aren’t large enough to matter in the first place?). BUT once a DOM produces too many spacers, the performance impact is extremely noticeable.

So, there’s some value in that change, but only in the most extreme cases.

On the other hand, this suggests that a future where we remove the spacers altogether is also a future with much better overall performance than I’d expected. (That said, the actual diffing still takes the majority of the running time.)