edgi-govdata-archiving / web-monitoring-diff

Tools for diffing and comparing web content. Also includes a web server that makes diffs available as an HTTP service.
https://web-monitoring-diff.readthedocs.io/
GNU General Public License v3.0

HTML Diff: Re-examine minimum diff length and "spacer" technique #13

Open Mr0grog opened 6 years ago

Mr0grog commented 6 years ago

In the HTML diff, we have a minimum diff length of 2 tokens (inherited from LXML’s differ) and we also use this crazy-nuts “spacer” technique to try and break up over-eager runs of changes between major elements on the page.
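To make the "minimum diff length" idea concrete, here is a minimal sketch using Python's standard `difflib.SequenceMatcher`. It is not the project's or LXML's actual code; the function name `kept_matches` and the `min_length` parameter are hypothetical, and the token lists are made up for illustration. The idea is simply that matching runs shorter than the minimum are discarded, so the tokens in them get reported as changed.

```python
from difflib import SequenceMatcher


def kept_matches(old_tokens, new_tokens, min_length=2):
    """Return (old_index, new_index, size) for each matching run that is
    at least `min_length` tokens long. Shorter runs are dropped, so their
    tokens end up inside the surrounding run of changes.

    `kept_matches` and `min_length` are hypothetical names for this sketch.
    """
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    return [
        (block.a, block.b, block.size)
        for block in matcher.get_matching_blocks()
        # The final sentinel block has size 0, so it is filtered out here too.
        if block.size >= min_length
    ]


old = ["a", "b", "x", "c"]
new = ["a", "b", "y", "c"]

print(kept_matches(old, new, min_length=1))  # [(0, 0, 2), (3, 3, 1)]
print(kept_matches(old, new, min_length=2))  # [(0, 0, 2)]
```

With a minimum of 2, the lone unchanged `"c"` no longer counts as a match, which is the kind of effect that could make unchanged content look changed.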

Jake W recently pointed out this confusing change where menu items were getting highlighted even though nothing appears to have changed about them:

https://monitoring.envirodatagov.org/page/a52082c5-35c4-49c5-8ae3-d7ee48cded10/5d881a1a-9bfb-4da9-aaf2-be48c7b3a791..9e9f7171-dda6-410e-a0b7-1dc55116c023

[Screenshot: screen shot 2018-05-17 at 9 52 26 am]

But without styling, you can see that this is because hidden markup in those items was removed:

[Screenshot: more-fun-with-menus]

However, the spacer technique should be solving that (<li> tags are ones that we put spacers around). Not sure whether this is an example of the spacers not working correctly or if they’re being beaten out by the minimum length or something else entirely.
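A rough sketch of how spacers and a minimum match length might interact is below. It assumes that spacers are identical filler tokens inserted after boundary tags in both documents, so that a boundary becomes a match long enough to survive the minimum. That is an assumption for illustration, not a description of the real implementation; `SPACER`, `BOUNDARY_TAGS`, `add_spacers`, and `kept_matches` are all hypothetical names.

```python
from difflib import SequenceMatcher

SPACER = "<spacer>"                # hypothetical filler token
BOUNDARY_TAGS = {"<li>", "</li>"}  # hypothetical set of "major" tags


def add_spacers(tokens):
    """Insert one spacer token after every boundary tag (assumed behavior)."""
    spaced = []
    for token in tokens:
        spaced.append(token)
        if token in BOUNDARY_TAGS:
            spaced.append(SPACER)
    return spaced


def kept_matches(old_tokens, new_tokens, min_length=2):
    """Matching runs of at least `min_length` tokens, as (old, new, size)."""
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    return [
        (block.a, block.b, block.size)
        for block in matcher.get_matching_blocks()
        if block.size >= min_length
    ]


old = ["<li>", "A", "</li>", "<li>", "B", "</li>"]
new = ["<li>", "X", "</li>", "<li>", "Y", "</li>"]

# Without spacers, only the "</li><li>" pair in the middle is long enough.
print(kept_matches(old, new))  # [(2, 2, 2)]

# With spacers, the opening and closing tags also survive as matches.
print(kept_matches(add_spacers(old), add_spacers(new)))
# [(0, 0, 2), (3, 3, 4), (8, 8, 2)]
```

In this toy case the spacers keep the first `<li>` and last `</li>` anchored as unchanged, which confines each change to a single item. If the real differ behaves differently for the menu example above, comparing its matching runs with and without spacers in this way might help show which mechanism is responsible.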

Mr0grog commented 6 years ago

Whatever we do here should also take a hard look at edgi-govdata-archiving/web-monitoring-processing#242, where I did a rough fix to limit the number of spacer tokens we can add to a document before diffing.
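For reference, one simple way to cap spacers is a fixed budget: stop inserting once a maximum count is reached. This is only a sketch of the general idea and makes no claim about how the fix in #242 actually chooses which spacers to keep; `add_spacers_capped`, `max_spacers`, `SPACER`, and `BOUNDARY_TAGS` are hypothetical names.

```python
SPACER = "<spacer>"                # hypothetical filler token
BOUNDARY_TAGS = {"<li>", "</li>"}  # hypothetical set of "major" tags


def add_spacers_capped(tokens, max_spacers):
    """Insert a spacer after each boundary tag until `max_spacers` have
    been added; later boundary tags are left without a spacer."""
    spaced = []
    added = 0
    for token in tokens:
        spaced.append(token)
        if token in BOUNDARY_TAGS and added < max_spacers:
            spaced.append(SPACER)
            added += 1
    return spaced


tokens = ["<li>", "A", "</li>", "<li>", "B", "</li>"]
print(add_spacers_capped(tokens, max_spacers=2))
# ['<li>', '<spacer>', 'A', '</li>', '<spacer>', '<li>', 'B', '</li>']
```

A budget like this bounds how much the token list can grow, at the cost of treating early and late parts of the document differently, which is worth keeping in mind when re-examining the technique.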