edgi-govdata-archiving / web-monitoring-diff

Tools for diffing and comparing web content. Also includes a web server that makes diffs available as an HTTP service.
https://web-monitoring-diff.readthedocs.io/
GNU General Public License v3.0

HTML Diff: Where possible, diff regions of the page independently #5

Open Mr0grog opened 6 years ago

Mr0grog commented 6 years ago

I thought I’d written this idea down somewhere before, but cannot find it.

It might be nice to have the HTML differ use some heuristics to identify major regions of a page (e.g. main content [distinguished by tag, class name, etc.], headers, footers) and, if it can find the same region on both sides of the diff, diff each of those regions (and the regions between them) independently. This could help ensure, for example, that diffs in menus don't bleed together with diffs of the body content.
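A rough sketch of that idea, using only the Python standard library. This is not how the differ in this repo works. The landmark tag list, the `"other"` bucket, and the function names are all assumptions made for illustration. It also lumps everything outside a landmark into a single bucket rather than diffing each in-between region separately, does not filter out `<script>` or `<style>` text, and diffs a region found on only one side against an empty list.

```python
import difflib
from html.parser import HTMLParser

# Hypothetical heuristic: treat these landmark tags as "major regions".
REGION_TAGS = {"header", "nav", "main", "footer"}


class RegionTextExtractor(HTMLParser):
    """Collect text chunks per landmark region; anything else goes in 'other'."""

    def __init__(self):
        super().__init__()
        self.regions = {}
        self._stack = []  # currently open landmark tags, innermost last

    def handle_starttag(self, tag, attrs):
        if tag in REGION_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Attribute text to the innermost open landmark, if any.
            name = self._stack[-1] if self._stack else "other"
            self.regions.setdefault(name, []).append(text)


def split_regions(html):
    """Return {region name: [text chunks]} for one version of a page."""
    parser = RegionTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.regions


def diff_by_region(old_html, new_html):
    """Diff each region on its own, so changes cannot bleed across regions."""
    old, new = split_regions(old_html), split_regions(new_html)
    result = {}
    for name in sorted(set(old) | set(new)):
        matcher = difflib.SequenceMatcher(
            a=old.get(name, []), b=new.get(name, []), autojunk=False
        )
        # Keep only the non-equal opcodes: (tag, i1, i2, j1, j2).
        result[name] = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    return result
```

Because regions are matched by name rather than by position, a menu that moves from after the body to before it only produces changes under `nav` (and only if its contents changed), never under `main`.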

Having these heuristics around would also undoubtedly be useful in auto-classifying changes (e.g. changes that only involved menus).
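As a minimal sketch of what that classification could look like: assume some earlier step has produced a mapping from region name to the list of changes found in that region. That input shape, the region names, and the labels below are all hypothetical, not an existing API in this project.

```python
# Hypothetical set of regions considered "menus/navigation".
MENU_REGIONS = frozenset({"header", "nav", "footer"})


def classify_change(region_diffs):
    """Label a change by which regions have any differences.

    `region_diffs` maps a region name to a list of changes in that
    region (an empty list means the region did not change).
    """
    changed = {name for name, ops in region_diffs.items() if ops}
    if not changed:
        return "unchanged"
    if changed <= MENU_REGIONS:
        return "navigation-only"
    if changed.isdisjoint(MENU_REGIONS):
        return "content-only"
    return "mixed"
```

For example, `classify_change({"nav": ["some change"], "main": []})` falls into the navigation-only bucket, which is the "changes only involved menus" case.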

Mr0grog commented 6 years ago

Side note: it’s also possible this would help clean up diffs like a recent change to EPA’s page layout, where menus moved from after the body content to before it. Depending on the particulars of the body content, we currently might identify either the whole body or the whole menu as having changed. I imagine splitting up the diffing this way would ensure that the menu is always shown as the part that changed.

Example: https://monitoring.envirodatagov.org/page/2a2cd62e-ded7-4ecc-9749-804ea3e06a0d/9bfd1b57-5872-467f-b556-34cee538493a..0ef9613e-d38d-4aa2-97e7-01f39acf6f17

Mr0grog commented 5 years ago

Definitely still an idea worth working on.

Mr0grog commented 4 years ago

This Microsoft research paper covers an interesting way of using layout information to visually segment a page: https://www.microsoft.com/en-us/research/publication/vips-a-vision-based-page-segmentation-algorithm/

Some things about it might be tough to incorporate easily, though:

Mr0grog commented 2 years ago

There is also some useful existing work based on real-world data from government sites at https://github.com/edgi-govdata-archiving/web-monitoring-task-sheets/blob/main/analyst_sheets/normalize.py

Not nearly as generic as the MS paper referenced earlier, though.