edgi-govdata-archiving / web-monitoring-diff

Tools for diffing and comparing web content. Also includes a web server that makes diffs available as an HTTP service.
https://web-monitoring-diff.readthedocs.io/
GNU General Public License v3.0

Simplified HTML Diff #11

Open Mr0grog opened 7 years ago

Mr0grog commented 7 years ago

I’ve talked with various people a few times about this idea, but wanted to make sure it actually got logged here so we don’t forget about it. It’s not high priority right now.

One of the reasons analysts use the “text only” diff is because:

The text diff makes this a little simpler because it strips all visual noise and surfaces all text on the page, regardless of CSS styling. However, it loses your sense of hierarchy and locational context on the page: changes stand out, but navigation and location-finding are hard. Text runs together.

What if we worked towards a middle ground here? Do an HTML-style diff, but do that diff on a “simplified/semantic” version of the page that strips most styling and scripting, but keeps intact semantic markup, like lists, headings, paragraphs, and so on, with some very minor standard styling. Think of this as diffing something like the readability view of a page instead of the page itself.
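A minimal sketch of what "simplifying" a page before diffing could look like, using only Python's standard library. The `KEEP_TAGS` allowlist, the `simplify_html` name, and the choice to keep only `href` on links are assumptions made for illustration; none of this is code from the project.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist of "semantic" tags worth keeping (an assumption).
KEEP_TAGS = {
    "h1", "h2", "h3", "h4", "h5", "h6", "p", "ul", "ol", "li", "a",
    "blockquote", "table", "tr", "th", "td",
    "header", "footer", "nav", "main",
}
# Elements whose contents are dropped entirely.
DROP_WITH_CONTENT = {"script", "style", "noscript"}


class Simplifier(HTMLParser):
    """Re-emits only allowlisted tags (without styling attributes) and text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in KEEP_TAGS:
            href = dict(attrs).get("href") if tag == "a" else None
            if href:
                # Keep link targets; drop every other attribute.
                self.out.append(f'<a href="{escape(href)}">')
            else:
                self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in KEEP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def simplify_html(html):
    """Return a stripped-down version of `html` suitable as diff input."""
    parser = Simplifier()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Both versions of a page would be passed through `simplify_html` before being handed to an HTML-style diff. Unlike a Readability-style extractor, nothing textual is discarded: unknown wrappers such as `<div>` and `<span>` are unwrapped, and their text is kept.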

danielballan commented 7 years ago

That's a great suggestion, and it's something that could be worked on independently by an interested contributor.

janakrajchadha commented 7 years ago

> Think of this as diffing something like the readability view of a page instead of the page itself.

I think it is interesting to note that Readability is no longer available, but its parsing API can still be used through Mercury (linked on Readability's site). We can also consider using the parsing results of other services like Pocket.

Mr0grog commented 7 years ago

> its parsing API can still be used through Mercury (link on readability's site). We can also consider using the parsing results of other services like Pocket.

That might be worth looking at, but I think we are unlikely to find a ready-made tool that focuses on the sort of things we care about here.

I’m not sure we want to throw out any textual/semantically meaningful content on the page (changes to menus and navigation are important to us, for example), while a central goal of Readability and Pocket is to throw away anything that doesn’t represent the primary article/body of the page. Mercury reader misses a lot of important stuff on this page, for example: https://energy.gov/oe/services/energy-assurance/emergency-preparedness/community-guidelines-energy-emergencies

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Mr0grog commented 5 years ago

Definitely still a relevant idea!

jsnshrmn commented 5 years ago

I was just playing with pandoc to see if there was a very low-effort implementation. I landed on a two-step conversion: html -> markdown_strict -> html

e.g. pandoc -t markdown_strict https://www.energy.gov/ceser/community-guidelines-energy-emergencies | pandoc -s -t html

The resulting html is bare bones enough that it seems like we could put together a mostly good enough stylesheet for it pretty quickly.

If this is in the territory of what you were thinking, I could throw together a proof of concept pretty quickly.

Mr0grog commented 5 years ago

Hmmm, that feels like it could sort of be a start, but I was also angling towards something where we could still try and semantically identify important chunks of the page (e.g. separate out the header, footer, and navigation from the main body). For the example you’ve got above, Pandoc strips out all the good indicators of most of that information. It happens to be a nicely structured page with <header>, <footer>, and <nav> tags, commonly used semantic class names, etc., which we lose in the Pandoc-simplified version.
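To make "semantically identify important chunks" concrete, here is a rough sketch that groups a page's text by its enclosing `<header>`, `<nav>`, `<main>`, or `<footer>` element using the standard library parser. The `split_regions` name and the "other" bucket are hypothetical, and the sketch relies only on those four tags, ignoring the class-name hints mentioned above.

```python
from html.parser import HTMLParser

# Landmark elements treated as page regions (an assumption for this sketch).
REGION_TAGS = {"header", "nav", "main", "footer"}


class RegionSplitter(HTMLParser):
    """Collects visible text, keyed by the innermost enclosing region tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []    # currently open region tags, innermost last
        self.regions = {}  # region name -> list of text chunks
        self.skip = 0      # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in REGION_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag in REGION_TAGS and tag in self.stack:
            # Pop up to and including the matching region tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self.skip:
            region = self.stack[-1] if self.stack else "other"
            self.regions.setdefault(region, []).append(text)


def split_regions(html):
    """Return a dict mapping region name to that region's text."""
    parser = RegionSplitter()
    parser.feed(html)
    parser.close()
    return {name: " ".join(chunks) for name, chunks in parser.regions.items()}
```

Each region could then be diffed separately, so a navigation change is reported as a navigation change instead of being mixed into the body text. Pages without these tags would fall entirely into the "other" bucket, so a real implementation would need further heuristics.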

The first question I’d think to ask is: does using Pandoc this way get us much farther than checking the “remove formatting” option in the UI? (It didn’t exist when we originally wrote this.) Then: does this get us a more useful result (or maybe just a much faster result) than running this through BeautifulSoup (like we do for several other diffs) and stripping, say, <script>, <style>, and <link> tags?

If yes, maybe this is a good place to start, even if it’s not ultimately where we want to go with this diff. If no, maybe better to hold off until someone really has the time to devote to it in a more involved way.
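For comparison, the BeautifulSoup baseline mentioned above could be sketched roughly as follows. The `strip_non_content` name is hypothetical, and this is not the project's actual code; it only removes `<script>`, `<style>`, and `<link>` elements and leaves everything else, including `<header>`, `<footer>`, and `<nav>`, in place.

```python
from bs4 import BeautifulSoup


def strip_non_content(html):
    """Remove <script>, <style>, and <link> elements; keep all other markup."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(["script", "style", "link"]):
        tag.decompose()  # delete the element together with its contents
    return str(soup)
```

Running the same page through this function and through the Pandoc round trip, then comparing the two outputs, would be one way to answer the questions above.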