edgi-govdata-archiving / web-monitoring-diff

Tools for diffing and comparing web content. Also includes a web server that makes diffs available as an HTTP service.
https://web-monitoring-diff.readthedocs.io/
GNU General Public License v3.0

HTML diff should tokenize on some punctuation #6

Open Mr0grog opened 5 years ago

Mr0grog commented 5 years ago

This FTP diffing problem made me realize we should probably be splitting tokens in the HTML diff on periods (and maybe other punctuation?), not just on whitespace:

[Screenshot taken 2018-11-21 at 9:04:50 AM]

(Of course we don’t really want to use this differ on FTP listings, but that’s a different matter.)

This requires some care, though — we probably want to treat the periods as tokens themselves (in case they change), unlike whitespace. We’ve also talked about this before in terms of general punctuation handling — it would be really useful not only to split this way, but to tag and count punctuation changes separately from other changes. We might not prioritize a punctuation change for analysts to look at like we do a word change, and it would be nice to call out clearly that a change was merely in punctuation.
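As a rough illustration of the idea above (not the project's actual differ), here is a minimal sketch that splits text so each punctuation mark becomes its own token while whitespace is discarded, and then counts punctuation changes separately from word changes. The names `tokenize`, `is_punctuation`, and `count_changes` are hypothetical, and the sketch works on plain text with the standard library's `difflib` rather than on HTML.

```python
import re
from difflib import SequenceMatcher

# Hypothetical pattern: a run of word characters is one token, and each
# single non-word, non-space character (punctuation) is its own token.
# Whitespace matches neither alternative, so it never becomes a token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def is_punctuation(token):
    """Return True if the token is a single punctuation character."""
    return re.fullmatch(r"[^\w\s]", token) is not None


def count_changes(old_text, new_text):
    """Count changed tokens, tallying punctuation and words separately."""
    old, new = tokenize(old_text), tokenize(new_text)
    counts = {"punctuation": 0, "word": 0}
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Every removed or added token counts as one change.
        for token in old[i1:i2] + new[j1:j2]:
            kind = "punctuation" if is_punctuation(token) else "word"
            counts[kind] += 1
    return counts
```

With this sketch, `tokenize("ftp.example.gov")` yields `['ftp', '.', 'example', '.', 'gov']`, so a change to one segment of a dotted name no longer marks the whole string as changed. A caller could use the separate tallies to rank a punctuation-only change below a word change.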

There are also punctuation changes we might want to treat specially and even suppress in many cases. For example, changing an apostrophe to a prime (') is a change we’ve seen before, and not one we generally care about.
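One possible way to suppress that kind of change is to normalize apostrophe-like characters before comparing. The sketch below is only an assumption about how this could work: the `APOSTROPHE_LIKE` table and both function names are hypothetical, and which characters count as equivalent is a judgment call the project would need to make.

```python
# Hypothetical equivalence table: characters we choose to treat as the
# same mark as a plain ASCII apostrophe for comparison purposes.
APOSTROPHE_LIKE = {
    "\u2019": "'",  # right single quotation mark (typographic apostrophe)
    "\u2018": "'",  # left single quotation mark
    "\u2032": "'",  # prime
    "\u02bc": "'",  # modifier letter apostrophe
}
_TRANSLATION = str.maketrans(APOSTROPHE_LIKE)


def normalize_apostrophes(text):
    """Map every apostrophe-like character to a plain ASCII apostrophe."""
    return text.translate(_TRANSLATION)


def is_ignorable_change(old, new):
    """Return True if old and new differ only in apostrophe-like characters."""
    return old != new and normalize_apostrophes(old) == normalize_apostrophes(new)
```

A differ could call `is_ignorable_change` on each changed pair and either hide the change or tag it as low priority, rather than dropping it silently.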

Mr0grog commented 5 years ago

See also edgi-govdata-archiving/web-monitoring-processing#175

Frijol commented 5 years ago

Would this be a good good-first-issue label candidate?

Mr0grog commented 5 years ago

I wish it were, but the HTML diff is an incredibly horrifying mess, and nobody should try to screw with it unless they are ready for a lot of setbacks and a lot of WTFs. That is why it is not already marked with “help wanted.”