DocNow / diffengine

track changes to the news, where news is anything with an RSS feed
MIT License
177 stars 30 forks source link

changes in whitespace count as diffs #14

Closed edsu closed 7 years ago

edsu commented 7 years ago

It looks like changes in whitespace in the readability text are showing up as diffs. Here's an example

screen shot 2017-01-18 at 11 06 29 am

Rather than doing a simple equality check perhaps there whitespace should be stripped somehow? Or we could calculate a diff each time?

ruebot commented 7 years ago

I've seen the same on a couple of my accounts.

...preliminary research into difflib is saying that it doesn't support ignoring whitespace. Does 'htmldiff` support anything like that? If not, maybe something like: if len >1 and contains only a regex of whitespace chars, then ignore?

ryanfb commented 7 years ago

Could we do this alongside the string normalization/regularization here? https://github.com/DocNow/diffengine/blob/master/diffengine/__init__.py#L149-L153

Also noticed it in the context of these tweets, not sure if wapo_diff is running a version that pre-dates 402cb06af0a4c29a27b62578edfa8f83527abba5 or this is still a bug:

https://twitter.com/wapo_diff/status/821793354052726784 https://twitter.com/wapo_diff/status/821793293474349056 https://twitter.com/wapo_diff/status/821793267285102592

edsu commented 7 years ago

It's funny I was just looking at the same thing and scratching my head about why those were getting tweeted. The code is up to date, but perhaps not against what's in the db... I will investigate.

edsu commented 7 years ago

I think it's time to move that text normalization into a function, and use it to diff against both what comes from the db (for old content that was cleaned differently) and current content.

edsu commented 7 years ago

@ryanfb you were right, due to an environment mixup I was running an older version. I had accidentally installed diffengine into a global pyenv environment that was masking the globally installed version. D'oh!

This explains the issue with the apostrophes you noticed. But the initial whitespace problem still needs to be addressed.

edsu commented 7 years ago

I definitely think normalization should happen on both sides: including what was previously stored in the db. I just accidentally launched a twitter storm of diffs when I updated the apostrophe handling code!